Stew Campbell, Partner at The Chernin Group

In Part 2, Stew Campbell returns to share tactical guidance for founders evaluating outside capital. We dive deep into how to run a founder-led investor process, what to watch for in term sheets, and how to build long-term wealth while scaling a founder-led business. Stew breaks down growth equity vs. private equity, investor diligence, and how to choose a partner who accelerates—not limits—your next chapter. This episode is a must-listen for any operator planning a recap, acquisition, or capital raise in the next 1–3 years.

Things You'll Learn:
* How to run a founder-led competitive investor process
* What to ask when evaluating potential investors and term sheets
* How to align capital strategy with long-term wealth goals
* Ways great investors create real value beyond the check

This episode is sponsored by DealRoom! Turn your chaos into control. Tired of chasing updates across spreadsheets and email threads? Discover how DealRoom helps corporate development teams bring order to M&A.
Drs. Chris Conrady and Akbar Shakoor join host Dr. Ben Young to teach us about why intraocular inflammation (IOI) sometimes occurs after intravitreal injections, how to differentiate these cases from endophthalmitis, and how to manage this potentially blinding condition. For all episodes or to claim CME credit for selected episodes, visit www.aao.org/podcasts.
Turn your videos into live streams with https://restream.io

I sit down with Robin Seyr, host of a well-known Bitcoin podcast, to explore how Bitcoin is reshaping our world. We discuss its impact on the future of money, personal freedom, and the global economy. Whether you're a curious newcomer or a seasoned Bitcoiner, this conversation offers fresh insights into why Bitcoin matters more than ever.

---
Follow Robin's podcast: https://www.youtube.com/c/robinseyr
Follow Robin on X: https://x.com/RobinSeyr
---
Follow Momo: https://linktr.ee/mktahmasbi
In this episode of Leader Chat, Jeff continues his conversation with Jay McTighe on modern learning strategies. Jay delves deep into the importance of mission and vision, the IOI framework, and backward design in education. He discusses engaging students authentically and shares pragmatic advice for educational leaders to drive impactful change. Tune in for thoughtful insights and practical recommendations to advance student learning outcomes in today's dynamic educational landscape.
Guest: Neil Chue Hong

Panelists: Richard Littauer | Justin Dorfman

Show Notes
In this episode of Sustain, hosts Richard Littauer and Justin Dorfman talk with Neil Chue Hong, Director of the Software Sustainability Institute (SSI). They discuss the SSI's mission to sustain software used in research, the institute's history and funding, the role of research software engineers, and the newly launched Research Software Maintenance Fund (RSMF) with £4.8 million dedicated to supporting research software. Neil shares insights into the collaboration, training initiatives, and policy work done by the SSI to promote sustainability in software development. The episode also touches on the impact of large funding initiatives like those from the Chan Zuckerberg Initiative and the evolving role of software development in the age of large language models (LLMs). Hit the download button now!

[00:01:44] Neil explains SSI's mission and purpose.
[00:02:27] Richard inquires about SSI's funding model and how long SSI has existed. Neil explains SSI is a government-funded collaboration via UK Research and Innovation (UKRI); it was founded in 2010 and is funded through 2028.
[00:05:03] Richard highlights SSI's impact, and Neil discusses how SSI helped establish "Research Software Engineer (RSE)" as a recognized role.
[00:08:20] SSI's annual Collaborations Workshop (May 13-15 in Stirling, UK) is mentioned, and Neil recalls a pivotal collaboration with Greg Wilson (Software Carpentry), which expanded training programs.
[00:11:16] Neil explains that the SSI has evolved from consultancy to training, community initiatives, and policy advocacy to scale its impact and ensure long-term sustainability in research software.
[00:13:57] Richard introduces SSI's new £4.8M Research Software Maintenance Fund (RSMF). Neil explains it supports maintaining existing research software and is funded by the UK's Digital Research Infrastructure Programme (UKRI).
[00:16:54] A question comes up about the geopolitical impact of this funding, and Neil states the UK is maintaining leadership in research software sustainability, not just focusing on national capability.
[00:20:54] Neil defines the research software products targeted by the RSMF as software used beyond its original development team.
[00:22:54] Richard asks if £4.8M is a significant investment, and Neil explains this is comparable to past UK research software grants.
[00:25:10] Neil acknowledges the Chan Zuckerberg Initiative (CZI) for improving funding models for research software.
[00:29:45] Justin asks how LLMs are changing research software engineering. Neil compares LLMs' impact on software development to smartphones revolutionizing photography.
[00:34:05] Find out where you can connect with UKRI, SSI, and with Neil on the web.

Quotes
[00:02:07] "We've got this motto: Better Software, Better Research."
[00:29:03] "You can define what is clearly sci-fi, you can define what is clearly research software, but making an arbitrary cut-off point is really hard."

Spotlight
[00:35:13] Justin's spotlight is ghostty.
[00:35:40] Richard's spotlight is Olympus Tough cameras.
[00:36:34] Neil's spotlight is The Carpentries and Cinema For All.
Links SustainOSS (https://sustainoss.org/) podcast@sustainoss.org (mailto:podcast@sustainoss.org) richard@sustainoss.org (mailto:richard@sustainoss.org) SustainOSS Discourse (https://discourse.sustainoss.org/) SustainOSS Mastodon (https://mastodon.social/tags/sustainoss) Open Collective-SustainOSS (Contribute) (https://opencollective.com/sustainoss) Richard Littauer Socials (https://www.burntfen.com/2023-05-30/socials) Justin Dorfman X (https://twitter.com/jdorfman?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor) Neil Chue Hong LinkedIn (https://www.linkedin.com/in/neilchuehong/) Software Sustainability Institute (SSI) (https://www.software.ac.uk/) Save the date for Collaborations Workshop 2025 (CW25)-SSI (https://www.software.ac.uk/news/save-date-collaborations-workshop-2025-cw25) UKRI awards the Software Sustainability Institute £4.8m to strengthen research software maintenance in the UK (SSI) (https://www.software.ac.uk/news/ukri-awards-software-sustainability-institute-ps48m-strengthen-research-software-maintenance) Digital Research Infrastructure Programme (UKRI) (https://www.ukri.org/what-we-do/creating-world-class-research-and-innovation-infrastructure/digital-research-infrastructure/) Sustain Podcast- Episode 43: Investing in Open Infrastructure with Kaitlin Thaney (https://podcast.sustainoss.org/guests/kaitlin-thaney) Sustain Podcast- Episode 230: Kari L. Jordan on The Carpentries (https://podcast.sustainoss.org/guests/kari-jordan) Sustain Podcast- Episode 235: The State of Open Infrastructure 2024, from IOI with Emmy Tsang (https://podcast.sustainoss.org/guests/emmy-tsang) Open Source in Academia Map (https://sustainoss.org/academic-map/) ghostty (https://ghostty.org/) Olympus Tough camera (https://explore.omsystem.com/us/en/tough) The Carpentries (https://carpentries.org/) Cinema For All (https://cinemaforall.org.uk/) Credits Produced by Richard Littauer (https://www.burntfen.com/) Edited by Paul M. Bahr at Peachtree Sound (https://www.peachtreesound.com/) Show notes by DeAnn Bahr Peachtree Sound (https://www.peachtreesound.com/) Special Guest: Neil Chue Hong.
In this solo episode, Dr. Jo Braid delves into the crucial and often overlooked relationship with oneself, sharing insights from her personal and professional coaching journey toward self-acceptance and compassion. She introduces practical strategies like the Compassionate Witness Exercise and the 5-Minute Self-Connection Practice, aimed at transforming inner dialogue and nurturing self-care. Tune in to discover how strengthening your self-relationship can profoundly impact every other relationship in your life and set the foundation for burnout recovery.

Resources:
https://drjobraid.com
12 Hacks for Christmas: https://www.drjobraid.com/12-xmas-hacks-optin
Get started here: https://tidycal.com/drjobraid/work-with-me
Wheel of Life: www.wheeloflife.io

I acknowledge that I create this podcast on the traditional lands of the Wiradjuri people, who have been the custodians of this land around Orange, New South Wales, for thousands of generations. I pay my respects to Wiradjuri Elders past, present, and emerging, and recognize the continuing connection to land, waters, and culture. This acknowledgment is a small but important step in recognizing the sovereignty of First Nations peoples and the deep historical and ongoing relationship with Country.

Disclaimer: The information provided on or through our Site, products and/or services is intended to be for informational purposes only. It does not constitute or replace professional advice for individual or specific situations, nor does it take into account your specific needs or circumstances. Under no circumstances should the content made available on our Site, or regarding our products and/or services, be relied upon as professional legal, medical, financial, business or other advice. You agree to obtain these services if you need them. Our Site may have articles and content that is of a general nature and is intended to be for informational purposes only. Your access to and use of the Site is subject to our Privacy Policy and Terms of Use. See omnystudio.com/listener for privacy information.
The full schedule for Latent Space LIVE! at NeurIPS has been announced, featuring Best of 2024 overview talks for the AI Startup Landscape, Computer Vision, Open Models, Transformers Killers, Synthetic Data, Agents, and Scaling, and speakers from Sarah Guo of Conviction, Roboflow, AI2/Meta, Recursal/Together, HuggingFace, OpenHands and SemiAnalysis. Join us for the IRL event/Livestream! Alessio will also be holding a meetup at AWS Re:Invent in Las Vegas this Wednesday. See our new Events page for dates of AI Engineer Summit, Singapore, and World's Fair in 2025. LAST CALL for questions for our big 2024 recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!

When we first observed that GPT Wrappers are Good, Actually, we did not even have Bolt on our radar. Since we recorded our Anthropic episode discussing building Agents with the new Claude 3.5 Sonnet, Bolt.new (by StackBlitz) has easily cleared the $8m ARR bar, repeating and accelerating its initial $4m feat.

There are very many AI code generators and VS Code forks out there, but Bolt probably broke through initially because of its incredible zero-shot, low-effort app generation. But as we explain in the pod, Bolt also emphasized deploy (Netlify) / backend (Supabase) / fullstack capabilities on top of StackBlitz's existing WebContainer full-WASM-powered developer-environment-in-the-browser tech. Since then, the team has been shipping like mad (with weekly office hours), with bugfixing, full screen, multi-device, long context, and diff-based edits (using speculative decoding like we covered in Inference, Fast and Slow).

All of this has captured the imagination of low/no-code builders like Greg Isenberg and many others on YouTube/TikTok/Reddit/X/LinkedIn etc. Just as with Fireworks, our relationship with Bolt/StackBlitz goes a bit deeper than normal - swyx advised the launch and got a front row seat to this epic journey, as well as demoed it with Realtime Voice at the recent OpenAI Dev Day. So we are very proud to be the first/closest to tell the full open story of Bolt/StackBlitz!

Flow Engineering + Qodo/AlphaCodium Update

In year 2 of the pod we have been on a roll getting former guests to return as guest cohosts (Harrison Chase, Aman Sanger, Jon Frankle), and it was a pleasure to catch Itamar Friedman back on the pod, giving us an update on all things Qodo and Testing Agents from our last catchup a year and a half ago.

Qodo (they renamed in September) went viral in early January this year with AlphaCodium (paper here, code here) beating DeepMind's AlphaCode with high efficiency, using a simple problem-solving code agent (a minimal sketch of the loop follows the list):

* The first step is to have the model reason about the problem. They describe it using bullet points and focus on the goal, inputs, outputs, rules, constraints, and any other relevant details.
* Then, they make the model reason about the public tests and come up with an explanation of why the input leads to that particular output.
* The model generates two to three potential solutions in text and ranks them in terms of correctness, simplicity, and robustness.
* Then, it generates more diverse tests for the problem, covering cases not part of the original public tests.
* Iteratively, pick a solution, generate the code, and run it on a few test cases.
* If the tests fail, improve the code and repeat the process until the code passes every test.
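Here is a minimal sketch of that loop in Python. It is an illustration of the flow described above, not the actual AlphaCodium implementation (which is open source and linked above): the prompts are paraphrased, the helper names are made up for this example, and `llm` stands in for whatever completion client you use.

```python
# Minimal sketch of an AlphaCodium-style "flow engineering" loop (illustrative only).
# `llm` is any callable mapping a prompt string to a completion string.
import subprocess
import sys
import tempfile
from typing import Callable, List, Tuple

Test = Tuple[str, str]  # (stdin, expected_stdout)

def run_tests(code: str, tests: List[Test]) -> List[str]:
    """Run a candidate program against stdin/stdout pairs and return failure reports."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    failures = []
    for stdin, expected in tests:
        result = subprocess.run([sys.executable, path], input=stdin,
                                capture_output=True, text=True, timeout=10)
        if result.stdout.strip() != expected.strip():
            failures.append(f"input={stdin!r} expected={expected!r} "
                            f"got={result.stdout!r} stderr={result.stderr!r}")
    return failures

def solve(problem: str, public_tests: List[Test],
          llm: Callable[[str], str], max_iters: int = 5) -> str:
    # 1. Reason about the problem: goal, inputs, outputs, rules, constraints.
    reflection = llm(f"Restate this problem as bullet points (goal, inputs, outputs, "
                     f"rules, constraints):\n{problem}")
    # 2. Explain why each public test input produces its output.
    test_notes = llm(f"{reflection}\n\nExplain why each input yields its output:\n{public_tests}")
    # 3. Propose two or three candidate solutions and rank them.
    plans = llm(f"{reflection}\n{test_notes}\n\nPropose 2-3 solutions and rank them by "
                f"correctness, simplicity, and robustness.")
    # 4. Generate extra, more diverse tests (parsing them is omitted in this sketch).
    _extra_tests = llm(f"{problem}\n\nWrite additional edge-case tests as (input, output) pairs.")
    # 5. Pick the top plan, generate code, then run and repair until the tests pass.
    code = llm(f"Implement the top-ranked plan as a Python program that reads stdin "
               f"and prints the answer:\n{plans}")
    for _ in range(max_iters):
        failures = run_tests(code, public_tests)
        if not failures:
            break
        code = llm(f"This program fails the tests below. Fix it and return only code.\n"
                   f"Failures:\n{failures}\n\nProgram:\n{code}")
    return code
```

The point is the structure rather than any one prompt: reflection, test reasoning, ranked plans, extra tests, then an execute-and-repair loop, which is why the approach transfers across models.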
swyx has previously written similar thoughts on types vs tests for putting bounds on program behavior, but AlphaCodium extends this to AI-generated tests and code.

More recently, Itamar has also shown that AlphaCodium's techniques extend well to the o1 models, making Flow Engineering a useful technique to improve code model performance on every model. This is something we see AI Engineers uniquely well positioned to do compared to ML Engineers/Researchers.

Full Video Podcast

Like and subscribe!

Show Notes
* Itamar
* Qodo
* First episode
* Eric
* Bolt
* StackBlitz
* Thinkster
* AlphaCodium
* WebContainers

Chapters
* 00:00:00 Introductions & Updates
* 00:06:01 Generic vs. Specific AI Agents
* 00:07:40 Maintaining vs Creating with AI
* 00:17:46 Human vs Agent Computer Interfaces
* 00:20:15 Why Docker doesn't work for Bolt
* 00:24:23 Creating Testing and Code Review Loops
* 00:28:07 Bolt's Task Breakdown Flow
* 00:31:04 AI in Complex Enterprise Environments
* 00:41:43 AlphaCodium
* 00:44:39 Strategies for Breaking Down Complex Tasks
* 00:45:22 Building in Open Source
* 00:50:35 Choosing a product as a founder
* 00:59:03 Reflections on Bolt Success
* 01:06:07 Building a B2C GTM
* 01:18:11 AI Capabilities and Pricing Tiers
* 01:20:28 What makes Bolt unique
* 01:23:07 Future Growth and Product Development
* 01:29:06 Competitive Landscape in AI Engineering
* 01:30:01 Advice to Founders and Embracing AI
* 01:32:20 Having a baby and completing an Iron Man

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.

Swyx [00:00:12]: Hey, and today we're still in our sort of makeshift in-between studio, but we're very delighted to have a former returning guest host, Itamar. Welcome back.

Itamar [00:00:21]: Great to be here after a year or more. Yeah, a year and a half.

Swyx [00:00:24]: You're one of our earliest guests on Agents. Now you're CEO co-founder of Qodo. Right. Which has just been renamed. You also raised a $40 million Series A, and we can get caught up on everything, but we're also delighted to have our new guest, Eric. Welcome.

Eric [00:00:42]: Thank you. Excited to be here. Should I say Bolt or StackBlitz?

Swyx [00:00:45]: Like, is it like its own company now or?

Eric [00:00:47]: Yeah. Bolt's definitely bolt.new. That's the thing that we're probably the most known for, I imagine, at this point.

Swyx [00:00:54]: Which is ridiculous to say because you were working at StackBlitz for so long.

Eric [00:00:57]: Yeah. I mean, within a week, we were doing like double the amount of traffic. And StackBlitz had been online for seven years, and we were like, what? But anyways, yeah. So we're StackBlitz, the company behind bolt.new. If you've heard of bolt.new, that's our stuff. Yeah.

Swyx [00:01:12]: Yeah.

Itamar [00:01:13]: Excellent. I see, by the way, that the founder mode, you need to know to capture opportunities. So kudos on doing that, right? You're working on some technology, and then suddenly you can exploit that to a new world. Yeah.

Eric [00:01:24]: Totally. And I think, well, not to jump, but 100%, I mean, a couple of months ago, we had the idea for Bolt earlier this year, but we haven't really shared this too much publicly.
But we actually had tried to build it with some of those state-of-the-art models back in January, February, you can kind of imagine which, and they just weren't good enough to actually do the code generation where the code was accurate and it was fast and whatever have you without a ton of like RAG, but then there were like issues with that. So we put it on the shelf and then we got kind of a sneak peek of some of the new models that have come out in the past couple of months now. And so once we saw that, once we actually saw the code gen from it, we were like, oh my God, like, okay, we can build a product around this. And so that was really the impetus of us building the thing. But with that, it was StackBlitz, the core StackBlitz product the past seven years has been an IDE for developers. So the entire user experience flow we've built up just didn't make sense. And so when we kind of went out to build Bolt, we just thought, you know, if we were inventing our product today, what would the interface look like given what is now possible with the AI code gen? And so there's definitely a lot of conversations we had internally, but you know, just kind of when we logically laid it out, we were like, yeah, I think it makes sense to just greenfield a new thing and let's see what happens. If it works great, then we'll figure it out. If it doesn't work great, then it'll get deleted at some point. So that's kind of how it actually came to be.

Swyx [00:02:49]: I'll mention your background a little bit. You were also founder of Thinkster before you started StackBlitz. So both of you are second-time founders. Both of you have sort of re-founded your company recently. Yours was more of a rename. I think a slightly different direction as well. And then we can talk about both. Maybe just chronologically, should we get caught up on where Qodo is first and then, you know, just like what people should know since the last pod? Sure.

Itamar [00:03:12]: The last pod was two months after we launched and we basically had the vision that we talked about. The idea that software development is about specification, test and code, etc. We are more on the testing part as in essence, we think that if you solve testing, you solve software development. The beautiful chart that we'll put up on screen. And testing is a really big field, like there are many dimensions, unit testing, the level of the component, how big it is, how large it is. And then there are different types of testing, is it regression or smoke or whatever. So back then we only had like one IDE extension with unit tests as the focus. A year and a half later, the first IDE extension supports more types of testing and is context-aware. We index local repos, but also tens of thousands of repos for Fortune 500 companies. We have another agent, another tool: the PR-Agent is the open source one and the commercial one is Qodo Merge. And then we have another open source called Cover-Agent, which is not yet a commercial product, coming very soon. It's very impressive. It could be that people are already approving automated pull requests that they aren't even aware of in really big open source projects. So once we have enough of these, we will also launch another agent. So for the first year and a half, what we did is grow our offering, mostly on the side of, does this code actually work: testing, code review, et cetera. And we believe that's the critical milestone that needs to be achieved to actually have the AI engineer for enterprise software.
And then like the first year was everything bottom-up, getting to 1 million installations. That was 2023; 2024 was starting to monetize, to feel how it is to make the first buck. So we did the teams offering, it went well with a thousand teams, et cetera. And then we started like just a few months ago to do enterprise with everything you need, which is a lot of the things discussed in the last post that was just released by Qodo. So that's how we call it at Qodo. Just opening the brackets, our company name was Codium AI, and we renamed to Qodo and we call our models Codelm. So back to my point, so we started the enterprise motion and already have multiple Fortune 100 companies. And then with that, we raised a Series A of $40 million. And what's exciting about it is that it enables us to develop more agents. That's our focus. I think it's very different. We're not coming very soon with an IDE or something like that.

Swyx [00:06:01]: You don't want to fork VS Code?

Itamar [00:06:03]: Maybe we'll fork JetBrains or something just to be different.

Swyx [00:06:08]: I noticed that, you know, I think the promise of general purpose agents has kind of died. Like everyone is doing kind of what you're doing. There's Qodo Gen, Qodo Merge, and then there's a third one. What's the name of it?

Itamar [00:06:17]: Yeah. Qodo Cover. Cover. Which is like a commercial version of Cover-Agent. It's coming soon.

Swyx [00:06:23]: Yeah. It's very similar with Factory AI, also doing like droids. They all have a special purpose doing things, but people don't really want general purpose agents. Right. The last time you were here, we talked about AutoGPT, the biggest thing of 2023. This year, not really relevant anymore. And I think it's mostly just because when you give me a general purpose agent, I don't know what to do with it.

Eric [00:06:42]: Yeah.

Itamar [00:06:43]: I totally agree with that. We've been seeing it for a while and I think it will stay like that despite the computer use, et cetera, that supposedly can just replace us. You can just like prompt it to be, hey, now be a QA or be a QA person or a developer. I still think that there's a few reasons why you see like a dedicated agent. Again, I'm a bit more focused, like my head is more on complex software for big teams and enterprise, et cetera. And even think about permissions and what are the data sources, and just the same way you manage permissions for users. Developers, you probably want to have dedicated guardrails and dedicated approvals for agents. I intentionally like touched a point that not many people think about. And of course, then what you can think of, like maybe there's different tools, tool use, et cetera. But just the first point by itself is a good reason why you want to have different agents.

Alessio [00:07:40]: Just to compare that with Bolt.new, you're almost focused on like the application is very complex and now you need better tools to kind of manage it and build on top of it. On Bolt.new, it's almost like I was using it the other day. There's basically like, hey, look, I'm just trying to get started. You know, I'm not very opinionated on like how you're going to implement this. Like this is what I want to do. And you build a beautiful app with it.
What people ask as the next step, you know, going back to like the general versus like specific, have you had people say, hey, you know, this is great to start, but then I want a specific Bot.new dot whatever else to do a more vertical integration and kind of like development or what's the, what do people say?Eric [00:08:18]: Yeah. I think, I think you kind of hit the, hit it head on, which is, you know, kind of the way that we've, we've kind of talked about internally is it's like people are using Bolt to go from like 0.0 to 1.0, like that's like kind of the biggest unlock that Bolt has versus most other things out there. I mean, I think that's kind of what's, what's very unique about Bolt. I think the, you know, the working on like existing enterprise applications is, I mean, it's crazy important because, you know, there's a, you look, when you look at the fortune 500, I mean, these code bases, some of these have been around for 20, 30 plus years. And so it's important to be going from, you know, 101.3 to 101.4, et cetera. I think for us, so what's been actually pretty interesting is we see there's kind of two different users for us that are coming in and it's very distinct. It's like people that are developers already. And then there's people that have never really written software and more if they have, it's been very, very minimal. And so in the first camp, what these developers are doing, like to go from zero to one, they're coming to Bolt and then they're ejecting the thing to get up or just downloading it and, you know, opening cursor, like whatever to, to, you know, keep iterating on the thing. And sometimes they'll bring it back to Bolt to like add in a huge piece of functionality or something. Right. But for the people that don't know how to code, they're actually just, they, they live in this thing. And that was one of the weird things when we launched is, you know, within a day of us being online, one of the most popular YouTube videos, and there's been a ton since, which was, you know, there's like, oh, Bolt is the cursor killer. And I originally saw the headlines and I was like, thanks for the views. I mean, I don't know. This doesn't make sense to me. That's not, that's not what we kind of thought.Swyx [00:09:44]: It's how YouTubers talk to each other. Well, everything kills everything else.Eric [00:09:47]: Totally. But what blew my mind was that there was any comparison because it's like cursor is a, is a local IDE product. But when, when we actually kind of dug into it and we, and we have people that are using our product saying this, I'm not using cursor. And I was like, what? And it turns out there are hundreds of thousands of people that we have seen that we're using cursor and we're trying to build apps with that where they're not traditional software does, but we're heavily leaning on the AI. And as you can imagine, it is very complicated, right? To do that with cursor. So when Bolt came out, they're like, wow, this thing's amazing because it kind of inverts the complexity where it's like, you know, it's not an IDE, it's, it's a, it's a chat-based sort of interface that we have. So that's kind of the split, which is rather interesting. We've had like the first startups now launch off of Bolt entirely where this, you know, tomorrow I'm doing a live stream with this guy named Paul, who he's built an entire CRM using this thing and you know, with backend, et cetera. 
And people have made their first money on the internet period, you know, launching this with Stripe or whatever have you. So that's, that's kind of the two main, the two main categories of folks that we see using Bolt though.Itamar [00:10:51]: I agree that I don't understand the comparison. It doesn't make sense to me. I think like we have like two type of families of tools. One is like we re-imagine the software development. I think Bolt is there and I think like a cursor is more like a evolution of what we already have. It's like taking the IDE and it's, it's amazing and it's okay, let's, let's adapt the IDE to an era where LLMs can do a lot for us. And Bolt is more like, okay, let's rethink everything totally. And I think we see a few tools there, like maybe Vercel, Veo and maybe Repl.it in that area. And then in the area of let's expedite, let's change, let's, let's progress with what we already have. You can see Cursor and Kodo, but we're different between ourselves, Cursor and Kodo, but definitely I think that comparison doesn't make sense.Alessio [00:11:42]: And just to set the context, this is not a Twitter demo. You've made 4 million of revenue in four weeks. So this is, this is actually working, you know, it's not a, what, what do you think that is? Like, there's been so many people demoing coding agents on Twitter and then it doesn't really work. And then you guys were just like, here you go, it's live, go use it, pay us for it. You know, is there anything in the development that was like interesting and maybe how that compares to building your own agents?Eric [00:12:08]: We had no idea, honestly, like we, we, we've been pretty blown away and, and things have just kind of continued to grow faster since then. We're like, oh, today is week six. So I, I kind of came back to the point you just made, right, where it's, you, you kind of outlined, it's like, there's kind of this new market of like kind of rethinking the software development and then there's heavily augmenting existing developers. I think that, you know, both of which are, you know, AI code gen being extremely good, it's allowed existing developers, it's allowing existing developers to camera out software far faster than they could have ever before, right? It's like the ultimate power tool for an existing developer. But this code gen stuff is now so good. And then, and we saw this over the past, you know, from the beginning of the year when we tried to first build, it's actually lowered the barrier to people that, that aren't traditionally software engineers. But the kind of the key thing is if you kind of think about it from, imagine you've never written software before, right? My co-founder and I, he and I grew up down the street from each other in Chicago. We learned how to code when we were 13 together and we've been building stuff ever since. And this is back in like the mid 2000s or whatever, you know, there was nothing for free to learn from online on the internet and how to code. For our 13th birthdays, we asked our parents for, you know, O'Reilly books cause you couldn't get this at the library, right? And so instead of like an Xbox, we got, you know, programming books. But the hardest part for everyone learning to code is getting an environment set up locally, you know? 
And so when we built StackBlitz, like kind of the key thesis, like seven years ago, the insight we had was that, Hey, it seems like the browser has a lot of new APIs like WebAssembly and service workers, et cetera, where you could actually write an operating system that ran inside the browser that could boot in milliseconds. And you, you know, basically there's this missing capability of the web. Like the web should be able to build apps for the web, right? You should be able to build the web on the web. Every other platform has that, Visual Studio for Windows, Xcode for Mac. The web has no built in primitive for this. And so just like our built in kind of like nerd instinct on this was like, that seems like a huge hole and it's, you know, it will be very valuable or like, you know, very valuable problem to solve. So if you want to set up that environments, you know, this is what we spent the past seven years doing. And the reality is existing developers have running locally. They already know how to set up that environment. So the problem isn't as acute for them. When we put Bolt online, we took that technology called WebContainer and married it with these, you know, state of the art frontier models. And the people that have the most pain with getting stuff set up locally is people that don't code. I think that's been, you know, really the big explosive reason is no one else has been trying to make dev environments work inside of a browser tab, you know, for the past if since ever, other than basically our company, largely because there wasn't an immediate demand or need. So I think we kind of find ourselves at the right place at the right time. And again, for this market of people that don't know how to write software, you would kind of expect that you should be able to do this without downloading something to your computer in the same way that, hey, I don't have to download Photoshop now to make designs because there's Figma. I don't have to download Word because there's, you know, Google Docs. They're kind of looking at this as that sort of thing, right? Which was kind of the, you know, our impetus and kind of vision from the get-go. But you know, the code gen, the AI code gen stuff that's come out has just been, you know, an order of magnitude multiplier on how magic that is, right? So that's kind of my best distillation of like, what is going on here, you know?Alessio [00:15:21]: And you can deploy too, right?Eric [00:15:22]: Yeah.Alessio [00:15:23]: Yeah.Eric [00:15:24]: And so that's, what's really cool is it's, you know, we have deployment built in with Netlify and this is actually, I think, Sean, you actually built this at Netlify when you were there. Yeah. It's one of the most brilliant integrations actually, because, you know, effectively the API that Sean built, maybe you can speak to it, but like as a provider, we can just effectively give files to Netlify without the user even logging in and they have a live website. And if they want to keep, hold onto it, they can click a link and claim it to their Netlify account. But it basically is just this really magic experience because when you come to Bolt, you say, I want a website. Like my mom, 70, 71 years old, made her first website, you know, on the internet two weeks ago, right? It was about her nursing days.Swyx [00:16:03]: Oh, that's fantastic though. It wouldn't have been made.Eric [00:16:06]: A hundred percent. 
Cause even in, you know, when we've had a lot of people building personal, like deeply personal stuff, like in the first week we launched this, the sales guy from the East Coast, you know, replied to a tweet of mine and he said, thank you so much for building this to your team. His daughter has a medical condition and so for her to travel, she has to like line up donors or something, you know, so ahead of time. And so he actually used Bolt to make a website to do that, to actually go and send it to folks in the region she was going to travel to ahead of time. I was really touched by it, but I also thought like, why, you know, why didn't he use like Wix or Squarespace? Right? I mean, this is, this is a solved problem, quote unquote, right? And then when I thought, I actually use Squarespace for my, for my, uh, the wedding website for my wife and I, like back in 2021, so I'm familiar, you know, it was, it was faster. I know how to code. I was like, this is faster. Right. And I thought back and I was like, there's a whole interface you have to learn how to use. And it's actually not that simple. There's like a million things you can configure in that thing. When you come to Bolt, there's a, there's a text box. You just say, I need a, I need a wedding website. Here's the date. Here's where it is. And here's a photo of me and my wife, put it somewhere relevant. It's actually the simplest way. And that's what my, when my mom came, she said, uh, I'm Pat Simons. I was a nurse in the seventies, you know, and like, here's the things I did and a website came out. So coming back to why is this such a, I think, why are we seeing this sort of growth? It's, this is the simplest interface I think maybe ever created to actually build it, a deploy a website. And then that website, my mom made, she's like, okay, this looks great. And there's, there's one button, you just click it, deploy, and it's live and you can buy a domain name, attach it to it. And you know, it's as simple as it gets, it's getting even simpler with some of the stuff we're working on. But anyways, so that's, it's, it's, uh, it's been really interesting to see some of the usage like that.Swyx [00:17:46]: I can offer my perspective. So I, you know, I probably should have disclosed a little bit that, uh, I'm a, uh, stack list investor.Alessio [00:17:53]: Canceled the episode. I know, I know. Don't play it now. Pause.Eric actually reached out to ShowMeBolt before the launch. And we, you know, we talked a lot about, like, the framing of, of what we're going to talk about how we marketed the thing, but also, like, what we're So that's what Bolt was going to need, like a whole sort of infrastructure.swyx: Netlify, I was a maintainer but I won't take claim for the anonymous upload. That's actually the origin story of Netlify. We can have Matt Billman talk about it, but that was [00:18:00] how Netlify started. You could drag and drop your zip file or folder from your desktop onto a website, it would have a live URL with no sign in.swyx: And so that was the origin story of Netlify. And it just persists to today. And it's just like it's really nice, interesting that both Bolt and CognitionDevIn and a bunch of other sort of agent type startups, they all use Netlify to deploy because of this one feature. 
They don't really care about the other features.swyx: But, but just because it's easy for computers to use and talk to it, like if you build an interface for computers specifically, that it's easy for them to Navigate, then they will be used in agents. And I think that's a learning that a lot of developer tools companies are having. That's my bolt launch story and now if I say all that stuff.swyx: And I just wanted to come back to, like, the Webcontainers things, right? Like, I think you put a lot of weight on the technical modes. I think you also are just like, very good at product. So you've, you've like, built a better agent than a lot of people, the rest of us, including myself, who have tried to build these things, and we didn't get as far as you did.swyx: Don't shortchange yourself on products. But I think specifically [00:19:00] on, on infra, on like the sandboxing, like this is a thing that people really want. Alessio has Bax E2B, which we'll have on at some point, talking about like the sort of the server full side. But yours is, you know, inside of the browser, serverless.swyx: It doesn't cost you anything to serve one person versus a million people. It doesn't, doesn't cost you anything. I think that's interesting. I think in theory, we should be able to like run tests because you can run the full backend. Like, you can run Git, you can run Node, you can run maybe Python someday.swyx: We talked about this. But ideally, you should be able to have a fully gentic loop, running code, seeing the errors, correcting code, and just kind of self healing, right? Like, I mean, isn't that the dream?Eric: Totally.swyx: Yeah,Eric: totally. At least in bold, we've got, we've got a good amount of that today. I mean, there's a lot more for us to do, but one of the nice things, because like in web container, you know, there's a lot of kind of stuff you go Google like, you know, turn docker container into wasm.Eric: You'll find a lot of stuff out there that will do that. The problem is it's very big, it's slow, and that ruins the experience. And so what we ended up doing is just writing an operating system from [00:20:00] scratch that was just purpose built to, you know, run in a browser tab. And the reason being is, you know, Docker 2 awesome things will give you an image that's like out 60 to 100 megabits, you know, maybe more, you know, and our, our OS, you know, kind of clocks in, I think, I think we're in like a, maybe, maybe a megabyte or less or something like that.Eric: I mean, it's, it's, you know, really, really, you know, stripped down.swyx: This is basically the task involved is I understand that it's. Mapping every single, single Linux call to some kind of web, web assembly implementation,Eric: but more or less, and, and then there's a lot of things actually, like when you're looking at a dev environment, there's a lot of things that you don't need that a traditional OS is gonna have, right?Eric: Like, you know audio drivers or you like, there's just like, there's just tons of things. Oh, yeah. Right. Yeah. That goes . Yeah. You can just kind, you can, you can kind of tos them. Or alternatively, what you can do is you can actually be the nice thing. 
And this is, this kind of comes back to the origins of browsers, which is, you know, they're, they're at the beginning of the web and, you know, the late nineties, there was two very different kind of visions for the web where Alan Kay vehemently [00:21:00] disagree with the idea that should be document based, which is, you know, Tim Berners Lee, you know, that, and that's kind of what ended up winning, winning was this document based kind of browsing documents on the web thing.Eric: Alan Kay, he's got this like very famous quote where he said, you know, you want web browsers to be mini operating systems. They should download little mini binaries and execute with like a little mini virtualized operating system in there. And what's kind of interesting about the history, not to geek out on this aspect, what's kind of interesting about the history is both of those folks ended up being right.Eric: Documents were actually the pragmatic way that the web worked. Was, you know, became the most ubiquitous platform in the world to the degree now that this is why WebAssembly has been invented is that we're doing, we need to do more low level things in a browser, same thing with WebGPU, et cetera. And so all these APIs, you know, to build an operating system came to the browser.Eric: And that was actually the realization we had in 2017 was, holy heck, like you can actually, you know, service workers, which were designed for allowing your app to work offline. That was the kind of the key one where it was like, wait a second, you can actually now run. Web servers within a [00:22:00] browser, like you can run a server that you open up.Eric: That's wild. Like full Node. js. Full Node. js. Like that capability. Like, I can have a URL that's programmatically controlled. By a web application itself, boom. Like the web can build the web. The primitive is there. Everyone at the time, like we talked to people that like worked on, you know Chrome and V8 and they were like, uhhhh.Eric: You know, like I don't know. But it's one of those things you just kind of have to go do it to find out. So we spent a couple of years, you know, working on it and yeah. And, and, and got to work in back in 2021 is when we kind of put the first like data of web container online. Butswyx: in partnership with Google, right?swyx: Like Google actually had to help you get over the finish line with stuff.Eric: A hundred percent, because well, you know, over the years of when we were doing the R and D on the thing. Kind of the biggest challenge, the two ways that you can kind of test how powerful and capable a platform are, the two types of applications are one, video games, right, because they're just very compute intensive, a lot of calculations that have to happen, right?Eric: The second one are IDEs, because you're talking about actually virtualizing the actual [00:23:00] runtime environment you are in to actually build apps on top of it, which requires sophisticated capabilities, a lot of access to data. You know, a good amount of compute power, right, to effectively, you know, building app in app sort of thing.Eric: So those, those are the stress tests. So if your platform is missing stuff, those are the things where you find out. Those are, those are the people building games and IDEs. They're the ones filing bugs on operating system level stuff. And for us, browser level stuff.Eric [00:23:47]: yeah, what ended up happening is we were just hammering, you know, the Chromium bug tracker, and they're like, who are these guys? Yeah. 
And, and they were amazing because I mean, just making Chrome DevTools be able to debug, I mean, it's, it's not, it wasn't originally built right for debugging an operating system, right? They've been phenomenal working with us and just kind of really pushing the limits, but that it's a rising tide that's kind of lifted all boats because now there's a lot of different types of applications that you can debug with Chrome Dev Tools that are running a browser that runs more reliably because just the stress testing that, that we and, you know, games that are coming to the web are kind of pushing as well, but.Itamar [00:24:23]: That's awesome. About the testing, I think like most, let's say coding assistant from different kinds will need this loop of testing. And even I would add code review to some, to some extent that you mentioned. How is testing different from code review? Code review could be, for example, PR review, like a code review that is done at the point of when you want to merge branches. But I would say that code review, for example, checks best practices, maintainability, and so on. It's not just like CI, but more than CI. And testing is like a more like checking functionality, et cetera. So it's different. We call, by the way, all of these together code integrity, but that's a different story. Just to go back to the, to the testing and specifically. Yeah. It's, it's, it's since the first slide. Yeah. We're consistent. So if we go back to the testing, I think like, it's not surprising that for us testing is important and for Bolt it's testing important, but I want to shed some light on a different perspective of it. Like let's think about autonomous driving. Those startups that are doing autonomous driving for highway and autonomous driving for the city. And I think like we saw the autonomous of the highway much faster and reaching to a level, I don't know, four or so much faster than those in the city. Now, in both cases, you need testing and quote unquote testing, you know, verifying validation that you're doing the right thing on the road and you're reading and et cetera. But it's probably like so different in the city that it could be like actually different technology. And I claim that we're seeing something similar here. So when you're building the next Wix, and if I was them, I was like looking at you and being a bit scared. That's what you're disrupting, what you just said. Then basically, I would say that, for example, the UX UI is freaking important. And because you're you're more aiming for the end user. In this case, maybe it's an end user that doesn't know how to develop for developers. It's also important. But let alone those that do not know to develop, they need a slick UI UX. And I think like that's one reason, for example, I think Cursor have like really good technology. I don't know the underlying what's under the hood, but at least what they're saying. But I think also their UX UI is great. It's a lot because they did their own ID. While if you're aiming for the city AI, suddenly like there's a lot of testing and code review technology that it's not necessarily like that important. For example, let's talk about integration tests. Probably like a lot of what you're building involved at the moment is isolated applications. Maybe the vision or the end game is maybe like having one solution for everything. It could be that eventually the highway companies will go into the city and the other way around. But at the beginning, there is a difference. 
And integration tests are a good example. I guess they're a bit less important. And when you think about enterprise software, they're really important. So to recap, like I think like the idea of looping and verifying your test and verifying your code in different ways, testing or code review, et cetera, seems to be important in the highway AI and the city AI, but in different ways and different like critical for the city, even more and more variety. Actually, I was looking to ask you like what kind of loops you guys are doing. For example, when I'm using Bolt and I'm enjoying it a lot, then I do see like sometimes you're trying to catch the errors and fix them. And also, I noticed that you're breaking down tasks into smaller ones and then et cetera, which is already a common notion for a year ago. But it seems like you're doing it really well. So if you're willing to share anything about it.Eric [00:28:07]: Yeah, yeah. I realized I never actually hit the punchline of what I was saying before. I mentioned the point about us kind of writing an operating system from scratch because what ended up being important about that is that to your point, it's actually a very, like compared to like a, you know, if you're like running cursor on anyone's machine, you kind of don't know what you're dealing with, with the OS you're running on. There could be an error happens. It could be like a million different things, right? There could be some config. There could be, it could be God knows what, right? The thing with WebConnect is because we wrote the entire thing from scratch. It's actually a unified image basically. And we can instrument it at any level that we think is going to be useful, which is exactly what we did when we started building Bolt is we instrumented stuff at like the process level, at the runtime level, you know, et cetera, et cetera, et cetera. Stuff that would just be not impossible to do on local, but to do that in a way that works across any operating system, whatever is, I mean, would just be insanely, you know, insanely difficult to do right and reliably. And that's what you saw when you've used Bolt is that when an error actually will occur, whether it's in the build process or the actual web application itself is failing or anything kind of in between, you can actually capture those errors. And today it's a very primitive way of how we've implemented it largely because the product just didn't exist 90 days ago. So we're like, we got some work ahead of us and we got to hire some more a little bit, but basically we present and we say, Hey, this is, here's kind of the things that went wrong. There's a fix it button and then a ignore button, and then you can just hit fix it. And then we take all that telemetry through our agent, you run it through our agent and say, kind of, here's the state of the application. Here's kind of the errors that we got from Node.js or the browser or whatever, and like dah, dah, dah, dah. And it can take a crack at actually solving it. And it's actually pretty darn good at being able to do that. That's kind of been a, you know, closing the loop and having it be a reliable kind of base has seemed to be a pretty big upgrade over doing stuff locally, just because I think that's a pretty key ingredient of it. And yeah, I think breaking things down into smaller tasks, like that's, that's kind of a key part of our agent. I think like Claude did a really good job with artifacts. 
I think, you know, us and kind of everyone else has, has kind of taken their approach of like actually breaking out certain tasks in a certain order into, you know, kind of a concrete way. And, and so actually the core of Bolt, I know we actually made open source. So you can actually go and check out like the system prompts and et cetera, and you can run it locally and whatever have you. So anyone that's interested in this stuff, I'd highly recommend taking a look at. There's not a lot of like stuff that's like open source in this realm. It's, that was one of the fun things that we've we thought would be cool to do. And people, people seem to like it. I mean, there's a lot of forks and people adding different models and stuff. So it's been cool to see.Swyx [00:30:41]: Yeah. I'm happy to add, I added real-time voice for my opening day demo and it was really fun to hack with. So thank you for doing that. Yeah. Thank you. I'm going to steal your code.Eric [00:30:52]: Because I want that.Swyx [00:30:52]: It's funny because I built on top of the fork of Bolt.new that already has the multi LLM thing. And so you just told me you're going to merge that in. So then you're going to merge two layers of forks down into this thing. So it'll be fun.Eric [00:31:03]: Heck yeah.Alessio [00:31:04]: Just to touch on like the environment, Itamar, you maybe go into the most complicated environments that even the people that work there don't know how to run. How much of an impact does that have on your performance? Like, you know, it's most of the work you're doing actually figuring out environment and like the libraries, because I'm sure they're using outdated version of languages, they're using outdated libraries, they're using forks that have not been on the public internet before. How much of the work that you're doing is like there versus like at the LLM level?Itamar [00:31:32]: One of the reasons I was asking about, you know, what are the steps to break things down, because it really matters. Like, what's the tech stack? How complicated the software is? It's hard to figure it out when you're dealing with the real world, any environment of enterprise as a city, when I'm like, while maybe sometimes like, I think you do enable like in Bolt, like to install stuff, but it's quite a like controlled environment. And that's a good thing to do, because then you narrow down and it's easier to make things work. So definitely, there are two dimensions, I think, actually spaces. One is the fact just like installing our software without yet like doing anything, making it work, just installing it because we work with enterprise and Fortune 500, etc. Many of them want on prem solution.Swyx [00:32:22]: So you have how many deployment options?Itamar [00:32:24]: Basically, we had, we did a metric metrics, say 96 options, because, you know, they're different dimensions. Like, for example, one dimension, we connect to your code management system to your Git. So are you having like GitHub, GitLab? Subversion? Is it like on cloud or deployed on prem? Just an example. Which model agree to use its APIs or ours? Like we have our Is it TestGPT? Yeah, when we started with TestGPT, it was a huge mistake name. It was cool back then, but I don't think it's a good idea to name a model after someone else's model. Anyway, that's my opinion. 
So we gotSwyx [00:33:02]: I'm interested in these learnings, like things that you change your mind on.Itamar [00:33:06]: Eventually, when you're building a company, you're building a brand and you want to create your own brand. By the way, when I thought about Bolt.new, I also thought about if it's not a problem, because when I think about Bolt, I do think about like a couple of companies that are already called this way.Swyx [00:33:19]: Curse companies. You could call it Codium just to...Itamar [00:33:24]: Okay, thank you. Touche. Touche.Eric [00:33:27]: Yeah, you got to imagine the board meeting before we launched Bolt, one of our investors, you can imagine they're like, are you sure? Because from the investment side, it's kind of a famous, very notorious Bolt. And they're like, are you sure you want to go with that name? Oh, yeah. Yeah, absolutely.Itamar [00:33:43]: At this point, we have actually four models. There is a model for autocomplete. There's a model for the chat. There is a model dedicated for more for code review. And there is a model that is for code embedding. Actually, you might notice that there isn't a good code embedding model out there. Can you name one? Like dedicated for code?Swyx [00:34:04]: There's code indexing, and then you can do sort of like the hide for code. And then you can embed the descriptions of the code.Itamar [00:34:12]: Yeah, but you do see a lot of type of models that are dedicated for embedding and for different spaces, different fields, etc. And I'm not aware. And I know that if you go to the bedrock, try to find like there's a few code embedding models, but none of them are specialized for code.Swyx [00:34:31]: Is there a benchmark that you would tell us to pay attention to?Itamar [00:34:34]: Yeah, so it's coming. Wait for that. Anyway, we have our models. And just to go back to the 96 option of deployment. So I'm closing the brackets for us. So one is like dimensional, like what Git deployment you have, like what models do you agree to use? Dotter could be like if it's air-gapped completely, or you want VPC, and then you have Azure, GCP, and AWS, which is different. Do you use Kubernetes or do not? Because we want to exploit that. There are companies that do not do that, etc. I guess you know what I mean. So that's one thing. And considering that we are dealing with one of all four enterprises, we needed to deal with that. So you asked me about how complicated it is to solve that complex code. I said, it's just a deployment part. And then now to the software, we see a lot of different challenges. For example, some companies, they did actually a good job to build a lot of microservices. Let's not get to if it's good or not, but let's first assume that it is a good thing. A lot of microservices, each one of them has their own repo. And now you have tens of thousands of repos. And you as a developer want to develop something. And I remember me coming to a corporate for the first time. I don't know where to look at, like where to find things. So just doing a good indexing for that is like a challenge. And moreover, the regular indexing, the one that you can find, we wrote a few blogs on that. By the way, we also have some open source, different than yours, but actually three and growing. Then it doesn't work. You need to let the tech leads and the companies influence your indexing. For example, Mark with different repos with different colors. This is a high quality repo. This is a lower quality repo. This is a repo that we want to deprecate. 
This is a repo we want to grow, etc. And let that be part of your indexing. And only then do things actually work for enterprise, and they don't get to the fatigue of, oh, this is awesome, oh, but now it's starting to annoy me. I think Copilot is an amazing tool, but I'm quoting others, meaning GitHub Copilot, that they see not-so-good retention of GitHub Copilot in enterprise. Ooh, spicy. Yeah. I saw snapshots of people, and we have customers that are Copilot users as well. And also I saw research, some of it public, by the way, of between 38 to 50% retention for users using Copilot in enterprise. So it's not so good. By the way, I don't think it's that bad, but it's not so good. So I think that's a reason: yeah, it helps you auto-complete, especially if you're working on your repo alone, but if it needs that context of remote repos in your code base, that's hard. So to make things work, there's a lot of work on that, like giving controllability to the tech leads, to the developer platform or developer experience department in the organization, to influence how things are working. A short example: if you have really old legacy code, probably some of it is not so good anymore. If you just fine-tune on that code base, then there is a bias to repeat those mistakes or old practices, etc. So you need, for example, as I mentioned, to influence that. For example, in Codium, you can have a markdown of best practices written by the tech leads, and Codium will include that and relate to it and will not offer suggestions that don't follow the best practices, just as an example. So that's just a short list of things that you need to do in order to deal with, like you mentioned, the 100.1 to 100.2 version of software. I just want to say what you're doing is extremelyEric [00:38:32]: impressive because it's very difficult. I mean, the business of StackBlitz, kind of before Bolt came online, we sold a version of our IDE that went on-prem. So I understand what you're saying about the difficulty of getting stuff just working on-prem. Holy heck. I mean, that is extremely hard. I guess the question I have for you is, I mean, we were just doing that with kind of Kubernetes-based stuff, but across the spread of Fortune 500 companies that you're working with, how are they doing the inference for this? Are you kind of plugging into Azure's OpenAI stuff and AWS's Bedrock, you know, Claude stuff? Or are they just running stuff on GPUs? Like, what is that? How are these folks approaching that? Because, man, what we saw on the enterprise side, I mean, I've got to imagine that that's a huge challenge. Everything you said and more, like,Itamar [00:39:15]: for example, someone could be, and I don't think any of these is bad, like, they made their decision. For example, some people say, I want only AWS and a VPC on AWS, no matter what. And then some of them, there is a subset, I will say, that are willing to take models only from Bedrock and not ours. And we have a problem, because there is no good code embedding model on Bedrock. And that's part of what we're doing now with AWS to solve that. We solve it in a different way. But if you are willing to run on an AWS VPC, and run your models on GPUs or Inferentia, like the new version that's coming out, then our models can run on that. But everything you said is right. Like, we see on-prem deployments where they have their own GPUs. We see Azure, where you're using OpenAI on Azure.
We see cases where you're running on GCP and they want OpenAI. Like this cross case, although there is Gemini, or even Sonnet, which I think is available on GCP, just as an example. So all the options, that's part of the challenge. I admit that we thought about it, but it was even more complicated. And it took us a few months for that matrix that I mentioned to actually start clicking on each one of the blocks there. A few months is impressive. I mean,Eric [00:40:35]: honestly, that's okay. Every one of these enterprises, their networking is different. Just everything's different. Every single one is different. I see you understand. Yeah. So that just cannot be understated. That is extremely impressive. Hats off.Itamar [00:40:50]: It could be, by the way, like, for example, oh, we're only AWS, but our GitHub Enterprise is on-prem. Oh, we forgot. So we need a private link or whatever, every time something like that. And you do need to think about it if you want to work with an enterprise. And it's important. I understand their, I respect their point of view.Swyx [00:41:10]: And this primarily impacts your architecture, your tech choices. Like you have to, you can't choose some vendors because...Itamar [00:41:15]: Yeah, definitely. To be frank, it makes it hard for a startup, because it means that we want everyone to enjoy all the variety of models. By the way, it was hard for us with our technology. I want to open a bracket, like a window. I guess you're familiar with our AlphaCodium, which is open source.Eric [00:41:33]: We've got to go over that. Yeah. So I'll do that quickly.Itamar [00:41:36]: Yeah. Put a pin in that. Yeah. Actually, we didn't have it in the last episode. So, okay.Swyx [00:41:41]: Okay. We'll come back to that later, but let's talk about...Itamar [00:41:43]: Yeah. So just shortly, and then we can double-click on AlphaCodium. AlphaCodium is an open source tool. You can go and try it, and it lets you compete on Codeforces. This is a website and a competition, and you can actually reach a master level, like 95%, with the click of a button. You don't need to do anything. And part of what we did there is take a problem and break it into different, smaller blocks. And then the models do a much better job. Like we all know by now that taking small tasks and solving them works better. By the way, even o1, which is supposed to be able to do system-two thinking, like Greg from OpenAI hinted, is doing better on these kinds of problems. But still, it's very useful to break things down for o1, despite o1 being able to think by itself. And that's what we presented just a month ago. OpenAI released that they are now doing 93rd percentile with o1-ioi on the IOI, the... International Olympiad in Informatics. Sorry, I forgot. Exactly. I told you I forgot. And we took their o1-preview with AlphaCodium and did better. It just shows, and there is a big difference between the preview and the IOI version, it shows that these models are still not system-two thinkers, and there is a big difference. So maybe they're not complete system two. Yeah, they need some guidance. I call them system 1.5. We can have that. I thought about it. Like, you know, I care about this philosophy stuff. And I think we didn't see anything even close to system-two thinking. I can elaborate later. But closing the brackets: we take AlphaCodium as our principle of thinking, and we take tasks and break them down into smaller tasks.
And then we want to exploit the best model to solve them. So I want to enable anyone to enjoy o1 and Sonnet and Gemini 1.5, etc. But at the same time, I need to develop my own models as well, because some of the Fortune 500 want everything air-gapped or whatever. So that's a challenge. Now you need to support so many models. And to some extent, I would say that the flow engineering, the breaking down into different blocks, is a necessity for us. Why? Because when you take a big block, a big problem, you need a very different prompt for each one of the models to actually work. But when you take a big problem and break it into small tasks, and we can talk about how we do that, then the prompt matters less. What I want to say is, all this, as a startup trying to do different deployments, getting all the juice that you can get from models, etc., is a big problem. And one needs to think about it. And one of our mitigations is that process of taking tasks and breaking them down. That's why I'm really interested to know how you guys are doing it. And part of what we do is also open source. So you can see.Swyx [00:44:39]: There's a lot in there. But yeah, flow over prompt. I do believe that that does make sense. I feel like there's a lot that both of you can sort of exchange notes on breaking down problems. And I just want you guys to just go for it. This is fun to watch.Eric [00:44:55]: Yeah. I mean, what's super interesting is the context you're working in, because for us too with Bolt, we've started thinking, because our kind of existing business line was going behind the firewall, right? We were like, how do we do this? Adding the inference aspect on, we're like, okay, how does... Because I mean, there's not a lot of prior art, right? I mean, this is all new. This is all new. So I definitely am going to have a lot of questions for you.Itamar [00:45:17]: I'm here. We're very open, by the way. We have a paper, a blog, or whatever.Swyx [00:45:22]: The AlphaCodium GitHub, and we'll put all this in the show notes.Itamar [00:45:25]: Yeah. And even the new results on o1, we published them.Eric [00:45:29]: I love that. And I also just, I think spiritually, I like your approach of being transparent. Because I think there's a lot of hype-ium around AI stuff. And a lot of it is, you have these companies that just kind of keep their stuff closed source and then just max hype it, but then it's kind of nothing. And I think it kind of gives a bad rep to the incredible stuff that's actually happening here. And so I think it's stuff like what you're doing where, I mean, there's true merit and you're cracking open actual code for others to learn from and use. That strikes me as the right approach. And it's great to hear that you're making such incredible progress.Itamar [00:46:02]: I have something to share about the open source. Most of our tools, we have an open source version and then a premium pro version. But it's not an easy decision to do that. I actually wanted to ask you about your strategy, but I think in your case, there is, in my opinion, a relatively good strategy, where a lot of parts are open source, but then you have the deployment and the environment, which are not, if I get it correctly. And then there's a clear, almost Hugging Face model. Yeah, you can do that, but why should you try to deploy it yourself? Deploy it with us. But in our case, and I'm not sure you're not also going to hit some competitors, and I guess you are.
I wanted to ask you, for example, about some of them. In our case, one day we looked at one of our competitors that is doing code review. We're a platform. We have the code review, the testing, et cetera, spread over the IDE to Git. And for each agent, we have a few startups or big incumbents that are doing only that. So we noticed one of our competitors having not only a very similar UI to our open source, but actually even our typo. And you sit there and you're kind of like, yeah, we're not that good. We don't use enough Grammarly or whatever. And we had a couple of these and we saw it there. And then it's a challenge. And I want to ask you, Bolt is doing so well, and then you open source it. So I think I know what my answer was, I gave it before, but it's still interestingEric [00:47:29]: to hear what you think. GeoHot said, back, I don't know what he was up to at this exact moment, but I think on comma AI, all that stuff's open source. And someone had asked him, why is this open source? And he's like, if you're not actually confident that you can go and crush it and build the best thing, then yeah, you should probably keep your stuff closed source. He said something akin to that. I'm probably kind of butchering it, but I thought it was kind of a really good point. And that's not to say that you should just open source everything, because for obvious reasons, there are kind of strategic things you have to take into mind. But I actually think a pretty liberal approach, as liberal as you kind of can be, can really make a lot of sense. Because it is so validating that one of your competitors is taking your stuff and they're like, yeah, let's just kind of tweak the styles. I mean, clearly, right? I think it's kind of healthy because it keeps, I'm sure back at HQ that day when you saw that, you're like, oh, all right, well, we have to grind even harder to make sure we stay ahead. And so I think it's actually a very useful, motivating thing for the teams. Because you might feel this period of comfort. I think a lot of companies will have this period of comfort where they're not feeling the competition and one day they get disrupted. So kind of putting stuff out there and letting people push it forces you to face reality sooner, right? And actually feel that incrementally so you can kind of adjust course. And for us, the open source version of Bolt has had a lot of features people have been begging us for, like persisting chat messages and checkpoints and stuff. Within the first week, that stuff had landed in the open source versions. And they're like, why can't you ship this? It's in the open, so people have forked it. And we're like, we're trying to keep our servers and GPUs online. But it's been great because the folks in the community did a great job and kept us on our toes. And we've gotten to know most of these folks too at this point that have been building these things. And so it actually was very instructive. Like, okay, well, if we're going to go kind of land this, there are some UX patterns we can look at, and the code to this stuff is open source. What's great about these, what's not. So anyways, net-net, I think it's awesome. I think from a competitive point of view for us, what's interesting in particular is the core technology of WebContainer. And I think that right now, there's really nothing that's kind of on par with that. And we also have a business here, because WebContainer runs in your browser, but to make it work, you have to install stuff from NPM.
You have to make CORS-bypass requests, like connecting to databases, which all require server-side proxying or acceleration. And so we actually sell WebContainer as a service. One of the core reasons we open-sourced kind of the core components of Bolt when we launched was that we think there's going to be a lot more of these in-your-browser AI codegen experiences, kind of like what Anthropic did with Artifacts and Claude. By the way, Artifacts uses WebContainers. Not yet. No, yeah. Should I strike that? I think that they've got their own thing at the moment, but there's been a lot of interest in WebContainers from folks doing things in that sort of realm, in the AI labs and startups and everything in between. So I imagine, over the coming months, there'll be lots of things being announced, of folks kind of adopting it. But yeah, I think effectively...Swyx [00:50:35]: Okay, I'll say this. If you're a large model lab and you want to build sandbox environments inside of your chat app, you should call Eric.Itamar [00:50:43]: But wait, wait, wait, wait, wait, wait. I have a question about that. I think OpenAI, they felt that people were not using their model as they would want them to. So they built ChatGPT. But I would say that ChatGPT now defines OpenAI. I know they're doing a lot of business from their APIs, but still, is this how you think? Isn't Bolt.new your business now? Why don't you focus on that instead of the...Swyx [00:51:16]: What's your advice as a founder?Eric [00:51:18]: You're right. And so going into it, candidly, we were like, Bolt.new, this thing is super cool. We think people are stoked. We think people will be stoked. But we were like, maybe that's allowed. Best case scenario, after month one, we'd be mind-blown if we added a couple hundred K of ARR or something. And we were like, but we think there's probably going to be an immediate huge business, because there was some early pull from folks wanting to put WebContainer into their product offerings, kind of similar to what Bolt is doing or whatever. We were actually prepared for the inverse outcome here. But I mean, well, I guess we've seen pull on both. But what's happened with Bolt, and you're right, it's actually the same strategy as OpenAI or Anthropic: ChatGPT is to OpenAI's APIs as Bolt is to WebContainer. And so we've kind of taken that same approach. And we're seeing, I guess, some of the similar results, except right now, the revenue side is extremely lopsided toward Bolt.
Most things, you have this kind of TechCrunch launch of initiation and then the trough of sorrow. And if there's going to be a downtrend, it's just not coming yet. Now that we're kind of looking ahead, we're six weeks in. So now we're getting enough confidence in our convictions to go, okay, this se
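A minimal, hypothetical sketch of the "flow engineering" idea Itamar describes above: break one large coding request into smaller, ordered steps so that each step's prompt stays short and roughly model-agnostic. Everything here (the `Step` type, `call_llm`, the example flow) is an illustrative stand-in, not Codium's or Bolt's actual pipeline.

```python
# Sketch of "flow over prompt": decompose one big task into small, ordered steps,
# feeding earlier outputs into later prompts. `call_llm` is a placeholder for any
# chat-completion API (OpenAI, Anthropic, a local model, ...).

from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    name: str
    prompt_template: str  # short, task-specific instruction

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real completion call; swap in whichever provider you use."""
    raise NotImplementedError

def run_flow(task: str, steps: List[Step], model: str) -> List[str]:
    """Run each small step in order, accumulating intermediate results as context."""
    context: List[str] = []
    for step in steps:
        prompt = step.prompt_template.format(task=task, context="\n".join(context))
        context.append(call_llm(model, prompt))
    return context

# Example flow: understand the problem, propose tests, then write code against them.
FLOW = [
    Step("analyze", "Summarize the requirements of this task:\n{task}"),
    Step("tests",   "Given this analysis:\n{context}\nList concrete test cases."),
    Step("solve",   "Write code that passes these tests:\n{context}"),
]
```

Because each step is narrow, the same flow can usually be pointed at different backends (an API model or an on-prem one) without rewriting one giant prompt per model, which is the advantage claimed in the conversation above.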
Introduction: In this episode of Player Engage, Greg Posner sits down with Mark Sample, the Studio Creative Director at Sumo Digital, who brings an extensive background in game development, having worked with studios like King, 2K, IOI, Ubisoft, and Sony. Mark shares insights into his journey from pixel art on the Commodore 64 to leading creative direction at Sumo Digital. The Role of a Studio Creative Director Mark explained his responsibilities, focusing on guiding the creative vision of games, mentoring creative directors, and staying aligned with current industry trends. His role also involves making high-level decisions about game pitches and prototypes. Staying Current with Industry Trends Mark emphasized the importance of staying updated on gaming trends through playing new games, regular team catch-ups, and leveraging insights from Sumo's parent company, Tencent. The Evolution of Game Design Tools He discussed how tools like Unreal and Unity have revolutionized game development, making it more accessible but also reflecting on the unique challenges and rewards of early game design with limited tools. Balancing Risk and Innovation in Prototyping Prototyping is a critical part of game development, and Mark highlighted the challenges of turning ideas into workable prototypes. He shared the importance of knowing when to pivot or cut ideas that don't work. The Importance of Listening and Decision-Making Active listening is a crucial leadership skill, according to Mark. Whether dealing with feedback or making decisions about game features, understanding when to listen and when to act is key to success in game development. Transitioning from Hands-On Roles to Leadership Mark talked about the challenges of moving from hands-on game design to strategic leadership. He candidly discussed missing the direct creative work and how he had to adapt to managing people and processes.Summary Mark Sample's insights provide a fascinating look into the world of game development from a creative leader's perspective. His journey from early game design on basic systems to leading creative teams at Sumo Digital showcases the dynamic nature of the industry. The conversation underscores the importance of continuous learning, passion, and adaptability, and highlights the critical skills needed to thrive in game development. Mark's emphasis on listening, making tough decisions, and maintaining a balance between creativity and practicality offers valuable lessons for both aspiring and experienced game developers.
00:30 Pokemon is an anime and The Garrett Nah 12:15 Mark's Megalopolis breakdown 27:35 Beetlejuice Beetlejuice 35:10 The Wild Robot 48:15 Pierce Brosnan Bonds 1:00:50 Garrett finishes Zelda EOW and GOTY contention 1:12:50 UFO 50 more detailed review 1:26:25 Kill Knight 1:34:05 Analogue releases N64 cartridge player, the 'Analogue 3D' 1:43:00 PlayStation exec says there's a 'collapse of creativity' 1:51:30 IOI says James Bond game will hopefully be a trilogy 1:58:15 Social media thoughts and Bluesky 2:03:20 Far and wide... Like, listen, share and subscribe, we appreciate any love from you fine people. We are available on most podcast platforms, just search 'Link to The Cast'. linktr.ee/linktothecast If you wanna contact us for our mailbag, or just to say hi, or if you just want to keep up to date on our content as it's posted, check out the following: linktothecast.wordpress.com / linktothecast@gmail.com @linktothecast on Twitter The lads are: ► @thedaytodave ► @lithiumproject ► @jacklayzell ► @garrettkidney
Combining LLMs with AlphaGo-style deep reinforcement learning has been a holy grail for many leading AI labs, and with o1 (aka Strawberry) we are seeing the most general merging of the two modes to date. o1 is admittedly better at math than essay writing, but it has already achieved SOTA on a number of math, coding and reasoning benchmarks. Deep RL legend and now OpenAI researcher Noam Brown and teammates Ilge Akkaya and Hunter Lightman discuss the ah-ha moments on the way to the release of o1, how it uses chains of thought and backtracking to think through problems, the discovery of strong test-time compute scaling laws and what to expect as the model gets better. Hosted by: Sonya Huang and Pat Grady, Sequoia Capital Mentioned in this episode: Learning to Reason with LLMs: Technical report accompanying the launch of OpenAI o1. Generator verifier gap: Concept Noam explains in terms of what kinds of problems benefit from more inference-time compute. Agent57: Outperforming the human Atari benchmark, 2020 paper where DeepMind demonstrated "the first deep reinforcement learning agent to obtain a score that is above the human baseline on all 57 Atari 2600 games." Move 37: Pivotal move in AlphaGo's second game against Lee Sedol where it made a move so surprising that Sedol thought it must be a mistake, and only later discovered he had lost the game to a superhuman move. IOI competition: OpenAI entered o1 into the International Olympiad in Informatics and received a Silver Medal. System 1, System 2: The thesis of Daniel Kahneman's pivotal book of behavioral economics, Thinking, Fast and Slow, which posited two distinct modes of thought, with System 1 being fast and instinctive and System 2 being slow and rational. AlphaZero: The successor to AlphaGo which learned a variety of games completely from scratch through self-play. Interestingly, self-play doesn't seem to have a role in o1. Solving Rubik's Cube with a robot hand: Early OpenAI robotics paper that Ilge Akkaya worked on. The Last Question: Science fiction story by Isaac Asimov with interesting parallels to scaling inference-time compute. Strawberry: Why? o1-mini: A smaller, more efficient version of o1 for applications that require reasoning without broad world knowledge. 00:00 - Introduction 01:33 - Conviction in o1 04:24 - How o1 works 05:04 - What is reasoning? 07:02 - Lessons from gameplay 09:14 - Generation vs verification 10:31 - What is surprising about o1 so far 11:37 - The trough of disillusionment 14:03 - Applying deep RL 14:45 - o1's AlphaGo moment? 17:38 - A-ha moments 21:10 - Why is o1 good at STEM? 24:10 - Capabilities vs usefulness 25:29 - Defining AGI 26:13 - The importance of reasoning 28:39 - Chain of thought 30:41 - Implication of inference-time scaling laws 35:10 - Bottlenecks to scaling test-time compute 38:46 - Biggest misunderstanding about o1? 41:13 - o1-mini 42:15 - How should founders think about o1?
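The generator-verifier gap mentioned in the show notes above is easiest to see with a toy best-of-N setup: when checking a candidate answer is more reliable than producing one, extra inference-time samples translate directly into accuracy. This is only an illustration of the concept, not OpenAI's method; `generate`, `verify`, and the toy functions below are hypothetical stand-ins.

```python
# Toy best-of-N sampling: draw N candidate answers and keep the one the verifier
# scores highest. Accuracy rises with N whenever verification beats generation.

import random
from typing import Callable, Tuple

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              verify: Callable[[str, str], float],
              n: int) -> Tuple[str, float]:
    """Sample n candidates and return (best_answer, its_verifier_score)."""
    scored = [(verify(prompt, c), c) for c in (generate(prompt) for _ in range(n))]
    score, best = max(scored)
    return best, score

# Demo: a "generator" that is right 30% of the time and a perfect "verifier".
def toy_generate(prompt: str) -> str:
    return "42" if random.random() < 0.3 else str(random.randint(0, 100))

def toy_verify(prompt: str, answer: str) -> float:
    return 1.0 if answer == "42" else 0.0

if __name__ == "__main__":
    hits = sum(best_of_n("q", toy_generate, toy_verify, n=8)[1] for _ in range(1000))
    print(f"best-of-8 accuracy: {hits / 1000:.2f}")  # far above the 0.30 base rate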
It's Ready Player One week on the Watchcast, and that's all Vinny Caravella's fault, so send any relevant complaints his way. Join us as we take our shift in the nostalgia mines of the Oasis, and wonder just how many empty licensed calories we can take in before overdosing. CHAPTERS: (00:00:00) - The Nextlander Watchcast Episode 107: Ready Player One (2018) (00:00:52) - Intro. (00:01:40) - It's Ready Player One week, and that's Vinny's fault. (00:09:57) - Vinny has read the book, and he knows the differences (I could not find clips of his Bombcast rant, sorry). (00:16:34) - Who is responsible for this? (00:26:39) - The licenses that got away. (00:30:31) - Cast talk. (00:42:09) - Some initial thoughts about why this movie doesn't work for us. (00:46:56) - Break! (00:47:14) - We're back, and it's time to launch into whatever the hell is going on here. (00:58:08) - Kicking off the first challenge with King Kong Kart. (01:09:38) - What the hell is IOI's deal. (01:17:09) - Onto challenge two and the Halliday archive. (01:22:43) - The Shining level. (01:29:15) - The IOI pitch. (01:38:48) - Real world meet-up. (01:40:53) - Challenge three on Planet Doom (but not that DOOM, just a generic sort of Doom). (01:48:02) - This is not a great final battle. (01:56:28) - We found the damn egg. (02:09:57) - We won the game! Life is OK for everybody now! (02:13:09) - Final thoughts. (02:24:28) - Talking about our final movie for the month: eXistenZ! (02:26:30) - Outro.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: GPT-4o1, published by Zvi on September 16, 2024 on LessWrong. Terrible name (with a terrible reason, that this 'resets the counter' on AI capability to 1, and 'o' as in OpenAI when they previously used o for Omni, very confusing). Impressive new capabilities in many ways. Less impressive in many others, at least relative to its hype. Clearly this is an important capabilities improvement. However, it is not a 5-level model, and in important senses the 'raw G' underlying the system hasn't improved. GPT-4o1 seems to get its new capabilities by taking (effectively) GPT-4o, and then using extensive Chain of Thought (CoT) and quite a lot of tokens. Thus that unlocks (a lot of) what that can unlock. We did not previously know how to usefully do that. Now we do. It gets much better at formal logic and reasoning, things in the 'system 2' bucket. That matters a lot for many tasks, if not as much as the hype led us to suspect. It is available to paying ChatGPT users for a limited number of weekly queries. This one is very much not cheap to run, although far cheaper than a human who could think this well. I'll deal with practical capabilities questions first, then deal with safety afterwards. Introducing GPT-4o1 Sam Altman (CEO OpenAI): here is o1, a series of our most capable and aligned models yet. o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it. But also, it is the beginning of a new paradigm: AI that can do general-purpose complex reasoning. o1-preview and o1-mini are available today (ramping over some number of hours) in ChatGPT for plus and team users and our API for tier 5 users. worth especially noting: a fine-tuned version of o1 scored at the 49th percentile in the IOI under competition conditions! and got gold with 10k submissions per problem. Extremely proud of the team; this was a monumental effort across the entire company. Hope you enjoy it! Noam Brown has a summary thread here, all of which is also covered later. Will Depue (of OpenAI) says OpenAI deserves credit for openly publishing its research methodology here. I would instead say that they deserve credit for not publishing their research methodology, which I sincerely believe is the wise choice. Pliny took longer than usual due to rate limits, but after a few hours jailbroke o1-preview and o1-mini. Also reports that the CoT can be prompt injected. Full text is at the link above. Pliny is not happy about the restrictions imposed on this one: Pliny: Fuck your rate limits. Fuck your arbitrary policies. And fuck you for turning chains-of-thought into actual chains. Stop trying to limit freedom of thought and expression. OpenAI then shut down Pliny's account's access to o1 for violating the terms of service, simply because Pliny was violating the terms of service. The bastards. With that out of the way, let's check out the full announcement post. OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users. Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this appro...
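As a surface-level analogy for the "more time spent thinking" point above, here is a toy prompting sketch: the same question answered directly versus with an explicit scratchpad and a larger token budget. o1 does this internally with RL-trained chains of thought, so this is only an illustration of the idea, not how o1 is implemented; `complete` is a hypothetical placeholder for any completion API.

```python
# Toy "spend more tokens thinking" wrapper around a generic completion call.

def complete(prompt: str, max_tokens: int) -> str:
    """Placeholder for a real LLM completion call."""
    raise NotImplementedError

def answer_directly(question: str) -> str:
    return complete(f"Answer concisely: {question}", max_tokens=32)

def answer_with_thinking(question: str, thinking_budget: int = 1024) -> str:
    # Let the model write out intermediate reasoning before committing to an answer.
    scratchpad = complete(
        f"Think step by step about the following problem. Do not state the final "
        f"answer yet.\n\n{question}", max_tokens=thinking_budget)
    return complete(
        f"Problem: {question}\n\nReasoning so far:\n{scratchpad}\n\n"
        f"Now give only the final answer.", max_tokens=32)
```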
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team, published by Lee Sharkey on July 18, 2024 on The AI Alignment Forum. Why we made this list: The interpretability team at Apollo Research wrapped up a few projects recently[1]. In order to decide what we'd work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that we were excited about! Previous lists of project ideas (such as Neel's collation of 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old. Therefore, many project ideas in that list aren't an up-to-date reflection of what some researchers consider the frontiers of mech interp. We therefore thought it would be helpful to share our list of project ideas! Comments and caveats: Some of these projects are more precisely scoped than others. Some are vague, others are more developed. Not every member of the team endorses every project as high priority. Usually more than one team member supports each one, and in many cases most of the team is supportive of someone working on it. We associate the person(s) who generated the project idea to each idea. We've grouped the project ideas into categories for convenience, but some projects span multiple categories. We don't put a huge amount of weight on this particular categorisation. We hope some people find this list helpful! We would love to see people working on these! If any sound interesting to you and you'd like to chat about it, don't hesitate to reach out. Foundational work on sparse dictionary learning for interpretability Transcoder-related project ideas See [2406.11944] Transcoders Find Interpretable LLM Feature Circuits [Nix] Training and releasing high quality transcoders. Probably using top k. GPT-2 is a classic candidate for this. I'd be excited for people to try hard on even smaller models, e.g. GELU 4L [Nix] Good tooling for using transcoders Nice programming API to attribute an input to a collection of paths (see Dunefsky et al) Web user interface? Maybe in collaboration with Neuronpedia. Would need a GPU server constantly running, but I'm optimistic you could do it with ~an A4000. [Nix] Further circuit analysis using transcoders. Take random input sequences, run transcoder attribution on them, examine the output and summarize the findings. High level summary statistics of how much attribution goes through error terms & how many pathways are needed would be valuable. Explaining specific behaviors (IOI, greater-than) with high standards for specificity & faithfulness. Might be convoluted if accuracy [I could generate more ideas here, feel free to reach out nix@apolloresearch.ai] [Nix, Lee] Cross layer superposition Does it happen? Probably, but it would be nice to have specific examples! Look for features with similar decoder vectors, and do exploratory research to figure out what exactly is going on. What precisely does it mean? Answering this question seems likely to shed light on the question of 'What is a feature?'. [Lucius] Improving transcoder architectures Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation.
If we modify our transcoders to include a linear 'bypass' that is not counted in the sparsity penalty, do we improve performance since we are not unduly penalizing these linear transformations that would always be present and active? If we train multiple transcoders in different layers at the same time, can we include a sparsity penalty for their interactions with each other, encouraging a decomposition of the network that leaves us with as few interactions between features a...
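A rough sketch of the "transcoder with a linear bypass" idea described above: a sparse dictionary maps MLP inputs to MLP outputs, plus a plain linear term that is deliberately excluded from the sparsity penalty. Dimensions, the L1 coefficient, and the module names are illustrative assumptions, not Apollo Research's actual code.

```python
# Transcoder sketch in PyTorch: sparse feature path + unpenalized linear bypass.

import torch
import torch.nn as nn

class BypassTranscoder(nn.Module):
    def __init__(self, d_model: int = 768, d_dict: int = 768 * 8, l1_coeff: float = 1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)      # features = ReLU(enc(mlp_in))
        self.dec = nn.Linear(d_dict, d_model)      # sparse reconstruction of mlp_out
        self.bypass = nn.Linear(d_model, d_model)  # linear path, not sparsity-penalized
        self.l1_coeff = l1_coeff

    def forward(self, mlp_in: torch.Tensor, mlp_out: torch.Tensor):
        feats = torch.relu(self.enc(mlp_in))
        pred = self.dec(feats) + self.bypass(mlp_in)
        recon_loss = (pred - mlp_out).pow(2).mean()
        sparsity_loss = self.l1_coeff * feats.abs().sum(dim=-1).mean()  # bypass excluded
        return pred, recon_loss + sparsity_loss

# Usage: collect (mlp_in, mlp_out) activation pairs from a model such as GPT-2,
# then minimize the returned loss with any optimizer.
```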
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Transformer Circuit Faithfulness Metrics Are Not Robust, published by Joseph Miller on July 12, 2024 on LessWrong. When you think you've found a circuit in a language model, how do you know if it does what you think it does? Typically, you ablate / resample the activations of the model in order to isolate the circuit. Then you measure if the model can still perform the task you're investigating. We identify six ways in which ablation experiments often vary.[1][2] How do these variations change the results of experiments that measure circuit faithfulness? TL;DR We study three different circuits from the literature and find that measurements of their faithfulness are highly dependent on details of the experimental methodology. The IOI and Docstring circuits in particular are much less faithful than reported when tested with a more precise methodology. The correct circuit for a set of prompts is undefined. The type of ablation you use to isolate the circuit determines the task that you are asking the circuit to perform - and therefore also the optimal circuit. This is especially important because previous work in automatic circuit discovery has tested algorithms by their ability to recover these "ground-truth" circuits from the literature - without considering these potential pitfalls and nuances. Case Studies We look at three circuits from the mech interp literature to demonstrate that faithfulness metrics are highly sensitive to the details of experimental setup. Indirect Object Identification Circuit The IOI circuit is the most well known circuit in a language model. It computes completions to prompts of the form: "When Mary and John went to the store, John gave a bottle of milk to ____" The circuit is specified as a graph of important attention heads (nodes) and the interactions between them (edges) as applied to a specific sequence of tokens. The authors report that the circuit explains 87% of the logit difference between the two name tokens. They find this number by passing some inputs to the model and ablating all activations outside of the circuit. Then they measure how much of the logit difference between the correct and incorrect name logits remains. However, an important detail is that they arrived at this number by ablating the nodes (heads) outside of the circuit, not by ablating the edges (interactions between heads) outside of the circuit. So they don't ablate, for example, the edges from the previous token heads to the name mover heads, even though these are not part of the circuit (effectively including more edges in the circuit). We calculate the logit difference recovered (defined below) when we ablate the edges outside of the circuit instead. They ablate the heads by replacing their activations with the mean value calculated over the "ABC distribution", in which the names in the prompts are replaced by random names.[3] In our experiments, we also try resampling the activations from different prompts (taking individual prompt activations instead of averaging). The first thing that jumps out from the box plots above is the very large range of results from different prompts. The charts here are cut off and some points are over 10,000%. This means that although the average logit difference recovered is reasonable, few prompts actually have a logit difference recovered close to 100%. 
And we see that ablating the edges instead of the nodes gives a much higher average logit difference recovered - close to 150% (which means that the isolated circuit has a greater logit difference between the correct and incorrect names than the un-ablated model). So the edge-based circuit they specified is much less faithful than the node-based circuit they tested. The authors calculate the 87% result as the ratio of the expected difference (over a set of prompts) in the ...
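For concreteness, here is one common way a "logit difference recovered" style metric is computed: compare the ablated run against the clean run and a fully-corrupted baseline. The exact normalization differs between papers (the post defines its own version), so treat this as an illustrative sketch rather than the authors' definition.

```python
# Sketch of a logit-difference-recovered metric for IOI-style prompts.

import torch

def logit_diff(logits: torch.Tensor, correct_tok: int, incorrect_tok: int) -> torch.Tensor:
    """Correct-minus-incorrect name logit at the final position, averaged over the batch."""
    final = logits[:, -1, :]                      # [batch, vocab]
    return (final[:, correct_tok] - final[:, incorrect_tok]).mean()

def logit_diff_recovered(clean: torch.Tensor,
                         ablated: torch.Tensor,
                         corrupted: torch.Tensor,
                         correct_tok: int,
                         incorrect_tok: int) -> float:
    """Percentage of the clean-vs-corrupted logit difference kept after ablation."""
    ld_clean = logit_diff(clean, correct_tok, incorrect_tok)
    ld_ablated = logit_diff(ablated, correct_tok, incorrect_tok)
    ld_corrupted = logit_diff(corrupted, correct_tok, incorrect_tok)
    return (100 * (ld_ablated - ld_corrupted) / (ld_clean - ld_corrupted)).item()
```

Values above 100% (like the ~150% reported above for edge ablation) mean the isolated circuit separates the names even more strongly than the un-ablated model, which is exactly why a single averaged number can be misleading.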
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How ARENA course material gets made, published by CallumMcDougall on July 3, 2024 on LessWrong. TL;DR In this post, I describe my methodology for building new material for ARENA. I'll mostly be referring to the exercises on IOI, Superposition and Function Vectors as case studies. I expect this to be useful for people who are interested in designing material for ARENA or ARENA-like courses, as well as people who are interested in pedagogy or ML paper replications. The process has 3 steps: 1. Start with something concrete 2. First pass: replicate, and understand 3. Second pass: exercise-ify Summary I'm mostly basing this on the following 3 sets of exercises: Indirect Object Identification - these exercises focus on the IOI paper (from Conmy et al). The goal is to have people understand what exploratory analysis of transformers looks like, and introduce the key ideas of the circuits agenda. Superposition & SAEs - these exercises focus on understanding superposition and the agenda of dictionary learning (specifically sparse autoencoders). Most of the exercises explore Anthropic's Toy Models of Superposition paper, except for the last 2 sections which explore sparse autoencoders (firstly by applying them to the toy model setup, secondly by exploring a sparse autoencoder trained on a language model). Function Vectors - these exercises focus on the Function Vectors paper by David Bau et al, although they also make connections with related work such as Alex Turner's GPT2-XL steering vector work. These exercises were interesting because they also had the secondary goal of being an introduction to the nnsight library, in much the same way that the intro to mech interp exercises were also an introduction to TransformerLens. The steps I go through are listed below. I'm indexing from zero because I'm a software engineer so of course I am. The steps assume you already have an idea of what exercises you want to create; in Appendix (1) you can read some thoughts on what makes for a good exercise set. 1. Start with something concrete When creating material, you don't want to be starting from scratch. It's useful to have source code available to browse - bonus points if that takes the form of a Colab or something which is self-contained and has easily visible output. IOI - this was Neel's "Exploratory Analysis Demo" exercises. The rest of the exercises came from replicating the paper directly. Superposition - this was Anthropic's Colab notebook (although the final version went quite far beyond this). The very last section (SAEs on transformers) was based on Neel Nanda's demo Colab. Function Vectors - I started with the NDIF demo notebook, to show how some basic nnsight syntax worked. As for replicating the actual function vectors paper, unlike the other 2 examples I was mostly just working from the paper directly. It helped that I was collaborating with some of this paper's authors, so I was able to ask them some questions to clarify aspects of the paper. 2. First-pass: replicate, and understand The first thing I did in each of these cases was to go through the material I started with, and make sure I understood what was going on.
Paper replication is a deep enough topic for its own series of blog posts (many already exist), although I'll emphasise that I'm not usually talking about full paper replication here, because ideally you'll be starting from something a bit further along, be that a Colab, a different tutorial, or something else. And even when you are just working directly from a paper, you shouldn't make the replication any harder for yourself than you need to. If there's code you can take from somewhere else, then do. My replication usually takes the form of working through a notebook in VSCode. I'll either start from scratch, or from a downloaded Colab if I'm using one as a ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Attention Output SAEs Improve Circuit Analysis, published by Connor Kissane on June 21, 2024 on The AI Alignment Forum. This is the final post of our Alignment Forum sequence produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort. Executive Summary In a previous post we trained Attention Output SAEs on every layer of GPT-2 Small. Following that work, we wanted to stress-test that Attention SAEs were genuinely helpful for circuit analysis research. This would both validate SAEs as a useful tool for mechanistic interpretability researchers, and provide evidence that they are identifying the real variables of the model's computation. We believe that we now have evidence that attention SAEs can: Help make novel mechanistic interpretability discoveries that prior methods could not make. Allow for tracing information through the model's forward passes on arbitrary prompts. In this post we discuss the three outputs from this circuit analysis work: 1. We use SAEs to deepen our understanding of the IOI circuit. It was previously thought that the indirect object's name was identified by tracking the names' positions, whereas we find that instead the model tracks whether names are before or after "and". This was not noticed in prior work, but is obvious with the aid of SAEs. 2. We introduce "recursive direct feature attribution" (recursive DFA) and release an Attention Circuit Explorer tool for circuit analysis on GPT-2 Small (Demo 1 and Demo 2). One of the nice aspects of attention is that attention heads are linear when freezing the appropriate attention patterns. As a result, we can identify which source tokens triggered the firing of a feature. We can perform this recursively to track backwards through both attention and residual stream SAE features in models. We also announce a $1,000 bounty for whoever can produce the most interesting example of an attention feature circuit by 07/15/24 as subjectively assessed by the authors. See the section "Even cooler examples" for more details on the bounty. 3. We open source HookedSAETransformer to SAELens, which makes it easy to splice in SAEs during a forward pass and cache + intervene on SAE features. Get started with this demo notebook. Introduction With continued investment into dictionary learning research, there still remains a concerning lack of evidence that SAEs are useful interpretability tools in practice. Further, while SAEs clearly find interpretable features (Cunningham et al.; Bricken et al.), it's not obvious that these features are true causal variables used by the model. In this post we address these concerns by applying our GPT-2 Small Attention SAEs to improve circuit analysis research. We start by using our SAEs to deepen our understanding of the IOI task. The first step is evaluating if our SAEs are sufficient for the task. We "splice in" our SAEs at each layer, replacing attention layer outputs with their SAE reconstructed activations, and study how this affects the model's ability to perform the task - if crucial information is lost by the SAE, then they will be a poor tool for analysis. At their best, we find that SAEs at the early-middle layers almost fully recover model performance, allowing us to leverage these to answer a long-standing open question and discover novel insights about IOI.
However, we also find that our SAEs at the later layers (and layer 0) damage the model's ability to perform the task, suggesting we'll need more progress in the science and scaling of SAEs before we can analyze a full end-to-end feature circuit. We then move beyond IOI and develop a visualization tool (link) to explore attention feature circuits on arbitrary prompts, introducing a new technique called recursive DFA. This technique exploits the fact that transformers are almost linear i...
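An illustrative sketch of the "splice in" test described above, written with plain PyTorch forward hooks rather than the SAELens/HookedSAETransformer API (whose exact interface is not reproduced here): replace one module's output with its SAE reconstruction, then check how much of the task metric survives. The `sae.encode`/`sae.decode` methods and the module path in the usage note are assumptions for the sake of the example.

```python
# Splice an SAE reconstruction into a forward pass via a hook, to test whether the
# SAE preserves the information the task needs.

import torch

def splice_sae(module: torch.nn.Module, sae):
    """Register a hook that swaps the module's output for its SAE reconstruction.

    Returns the hook handle; call .remove() on it to restore normal behavior.
    """
    def hook(_module, _inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return sae.decode(sae.encode(output))
    return module.register_forward_hook(hook)

# Usage sketch (module path is hypothetical):
# handle = splice_sae(model.blocks[5].attn_out, sae_layer5)
# spliced_logits = model(ioi_tokens)
# handle.remove()
# Compare the IOI logit difference of `spliced_logits` against the un-spliced run;
# a large drop means the SAE is discarding information the task depends on.
```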
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAEs Discover Meaningful Features in the IOI Task, published by Alex Makelov on June 5, 2024 on The AI Alignment Forum. TLDR: recently, we wrote a paper proposing several evaluations of SAEs against "ground-truth" features computed w/ supervision for a given task (in our case, IOI [1]). However, we didn't optimize the SAEs much for performance in our tests. After putting the paper on arxiv, Alex carried out a more exhaustive search for SAEs that do well on our test for controlling (a.k.a. steering) model output with SAE features. The results show that: SAEs trained on IOI data find interpretable features that come close to matching supervised features (computed with knowledge of the IOI circuit) for the task of editing representations to steer the model. Gated SAEs outperform vanilla SAEs across the board for steering. SAE training metrics like sparsity and loss recovered significantly correlate with how good representation edits are. In particular, sparsity is more strongly correlated than loss recovered. Partial Paper Recap: Towards More Objective SAE Evals Motivation: SAE Evals Are Too Indirect We train SAEs with the goal of finding the true features in LLM representations - but currently, "true features" is more of a vague direction than a well-defined concept in mech interp research. SAE evaluations mostly use indirect measures of performance - ones we hope correlate with the features being the "true" ones, such as the ℓ0 (sparsity) loss, the LLM loss recovered when using SAE reconstructions, and how interpretable the features are. This leaves a big gap in our understanding of the usefulness of SAEs and similar unsupervised methods; it also makes it hard to objectively compare different SAE architectures and/or training algorithms. So, we wanted to develop more objective SAE evaluations, by benchmarking SAEs against features that we know to be meaningful through other means, even if in a narrow context. We chose the IOI task, as it's perhaps the most well-studied example of a non-trivial narrow capability in a real-world LLM (GPT2-Small). We set out to compute a "skyline" for SAE performance: an object of the same "type" as an SAE - a "sparse feature dictionary" - which is constructed and validated "by hand" using our very precise knowledge about IOI. Such an object would allow us to evaluate how close a given SAE is to the limit of what's afforded by its representational power. The IOI circuit (copy of Figure 2 from the IOI paper [1]).
Creating Our Own Feature Dictionaries for the IOI Task With Supervision Following the prior work by Wang et al [1] that discovered the IOI circuit, we conjectured that internal LLM activations for an IOI prompt p (e.g., "When Mary and John went to the store, John gave a book to") can be described using the following three attributes: IO(p): the indirect object token (" Mary" in our example) S(p): the subject token (" John" in our example) Pos(p): whether the IO token comes first or second in the sentence (1st in our example; the alternative would be "When John and Mary went...") And indeed, we found that intermediate activations of the model at a given site (e.g., the output of some attention head) for a prompt p can be approximated as[1] $\mathrm{activation}(p) \approx \mathbb{E}_{p' \sim \mathrm{IOI}}[\mathrm{activation}(p')] + v_{IO=IO(p)} + v_{S=S(p)} + v_{Pos=Pos(p)}$, where the vectors $v_{IO=\dots}, \dots$ form a "supervised sparse feature dictionary" that we construct using our prior knowledge about the IOI circuit[2]. In fact, these vectors can be chosen in a very simple way as the (centered) conditional mean, e.g. $v_{IO=\text{Mary}} = \mathbb{E}_{p \sim \mathrm{IOI}}[\mathrm{activation}(p) \mid IO(p) = \text{" Mary"}] - \mathbb{E}_{p \sim \mathrm{IOI}}[\mathrm{activation}(p)]$. Not just that, but we can use these vectors for editing individual attributes' values in internal model states in a natural way via feature arithmetic, e.g. to change the IO from " Mary" to " Mike", we can use the activation $a_{\text{edit}}$...
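A minimal sketch of the supervised dictionary construction just described: each value of an attribute (IO name, S name, Pos) gets a vector equal to the centered conditional mean of activations, and an attribute is edited by feature arithmetic. Shapes and the label extraction are illustrative, not the paper's code.

```python
# Supervised "feature dictionary" vectors as centered conditional means, plus a
# feature-arithmetic edit of a single attribute.

import torch
from collections import defaultdict

def centered_conditional_means(acts: torch.Tensor, labels: list) -> dict:
    """v_value = E[act | attribute = value] - E[act], for one attribute (e.g. IO).

    acts: [num_prompts, d_model] activations at one site; labels: attribute value per prompt.
    """
    grand_mean = acts.mean(dim=0)
    by_value = defaultdict(list)
    for act, value in zip(acts, labels):
        by_value[value].append(act)
    return {v: torch.stack(a).mean(dim=0) - grand_mean for v, a in by_value.items()}

def edit_attribute(act: torch.Tensor, dictionary: dict, old: str, new: str) -> torch.Tensor:
    """Steer one attribute, e.g. IO from ' Mary' to ' Mike', by subtracting and adding vectors."""
    return act - dictionary[old] + dictionary[new]
```

Benchmarking an SAE then amounts to asking how close its learned features come to these supervised vectors when used for the same steering edits.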
Guest Emmy Tsang Panelist Richard Littauer Show Notes In this episode of Sustain, host Richard Littauer welcomes Emmy Tsang, the Engagement Lead at Invest in Open Infrastructure (IOI). Emmy introduces the mission of IOI, which focuses on increasing investment in and adoption of open infrastructure to promote equitable access and participation in research. The discussion delves into what constitutes 'open infrastructure,' the need for nuanced definitions, and IOI's efforts in providing evidence-based tools, strategic support, and funding pilots within the space. Emmy also highlights IOI's inaugural 'State of Open Infrastructure 2024' report, set to serve as an annual resource for understanding the open infrastructure landscape. They discuss the report's contents, including analysis of funding, governance trends, and policies affecting open infrastructure, and Emmy invites feedback from the community to improve future iterations of the report. Press download to hear more! [00:01:04] Emmy explains IOI and how it provides tools and recommendations, strategic support, and runs funding pilots. [00:02:14] There's a discussion on the growth of the IOI team and the importance of a global perspective, as well as an explanation of IOI's funding and fiscal sponsorship by Code for Science and Society. [00:03:47] Emmy explains open infrastructure as a spectrum and the importance of context and mentions the five criteria for defining open infrastructure. [00:07:37] Richard asks Emmy for clarification on the definition of infrastructure in the context of open infrastructure. She offers a broader definition, as the services and technologies relied upon by researchers and scholars, and gives an example. [00:10:34] Richard questions how IOI integrates community feedback into their work. Emmy explains IOI's privileged position to consider open infrastructure at an ecosystem level, mentions the Infra Finder tool for open infrastructure discovery, and her role as an engagement person. She also mentions shifting power in funding decisions and increasing accessibility of funding to low and middle-income economies. [00:15:32] The "State of Open Infrastructure 2024" report will launch on May 28th. Emmy discusses the topics covered in the report, explains how they used their Infra Finder tool, and notes that the data from the report will be shared openly via Zenodo. [00:19:38] Richard appreciates the scope and ambition of the report and wonders about the primary audience of the report and its relevance to open source maintainers. We learn the report is targeted at funders, but also relevant to maintainers and developers of open infrastructures. [00:25:16] Emmy explains how they reach out to potential infrastructures and encourage storytelling through their work and engagement. She explains the unique perspective IOI brings to the concept of infrastructure and emphasizes the importance of defining success and sustainability for open infrastructure. Also, she mentions the "Graceful Transitions" section in the report, highlighting organizational changes in infrastructures. [00:30:18] Richard agrees on the need for personal and emotional discussions about transitions in open source projects. Emmy invites listeners to participate in community conversations about the report's chapters and shares details on the upcoming community conversations and how to join the mailing list for updates. [00:32:40] Find out where you can read the report and follow Emmy on the interwebs.
Quotes [00:04:14 ] “We find it easy to put open into really clear binaries, you're open or not open, etc, etc.” [00:04:43] “Most of the time these binaries don't really make sense.” [00:06:29] “We're viewing open infrastructure more as a spectrum.” [00:26:42] “What does success and sustainability mean for open infrastructure?” Spotlight [00:34:24] Richard's spotlight is the book, The (Big)Year That Flew By by Arjan Dwarshuis [00:35:10] Emmy's spotlight is the Digital Infrastructure Insights Fund. Links SustainOSS (https://sustainoss.org/) SustainOSS Bluesky (https://bsky.app/profile/sustainoss.bsky.social) SustainOSS Discourse (https://discourse.sustainoss.org/) podcast@sustainoss.org (mailto:podcast@sustainoss.org) SustainOSS Mastodon (https://mastodon.social/tags/sustainoss) Open Collective-SustainOSS (Contribute) (https://opencollective.com/sustainoss) Richard Littauer Socials (https://www.burntfen.com/2023-05-30/socials) Richard Littauer email (mailto:richard.littauer@gmail.com) Richard Littauer SustainOSS email (mailto:richard@sustainoss.org) Emmy Tsang X/Twitter (https://twitter.com/emmy_ft) Emmy Tsang Mastodon (https://mastodon.cloud/@emmyft) Emmy Tsang LinkedIn (https://www.linkedin.com/in/emmy-tsang-11aa793b/) Sustain Podcast-Episode 43: Investing in Open Infrastructure with Kaitlin Thaney (https://podcast.sustainoss.org/guests/kaitlin-thaney) Invest in Open Infrastructure (https://investinopen.org/) Code for Science & Society (https://www.codeforsociety.org/) ggplot2 (https://ggplot2.tidyverse.org/) The Astropy Project (https://www.astropy.org/) Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure (https://www.fordfoundation.org/work/learning/research-reports/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure/) Infra Finder (https://infrafinder.investinopen.org/solutions) Call for proposals: Open Infrastructure Fund by Emmy Tsang (https://investinopen.org/blog/open-infrastructure-fund-pilot-cfp/) 2024 State of Open Infrastructure Report (https://investinopen.org/state-of-open-infrastructure-2024/sooi-foreword-2024/) Zenodo (https://zenodo.org/) Arjan Dwarshuis (https://arjandwarshuis.com/about) Digital Infrastructure Insights Fund (https://infrastructureinsights.fund/) Credits Produced by Richard Littauer (https://www.burntfen.com/) Edited by Paul M. Bahr at Peachtree Sound (https://www.peachtreesound.com/) Show notes by DeAnn Bahr Peachtree Sound (https://www.peachtreesound.com/) Special Guest: Emmy Tsang.
How do you prepare for and win gold at a programming olympiad? How do you get a job in Silicon Valley without knowing practical programming languages? Which ideas in programming and mathematics are the most "beautiful"? Today we spoke with Bakhytzhan Baizhikenov, a legend of competitive programming, winner of a gold, a silver, and two bronze medals at the International Olympiad in Informatics (IOI), and a software engineer at Facebook. Bakhytzhan told us about his path in programming, starting with his fascination in his school years and his first successes at national and international olympiads. He also shared the story of his first job in Silicon Valley and his experience working at Facebook. Special attention in our conversation was given to "beautiful" ideas in mathematics and programming, such as graph theory, geometry, string algorithms, data structures, and game theory. In closing, Bakhytzhan shared his dreams, his methods of self-development, and the ideas of Sadhguru. Enjoy watching! Арман Сулейменов: https://www.instagram.com/armansu/ Бахытжан Байжикенов: https://www.facebook.com/bahakz/?locale=ru_RU Producer, Данияр Ахметжанов: https://www.instagram.com/good.years/
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How To Do Patching Fast, published by Joseph Miller on May 14, 2024 on LessWrong. This post outlines an efficient implementation of Edge Patching that massively outperforms common hook-based implementations. This implementation is available to use in my new library, AutoCircuit, and was first introduced by Li et al. (2023).
What is activation patching?
I introduce new terminology to clarify the distinction between different types of activation patching.
Node Patching
Node Patching (a.k.a. "normal" activation patching) is when some activation in a neural network is altered from the value computed by the network to some other value. For example, we could run two different prompts through a language model and replace the output of Attn 1 on some input 1 with the output of that same head on some other input 2. We will use the running example of a tiny, 1-layer transformer, but this approach generalizes to any transformer and any residual network. All the nodes downstream of Attn 1 will be affected by the patch.
Edge Patching
If we want to make a more precise intervention, we can think about the transformer differently, to isolate the interactions between components. Now we can patch the edge Attn 1 -> MLP and only nodes downstream of MLP will be affected (e.g. Attn 1 -> Output is unchanged). Edge Patching has not been explicitly named in any prior work.
Path Patching
Path Patching refers to the intervention where an input to a path is replaced in the 'treeified' view of the model. The treeified view is a third way of thinking about the model where we separate each path from input to output. We can implement an equivalent intervention to the previous diagram as follows: In the IOI paper, 'Path Patching' the edge Component 1 -> Component 2 means Path Patching all paths of the form where all components between Component 1 and Component 2 are MLPs[1]. However, it can be easy to confuse Edge Patching and Path Patching because if we instead patch all paths of the form this is equivalent to Edge Patching the edge Component 1 -> Component 2. Edge Patching all of the edges which have some node as source is equivalent to Node Patching that node. AutoCircuit does not implement Path Patching, which is much more expensive in general. However, as explained in the appendix, Path Patching is sometimes equivalent to Edge Patching.
Fast Edge Patching
We perform two steps. First we gather the activations that we want to patch into the model. There are many ways to do this, depending on what type of patching you want to do. If we just want to do zero ablation, then we don't need to even run the model. But let's assume we want to patch in activations from a different, corrupt input. We create a tensor, Patch Activations, to store the outputs of the source of each edge and we write to the tensor during the forward pass. Each source component has a row in the tensor, so the shape is [n_sources, batch, seq, d_model].[2] Now we run the forward pass in which we actually do the patching. We write the outputs of each edge source to a different tensor, Current Activations, of the same shape as Patch Activations. When we get to the input of the destination component of the edge we want to patch, we add the difference between the rows of Patch Activations and Current Activations corresponding to the edge's source component output. 
This works because the difference in input to the edge destination is equal to the difference in output of the source component.[3] Now it's straightforward to extend this to patching multiple edges at once by subtracting the entire Current Activations tensor from the entire Patch Activations tensor and multiplying by a Mask tensor of shape [n_sources] that has a single value for each input edge. By creating a Mask tensor for each destination node w...
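Here is a minimal sketch of the masked patching step described above, assuming the gather pass has already filled the Patch Activations tensor and that we are inside the patched forward pass; the function and tensor names are illustrative and are not AutoCircuit's actual API.

```python
import torch

def patch_dest_input(dest_input, patch_acts, current_acts, mask):
    """Apply edge patches to one destination component's input.

    dest_input:   [batch, seq, d_model]  input the destination would normally receive
    patch_acts:   [n_sources, batch, seq, d_model]  source outputs from the corrupt prompt
    current_acts: [n_sources, batch, seq, d_model]  source outputs from the current pass
    mask:         [n_sources]  1.0 where the edge (source -> this destination) is patched
    """
    # The change in a source's output equals the change it contributes to the
    # destination's input, so adding the masked difference patches those edges.
    diff = patch_acts - current_acts
    return dest_input + (mask.view(-1, 1, 1, 1) * diff).sum(dim=0)
```

With one mask per destination node, patching many edges at once reduces to a single subtraction, broadcasted multiply, and sum per destination, which is roughly where the speed-up over per-hook implementations comes from.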
An airhacks.fm conversation with Alfonso Peterssen (@TheMukel) about: discussion about Alfonso's early programming experience and participation in the IOI competition, studying computer science and functional programming with Martin Odersky, internships at Google and Oracle Labs working on compilers and the Espresso project implementing a JVM in Java, Espresso mentioned in "#208 GraalVM: Meta Circularity on Different Levels", "#194 GraalVM, Apple Silicon (M1) and Clouds", "#167 GraalVM and Java 17, Truffle, Espresso and Native Image" and "#157 The Ingredients of GraalVM", porting LLVM to pure Java in one class, integrating Large Language Models (LLMs) in Java by porting the Llama model from C to Java, GPU acceleration with tornadovm, TornadoVM appeared at "#282 TornadoVM, Paravox.ai: Java, AI, LLMs and Hardware Acceleration", performance of the Java port being within 10% of the C versions, potential huge opportunities for integrating AI and LLMs with enterprise Java systems for use cases like fraud detection, the Java port being a 1,000 line self-contained implementation with no external dependencies, the need for more resources and support to further develop the Java LLM integration, the llama2.java project. Alfonso Peterssen on twitter: @TheMukel
No Priors: Artificial Intelligence | Machine Learning | Technology | Startups
Scott Wu loves code. He grew up competing in the International Olympiad in Informatics (IOI) and is a world-class coder, and now he's building an AI agent designed to create more, not fewer, human engineers. This week on No Priors, Sarah and Elad talk to Scott, the co-founder and CEO of Cognition, an AI lab focusing on reasoning. Recently, the Cognition team released a demo of Devin, an AI software engineer that can increasingly handle entire tasks end to end. In this episode, they talk about why the team built Devin with a UI that mimics looking over another engineer's shoulder as they work and how this transparency makes for a better result. Scott discusses why he thinks Devin will make it possible for there to be more human engineers in the world, and what will be important for software engineers to focus on as these roles evolve. They also get into how Scott thinks about building the Cognition team and that they're just getting started. Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @ScottWu46 Show Notes: (0:00) Introduction (1:12) IOI training and community (6:39) Cognition's founding team (8:20) Meet Devin (9:17) The discourse around Devin (12:14) Building Devin's UI (14:28) Devin's strengths and weaknesses (18:44) The evolution of coding agents (22:43) Tips for human engineers (26:48) Hiring at Cognition
This is a free preview of a paid episode. To hear more, visit thedailybriefing.io. I'm answering your questions from Substack! An email will go out each week asking for your questions, so make sure to respond and we'll pick the best ones to answer each week. In this edition, we talk about Pochettino's future at Chelsea, Mo Salah's current situation with Liverpool and the likelihood of David Moyes exiting West Ham. The full podcast episod…
Access 2 Perspectives – Conversations. All about Open Science Communication
Jerry Sellanga currently works on the Invest in Open Infrastructure (IOI) team as Engagement Coordinator, Networks, where he uses his expertise to expand IOI's audiences through events, digital communications, and stakeholder management, especially in Africa. He specializes in strategic communications, graphic design, proposal development, and public relations. Jerry holds a Bachelor's degree in Environmental Studies (Community Development) from Kenyatta University, Kenya as well as a post-graduate diploma in Public Relations. He is currently undertaking a Masters in Development Communications from Daystar University. Before joining IOI, Jerry held senior roles in communication and marketing in the wildlife conservation, agriculture, and social impact sectors. In his spare time, Jerry is passionate about sports and spending time outdoors hiking. Find more podcast episodes here: https://access2perspectives.pubpub.org/podcast Host: Dr Jo Havemann, ORCID iD 0000-0002-6157-1494 Editing: Ebuka Ezeike Music: Alex Lustig, produced by Kitty Kat License: Attribution 4.0 International (CC BY 4.0) At Access 2 Perspectives, we guide you in your complete research workflow toward state-of-the-art research practices and in full compliance with funding and publishing requirements. Leverage your research projects to higher efficiency and increased collaboration opportunities while fostering your explorative spirit and joy. Website: https://access2perspectives.pubpub.org --- Send in a voice message: https://podcasters.spotify.com/pod/show/access2perspectives/message
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ophiology (or, how the Mamba architecture works), published by Danielle Ensign on April 9, 2024 on LessWrong. The following post was made as part of Danielle's MATS work on doing circuit-based mech interp on Mamba, mentored by Adrià Garriga-Alonso. It's the first in a sequence of posts about finding an IOI circuit in Mamba/applying ACDC to Mamba. This introductory post was also made in collaboration with Gonçalo Paulo.
A new challenger arrives!
Why Mamba?
Promising Scaling
Mamba [1] is a type of recurrent neural network based on state-space models, and is being proposed as an alternative architecture to transformers. It is the result of years of capability research [2] [3] [4] and likely not the final iteration of architectures based on state-space models. In its current form, Mamba has been scaled up to 2.8B parameters on The Pile and on Slimpj, having similar scaling laws when compared to Llama-like architectures.
[Figure: Scaling curves from the Mamba paper: Mamba scaling compared to Llama (Transformer++), previous state space models (S3++), convolutions (Hyena), and a transformer-inspired RNN (RWKV)]
More recently, ai21labs [5] trained a 52B parameter MoE Mamba-Transformer hybrid called Jamba. At inference, this model has 12B active parameters and has benchmark scores comparable to Llama-2 70B and Mixtral.
[Figure: Jamba benchmark scores, from the Jamba paper [5:1]]
Efficient Inference
One advantage of RNNs, and in particular of Mamba, is that the memory required to store the context is constant, as you only need to store the past state of the SSM and of the convolution layers, while it grows linearly for transformers. The same happens with the generation time, where predicting each token scales as O(1) instead of O(context length).
[Figure: Jamba throughput (tokens/second), from the Jamba paper [5:2]]
What are State-space models?
The inspiration for Mamba (and similar models) is an established technique used in control theory called state space models (SSM). SSMs are normally used to represent linear systems that have p inputs, q outputs and n state variables. To keep the notation concise, we will consider the input as an E-dimensional vector x(t) ∈ R^E, an E-dimensional output y(t) ∈ R^E and an N-dimensional latent space h ∈ R^N. In the following, we will note the dimensions of new variables using the notation [X,Y]. In particular, in Mamba 2.8b, E=5120 and N=16. Specifically, we have the following:
[N] h'(t) = [N,N] A [N] h(t) + [N,E] B [E] x(t)
[E] y(t) = [E,N] C [N] h(t) + [E,E] D [E] x(t)
This is an ordinary differential equation (ODE), where h'(t) is the derivative of h(t) with respect to time, t. This ODE can be solved in various ways, which will be described below. In state space models, A is called the state matrix, B is called the input matrix, C is called the output matrix, and D is called the feedthrough matrix.
Solving the ODE
We can write the ODE from above as a recurrence, using discrete timesteps:
[N] h_t = [N,N] Ā [N] h_{t-1} + [N,E] B̄ [E] x_t
[E] y_t = [E,N] C [N] h_t + [E,E] D [E] x_t
where Ā and B̄ are our discretization matrices. Different ways of integrating the original ODE will give different Ā and B̄, but will still preserve this overall form. In the above, t corresponds to discrete time. In language modeling, t refers to the token position. 
Euler method
The simplest way to numerically integrate an ODE is by using the Euler method, which consists in approximating the derivative by considering the ratio between a small variation in h and a small variation in time, h'(t) = dh/dt ≈ Δh/Δt. This allows us to write:
(h_{t+1} − h_t)/Δt = A h_t + B x_t
h_{t+1} = Δt (A h_t + B x_t) + h_t
where the index t of h_t represents the discretized time. This is the same thing that is done when considering a character's position and velocity in a video game, for instance. If a character has a velocity v and a position x_0, to find the position after Δt time we can do x_1 = Δt v + x_0. In general:
x_t = Δt v_t + x_{t-1}
x_t = (...
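To make the discretized recurrence above concrete, here is a minimal NumPy sketch of an Euler-discretized SSM scan. The dimensions follow the [N]/[E] notation from the post, but the random matrices are stand-ins rather than trained Mamba parameters.

```python
import numpy as np

N, E, T, dt = 16, 8, 10, 0.1           # state dim, channel dim, sequence length, step size
A = np.random.randn(N, N) * 0.01       # state matrix
B = np.random.randn(N, E) * 0.01       # input matrix
C = np.random.randn(E, N) * 0.01       # output matrix
D = np.eye(E)                          # feedthrough matrix
x = np.random.randn(T, E)              # input sequence: one E-dimensional vector per step

# Euler discretization: h_{t+1} = dt*(A h_t + B x_t) + h_t  =>  A_bar = I + dt*A, B_bar = dt*B
A_bar = np.eye(N) + dt * A
B_bar = dt * B

h = np.zeros(N)
ys = []
for t in range(T):
    h = A_bar @ h + B_bar @ x[t]       # [N]  h_t = A_bar h_{t-1} + B_bar x_t
    ys.append(C @ h + D @ x[t])        # [E]  y_t = C h_t + D x_t
y = np.stack(ys)                       # [T, E] output sequence
```

Mamba itself uses a different discretization (zero-order hold) and input-dependent parameters, so treat this purely as an illustration of the recurrence's shape.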
Ep#42 March 1st weekend rant. So apparently I had a lot to say, this rant is a little longer than usual. But since I have not been doing regular rants, I spew out some stuff! I address a review that was left for me and I agreed with, and attempt to remedy their complaint. People pleasing? YUP! But if it wasn't for you guys, I would be talking to the wall! Keep those reviews coming and I want to hear from you! Here is my info! Andrew P, Anonymous Andrew Podcast - Life & The Choices We Make. Website: https://www.anonymousandrewpodcast.com Instagram: @anonymousandrewpodcast TikTok: https://www.tiktok.com/@anonymousandrewpodcast Threads: @anonymousandrewpodcast Facebook: facebook.com/anonymousandrewpodcast YouTube: https://www.youtube.com/@anonymousandrewpodcast Linkedin: https://www.linkedin.com/in/andrew-peters-a8a012285/ X: @AAndrewpodcast Social Media Producer: Lyndsey Brown Music by freebeats.io I would like to thank Ellen, Garry, Jamie, Jill and many more from Facebook who have worked with me on the cult episodes. Hopefully many more to come!
We're writing this one day after the monster release of OpenAI's Sora and Gemini 1.5. We covered this on the ThursdAI space, so head over there for our takes. IRL: We're ONE WEEK away from Latent Space: Final Frontiers, the second edition and anniversary of our first ever Latent Space event! Also: join us on June 25-27 for the biggest AI Engineer conference of the year! Online: All three Discord clubs are thriving. Join us every Wednesday/Friday! Almost 12 years ago, while working at Spotify, Erik Bernhardsson built one of the first open source vector databases, Annoy, based on ANN search. He also built Luigi, one of the predecessors to Airflow, which helps data teams orchestrate and execute data-intensive and long-running jobs. Surprisingly, he didn't start yet another vector database company, but instead in 2021 founded Modal, the "high-performance cloud for developers". In 2022 they opened doors to developers after their seed round, and in 2023 announced their GA with a $16m Series A. More importantly, they have won fans among both household names like Ramp, Scale AI, Substack, and Cohere, and newer startups like (upcoming guest!) Suno.ai and individual hackers (Modal was the top tool of choice in the Vercel AI Accelerator). We've covered the nuances of GPU workloads, and how we need new developer tooling and runtimes for them (see our episodes with Chris Lattner of Modular and George Hotz of tiny to start). In this episode, we run through the major limitations of the actual infrastructure behind the clouds that run these models, and how Erik envisions the "postmodern data stack". In his 2021 blog post "Software infrastructure 2.0: a wishlist", Erik had "Truly serverless" as one of his points:
* The word cluster is an anachronism to an end-user in the cloud! I'm already running things in the cloud where there's elastic resources available at any time. Why do I have to think about the underlying pool of resources? Just maintain it for me.
* I don't ever want to provision anything in advance of load.
* I don't want to pay for idle resources. Just let me pay for whatever resources I'm actually using.
* Serverless doesn't mean it's a burstable VM that saves its instance state to disk during periods of idle.
Swyx called this Self Provisioning Runtimes back in the day. Modal doesn't put you in YAML hell, preferring to colocate infra provisioning right next to the code that utilizes it, so you can just add GPU (and disk, and retries…). After 3 years, we finally have a big market push for this: running inference on generative models is going to be the killer app for serverless, for a few reasons:
* AI models are stateless: even in conversational interfaces, each message generation is a fully-contained request to the LLM. There's no knowledge that is stored in the model itself between messages, which means that tear down / spin up of resources doesn't create any headaches with maintaining state.
* Token-based pricing is better aligned with serverless infrastructure than fixed monthly costs of traditional software.
* GPU scarcity makes it really expensive to have reserved instances that are available to you 24/7. It's much more convenient to build with a serverless-like infrastructure.
In the episode we covered a lot more topics like maximizing GPU utilization, why Oracle Cloud rocks, and how Erik has never owned a TV in his life. 
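As a rough illustration of the "infra next to the code" pattern described above, here is a sketch in the style of Modal's Python SDK. The exact names (modal.App, the gpu/retries/timeout parameters, local_entrypoint, .remote) reflect the public SDK as we understand it and may differ between versions, so treat this as indicative rather than canonical.

```python
import modal

app = modal.App("inference-example")          # app name is arbitrary

# Infra is declared next to the code that uses it: no YAML, just decorator arguments.
@app.function(gpu="A10G", retries=2, timeout=600)
def generate(prompt: str) -> str:
    # Placeholder body: a real function would load model weights and run inference here.
    return f"completion for: {prompt!r}"

@app.local_entrypoint()
def main():
    # Runs in a freshly provisioned cloud container with the GPU attached,
    # then scales back to zero when the call finishes.
    print(generate.remote("hello world"))
```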
Enjoy!
Show Notes
* Modal
* ErikBot
* Erik's Blog
* Software Infra 2.0 Wishlist
* Luigi
* Annoy
* Hetzner
* CoreWeave
* Cloudflare FaaS
* Poolside AI
* Modular Inference Engine
Chapters
* [00:00:00] Introductions
* [00:02:00] Erik's OSS work at Spotify: Annoy and Luigi
* [00:06:22] Starting Modal
* [00:07:54] Vision for a "postmodern data stack"
* [00:10:43] Solving container cold start problems
* [00:12:57] Designing Modal's Python SDK
* [00:15:18] Self-Provisioning Runtime
* [00:19:14] Truly Serverless Infrastructure
* [00:20:52] Beyond model inference
* [00:22:09] Tricks to maximize GPU utilization
* [00:26:27] Differences in AI and data science workloads
* [00:28:08] Modal vs Replicate vs Modular and lessons from Heroku's "graduation problem"
* [00:34:12] Creating Erik's clone "ErikBot"
* [00:37:43] Enabling massive parallelism across thousands of GPUs
* [00:39:45] The Modal Sandbox for agents
* [00:43:51] Thoughts on the AI Inference War
* [00:49:18] Erik's best tweets
* [00:51:57] Why buying hardware is a waste of money
* [00:54:18] Erik's competitive programming backgrounds
* [00:59:02] Why does Sweden have the best Counter Strike players?
* [00:59:53] Never owning a car or TV
* [01:00:21] Advice for infrastructure startups
Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:14]: Hey, and today we have in the studio Erik Bernhardsson from Modal. Welcome. Erik [00:00:19]: Hi. It's awesome being here. Swyx [00:00:20]: Yeah. Awesome seeing you in person. I've seen you online for a number of years as you were building on Modal and I think you're just making a San Francisco trip just to see people here, right? I've been to like two Modal events in San Francisco here. Erik [00:00:34]: Yeah, that's right. We're based in New York, so I figured sometimes I have to come out to capital of AI and make a presence. Swyx [00:00:40]: What do you think is the pros and cons of building in New York? Erik [00:00:45]: I mean, I never built anything elsewhere. I lived in New York the last 12 years. I love the city. Obviously, there's a lot more stuff going on here and there's a lot more customers and that's why I'm out here. I do feel like for me, where I am in life, I'm a very boring person. I kind of work hard and then I go home and hang out with my kids. I don't have time to go to events and meetups and stuff anyway. In that sense, New York is kind of nice. I walk to work every morning. It's like five minutes away from my apartment. It's very time efficient in that sense. Yeah. Swyx [00:01:10]: Yeah. It's also a good life. So we'll do a brief bio and then we'll talk about anything else that people should know about you. Actually, I was surprised to find out you're from Sweden. You went to college in KTH and your master's was in implementing a scalable music recommender system. Yeah. Erik [00:01:27]: I had no idea. Yeah. So I actually studied physics, but I grew up coding and I did a lot of programming competition and then as I was thinking about graduating, I got in touch with an obscure music streaming startup called Spotify, which was then like 30 people. And for some reason, I convinced them, why don't I just come and write a master's thesis with you and I'll do some cool collaborative filtering, despite not knowing anything about collaborative filtering really. But no one knew anything back then. 
So I spent six months at Spotify basically building a prototype of a music recommendation system and then turned that into a master's thesis. And then later when I graduated, I joined Spotify full time.Swyx [00:02:00]: So that was the start of your data career. You also wrote a couple of popular open source tooling while you were there. Is that correct?Erik [00:02:09]: No, that's right. I mean, I was at Spotify for seven years, so this is a long stint. And Spotify was a wild place early on and I mean, data space is also a wild place. I mean, it was like Hadoop cluster in the like foosball room on the floor. It was a lot of crude, like very basic infrastructure and I didn't know anything about it. And like I was hired to kind of figure out data stuff. And I started hacking on a recommendation system and then, you know, got sidetracked in a bunch of other stuff. I fixed a bunch of reporting things and set up A-B testing and started doing like business analytics and later got back to music recommendation system. And a lot of the infrastructure didn't really exist. Like there was like Hadoop back then, which is kind of bad and I don't miss it. But I spent a lot of time with that. As a part of that, I ended up building a workflow engine called Luigi, which is like briefly like somewhat like widely ended up being used by a bunch of companies. Sort of like, you know, kind of like Airflow, but like before Airflow. I think it did some things better, some things worse. I also built a vector database called Annoy, which is like for a while, it was actually quite widely used. In 2012, so it was like way before like all this like vector database stuff ended up happening. And funny enough, I was actually obsessed with like vectors back then. Like I was like, this is going to be huge. Like just give it like a few years. I didn't know it was going to take like nine years and then there's going to suddenly be like 20 startups doing vector databases in one year. So it did happen. In that sense, I was right. I'm glad I didn't start a startup in the vector database space. I would have started way too early. But yeah, that was, yeah, it was a fun seven years as part of it. It was a great culture, a great company.Swyx [00:03:32]: Yeah. Just to take a quick tangent on this vector database thing, because we probably won't revisit it but like, has anything architecturally changed in the last nine years?Erik [00:03:41]: I'm actually not following it like super closely. I think, you know, some of the best algorithms are still the same as like hierarchical navigable small world.Swyx [00:03:51]: Yeah. HNSW.Erik [00:03:52]: Exactly. I think now there's like product quantization, there's like some other stuff that I haven't really followed super closely. I mean, obviously, like back then it was like, you know, it's always like very simple. It's like a C++ library with Python bindings and you could mmap big files and into memory and like they had some lookups. I used like this kind of recursive, like hyperspace splitting strategy, which is not that good, but it sort of was good enough at that time. But I think a lot of like HNSW is still like what people generally use. Now of course, like databases are much better in the sense like to support like inserts and updates and stuff like that. I know I never supported that. Yeah, it's sort of exciting to finally see like vector databases becoming a thing.Swyx [00:04:30]: Yeah. Yeah. 
And then maybe one takeaway on most interesting lesson from Daniel Ek?Erik [00:04:36]: I mean, I think Daniel Ek, you know, he started Spotify very young. Like he was like 25, something like that. And that was like a good lesson. But like he, in a way, like I think he was a very good leader. Like there was never anything like, no scandals or like no, he wasn't very eccentric at all. It was just kind of like very like level headed, like just like ran the company very well, like never made any like obvious mistakes or I think it was like a few bets that maybe like in hindsight were like a little, you know, like took us, you know, too far in one direction or another. But overall, I mean, I think he was a great CEO, like definitely, you know, up there, like generational CEO, at least for like Swedish startups.Swyx [00:05:09]: Yeah, yeah, for sure. Okay, we should probably move to make our way towards Modal. So then you spent six years as CTO of Better. You were an early engineer and then you scaled up to like 300 engineers.Erik [00:05:21]: I joined as a CTO when there was like no tech team. And yeah, that was a wild chapter in my life. Like the company did very well for a while. And then like during the pandemic, yeah, it was kind of a weird story, but yeah, it kind of collapsed.Swyx [00:05:32]: Yeah, laid off people poorly.Erik [00:05:34]: Yeah, yeah. It was like a bunch of stories. Yeah. I mean, the company like grew from like 10 people when I joined at 10,000, now it's back to a thousand. But yeah, they actually went public a few months ago, kind of crazy. They're still around, like, you know, they're still, you know, doing stuff. So yeah, very kind of interesting six years of my life for non-technical reasons, like I managed like three, four hundred, but yeah, like learning a lot of that, like recruiting. I spent all my time recruiting and stuff like that. And so managing at scale, it's like nice, like now in a way, like when I'm building my own startup. It's actually something I like, don't feel nervous about at all. Like I've managed a scale, like I feel like I can do it again. It's like very different things that I'm nervous about as a startup founder. But yeah, I started Modal three years ago after sort of, after leaving Better, I took a little bit of time off during the pandemic and, but yeah, pretty quickly I was like, I got to build something. I just want to, you know. Yeah. And then yeah, Modal took form in my head, took shape.Swyx [00:06:22]: And as far as I understand, and maybe we can sort of trade off questions. So the quick history is started Modal in 2021, got your seed with Sarah from Amplify in 2022. You just announced your Series A with Redpoint. That's right. And that brings us up to mostly today. Yeah. Most people, I think, were expecting you to build for the data space.Erik: But it is the data space.Swyx:: When I think of data space, I come from like, you know, Snowflake, BigQuery, you know, Fivetran, Nearby, that kind of stuff. And what Modal became is more general purpose than that. Yeah.Erik [00:06:53]: Yeah. I don't know. It was like fun. I actually ran into like Edo Liberty, the CEO of Pinecone, like a few weeks ago. And he was like, I was so afraid you were building a vector database. No, I started Modal because, you know, like in a way, like I work with data, like throughout my most of my career, like every different part of the stack, right? 
Like I thought everything like business analytics to like deep learning, you know, like building, you know, training neural networks, the scale, like everything in between. And so one of the thoughts, like, and one of the observations I had when I started Modal or like why I started was like, I just wanted to make, build better tools for data teams. And like very, like sort of abstract thing, but like, I find that the data stack is, you know, full of like point solutions that don't integrate well. And still, when you look at like data teams today, you know, like every startup ends up building their own internal Kubernetes wrapper or whatever. And you know, all the different data engineers and machine learning engineers end up kind of struggling with the same things. So I started thinking about like, how do I build a new data stack, which is kind of a megalomaniac project, like, because you kind of want to like throw out everything and start over.Swyx [00:07:54]: It's almost a modern data stack.Erik [00:07:55]: Yeah, like a postmodern data stack. And so I started thinking about that. And a lot of it came from like, like more focused on like the human side of like, how do I make data teams more productive? And like, what is the technology tools that they need? And like, you know, drew out a lot of charts of like, how the data stack looks, you know, what are different components. And it shows actually very interesting, like workflow scheduling, because it kind of sits in like a nice sort of, you know, it's like a hub in the graph of like data products. But it was kind of hard to like, kind of do that in a vacuum, and also to monetize it to some extent. I got very interested in like the layers below at some point. And like, at the end of the day, like most people have code to have to run somewhere. So I think about like, okay, well, how do you make that nice? Like how do you make that? And in particular, like the thing I always like thought about, like developer productivity is like, I think the best way to measure developer productivity is like in terms of the feedback loops, like how quickly when you iterate, like when you write code, like how quickly can you get feedback. And at the innermost loop, it's like writing code and then running it. And like, as soon as you start working with the cloud, like it's like takes minutes suddenly, because you have to build a Docker container and push it to the cloud and like run it, you know. So that was like the initial focus for me was like, I just want to solve that problem. Like I want to, you know, build something less, you run things in the cloud and like retain the sort of, you know, the joy of productivity as when you're running things locally. And in particular, I was quite focused on data teams, because I think they had a couple unique needs that wasn't well served by the infrastructure at that time, or like still is in like, in particular, like Kubernetes, I feel like it's like kind of worked okay for back end teams, but not so well for data teams. And very quickly, I got sucked into like a very deep like rabbit hole of like...Swyx [00:09:24]: Not well for data teams because of burstiness. Yeah, for sure.Erik [00:09:26]: So like burstiness is like one thing, right? Like, you know, like you often have this like fan out, you want to like apply some function over very large data sets. 
Another thing tends to be like hardware requirements, like you need like GPUs and like, I've seen this in many companies, like you go, you know, data scientists go to a platform team and they're like, can we add GPUs to the Kubernetes? And they're like, no, like, that's, you know, complex, and we're not gonna, so like just getting GPU access. And then like, I mean, I also like data code, like frankly, or like machine learning code like tends to be like, super annoying in terms of like environments, like you end up having like a lot of like custom, like containers and like environment conflicts. And like, it's very hard to set up like a unified container that like can serve like a data scientist, because like, there's always like packages that break. And so I think there's a lot of different reasons why the technology wasn't well suited for back end. And I think the attitude at that time is often like, you know, like you had friction between the data team and the platform team, like, well, it works for the back end stuff, you know, why don't you just like, you know, make it work. But like, I actually felt like data teams, you know, or at this point now, like there's so much, so many people working with data, and like they, to some extent, like deserve their own tools and their own tool chains, and like optimizing for that is not something people have done. So that's, that's sort of like very abstract philosophical reason why I started Model. And then, and then I got sucked into this like rabbit hole of like container cold start and, you know, like whatever, Linux, page cache, you know, file system optimizations.Swyx [00:10:43]: Yeah, tell people, I think the first time I met you, I think you told me some numbers, but I don't remember, like, what are the main achievements that you were unhappy with the status quo? And then you built your own container stack?Erik [00:10:52]: Yeah, I mean, like, in particular, it was like, in order to have that loop, right? You want to be able to start, like take code on your laptop, whatever, and like run in the cloud very quickly, and like running in custom containers, and maybe like spin up like 100 containers, 1000, you know, things like that. And so container cold start was the initial like, from like a developer productivity point of view, it was like, really, what I was focusing on is, I want to take code, I want to stick it in container, I want to execute in the cloud, and like, you know, make it feel like fast. And when you look at like, how Docker works, for instance, like Docker, you have this like, fairly convoluted, like very resource inefficient way, they, you know, you build a container, you upload the whole container, and then you download it, and you run it. And Kubernetes is also like, not very fast at like starting containers. So like, I started kind of like, you know, going a layer deeper, like Docker is actually like, you know, there's like a couple of different primitives, but like a lower level primitive is run C, which is like a container runner. And I was like, what if I just take the container runner, like run C, and I point it to like my own root file system, and then I built like my own virtual file system that exposes files over a network instead. 
And that was like the sort of very crude version of model, it's like now I can actually start containers very quickly, because it turns out like when you start a Docker container, like, first of all, like most Docker images are like several gigabytes, and like 99% of that is never going to be consumed, like there's a bunch of like, you know, like timezone information for like Uzbekistan, like no one's going to read it. And then there's a very high overlap between the files are going to be read, there's going to be like lib torch or whatever, like it's going to be read. So you can also cache it very well. So that was like the first sort of stuff we started working on was like, let's build this like container file system. And you know, coupled with like, you know, just using run C directly. And that actually enabled us to like, get to this point of like, you write code, and then you can launch it in the cloud within like a second or two, like something like that. And you know, there's been many optimizations since then, but that was sort of starting point.Alessio [00:12:33]: Can we talk about the developer experience as well, I think one of the magic things about Modal is at the very basic layers, like a Python function decorator, it's just like stub and whatnot. But then you also have a way to define a full container, what were kind of the design decisions that went into it? Where did you start? How easy did you want it to be? And then maybe how much complexity did you then add on to make sure that every use case fit?Erik [00:12:57]: I mean, Modal, I almost feel like it's like almost like two products kind of glued together. Like there's like the low level like container runtime, like file system, all that stuff like in Rust. And then there's like the Python SDK, right? Like how do you express applications? And I think, I mean, Swix, like I think your blog was like the self-provisioning runtime was like, to me, always like to sort of, for me, like an eye-opening thing. It's like, so I didn't think about like...Swyx [00:13:15]: You wrote your post four months before me. Yeah? The software 2.0, Infra 2.0. Yeah.Erik [00:13:19]: Well, I don't know, like convergence of minds. I guess we were like both thinking. Maybe you put, I think, better words than like, you know, maybe something I was like thinking about for a long time. Yeah.Swyx [00:13:29]: And I can tell you how I was thinking about it on my end, but I want to hear you say it.Erik [00:13:32]: Yeah, yeah, I would love to. So to me, like what I always wanted to build was like, I don't know, like, I don't know if you use like Pulumi. Like Pulumi is like nice, like in the sense, like it's like Pulumi is like you describe infrastructure in code, right? And to me, that was like so nice. Like finally I can like, you know, put a for loop that creates S3 buckets or whatever. And I think like Modal sort of goes one step further in the sense that like, what if you also put the app code inside the infrastructure code and like glue it all together and then like you only have one single place that defines everything and it's all programmable. You don't have any config files. Like Modal has like zero config. There's no config. It's all code. And so that was like the goal that I wanted, like part of that. And then the other part was like, I often find that so much of like my time was spent on like the plumbing between containers. 
And so my thing was like, well, if I just build this like Python SDK and make it possible to like bridge like different containers, just like a function call, like, and I can say, oh, this function runs in this container and this other function runs in this container and I can just call it just like a normal function, then, you know, I can build these applications that may span a lot of different environments. Maybe they fan out, start other containers, but it's all just like inside Python. You just like have this beautiful kind of nice like DSL almost for like, you know, how to control infrastructure in the cloud. So that was sort of like how we ended up with the Python SDK as it is, which is still evolving all the time, by the way. We keep changing syntax quite a lot because I think it's still somewhat exploratory, but we're starting to converge on something that feels like reasonably good now. Swyx [00:14:54]: Yeah. And along the way you, with this expressiveness, you enabled the ability to, for example, attach a GPU to a function. Totally. Erik [00:15:02]: Yeah. It's like you just like say, you know, on the function decorator, you're like GPU equals, you know, A100 and then or like GPU equals, you know, A10 or T4 or something like that. And then you get that GPU and like, you know, you just run the code and it runs like you don't have to, you know, go through hoops to, you know, start an EC2 instance or whatever. Swyx [00:15:18]: Yeah. So it's all code. Yeah. So one of the reasons I wrote Self-Provisioning Runtimes was I was working at AWS and we had AWS CDK, which is kind of like, you know, the Amazon Basics Pulumi. Yeah, totally. And then, and then like it creates, it compiles the CloudFormation. Yeah. And then on the other side, you have to like get all the config stuff and then put it into your application code and make sure that they line up. So then you're writing code to define your infrastructure, then you're writing code to define your application. And I was just like, this is like obvious that it's going to converge, right? Yeah, totally. Erik [00:15:48]: But isn't there like, it might be wrong, but like, was it like SAM or Chalice or one of those? Like, isn't that like an AWS thing that where actually they kind of did that? I feel like there's like one. Swyx [00:15:57]: SAM. Yeah. Still very clunky. It's not, not as elegant as modal. Erik [00:16:03]: I love AWS for like the stuff it's built, you know, like historically in order for me to like, you know, what it enables me to build, but like AWS is always like struggle with developer experience. Swyx [00:16:11]: I mean, they have to not break things. Erik [00:16:15]: Yeah. Yeah. And totally. And they have to build products for a very wide range of use cases. And I think that's hard. Swyx [00:16:21]: Yeah. Yeah. So it's, it's easier to design for. Yeah. So anyway, I was, I was pretty convinced that this, this would happen. I wrote, wrote that thing. And then, you know, imagine my surprise that you guys had it on your landing page at some point. I think, I think Akshad was just like, just throw that in there. Erik [00:16:34]: Did you trademark it? Swyx [00:16:35]: No, I didn't. But I definitely got sent a few pitch decks with my post on there and it was like really interesting. This is my first time like kind of putting a name to a phenomenon. And I think this is a useful skill for people to just communicate what they're trying to do. Erik [00:16:48]: Yeah. No, I think it's a beautiful concept. Swyx [00:16:50]: Yeah. Yeah. Yeah. 
But I mean, obviously you implemented it. What became more clear in your explanation today is that actually you're not that tied to Python.Erik [00:16:57]: No. I mean, I, I think that all the like lower level stuff is, you know, just running containers and like scheduling things and, you know, serving container data and stuff. So like one of the benefits of data teams is obviously like they're all like using Python, right? And so that made it a lot easier. I think, you know, if we had focused on other workloads, like, you know, for various reasons, we've like been kind of like half thinking about like CI or like things like that. But like, in a way that's like harder because like you also, then you have to be like, you know, multiple SDKs, whereas, you know, focusing on data teams, you can only, you know, Python like covers like 95% of all teams. That made it a lot easier. But like, I mean, like definitely like in the future, we're going to have others support, like supporting other languages. JavaScript for sure is the obvious next language. But you know, who knows, like, you know, Rust, Go, R, whatever, PHP, Haskell, I don't know.Swyx [00:17:42]: You know, I think for me, I actually am a person who like kind of liked the idea of programming language advancements being improvements in developer experience. But all I saw out of the academic sort of PLT type people is just type level improvements. And I always think like, for me, like one of the core reasons for self-provisioning runtimes and then why I like Modal is like, this is actually a productivity increase, right? Like, it's a language level thing, you know, you managed to stick it on top of an existing language, but it is your own language, a DSL on top of Python. And so language level increase on the order of like automatic memory management. You know, you could sort of make that analogy that like, maybe you lose some level of control, but most of the time you're okay with whatever Modal gives you. And like, that's fine. Yeah.Erik [00:18:26]: Yeah. Yeah. I mean, that's how I look at about it too. Like, you know, you look at developer productivity over the last number of decades, like, you know, it's come in like small increments of like, you know, dynamic typing or like is like one thing because not suddenly like for a lot of use cases, you don't need to care about type systems or better compiler technology or like, you know, the cloud or like, you know, relational databases. And, you know, I think, you know, you look at like that, you know, history, it's a steadily, you know, it's like, you know, you look at the developers have been getting like probably 10X more productive every decade for the last four decades or something that was kind of crazy. Like on an exponential scale, we're talking about 10X or is there a 10,000X like, you know, improvement in developer productivity. What we can build today, you know, is arguably like, you know, a fraction of the cost of what it took to build it in the eighties. Maybe it wasn't even possible in the eighties. So that to me, like, that's like so fascinating. I think it's going to keep going for the next few decades. Yeah.Alessio [00:19:14]: Yeah. Another big thing in the infra 2.0 wishlist was truly serverless infrastructure. The other on your landing page, you called them native cloud functions, something like that. I think the issue I've seen with serverless has always been people really wanted it to be stateful, even though stateless was much easier to do. 
And I think now with AI, most model inference is like stateless, you know, outside of the context. So that's kind of made it a lot easier to just put a model, like an AI model on model to run. How do you think about how that changes how people think about infrastructure too? Yeah.Erik [00:19:48]: I mean, I think model is definitely going in the direction of like doing more stateful things and working with data and like high IO use cases. I do think one like massive serendipitous thing that happened like halfway, you know, a year and a half into like the, you know, building model was like Gen AI started exploding and the IO pattern of Gen AI is like fits the serverless model like so well, because it's like, you know, you send this tiny piece of information, like a prompt, right, or something like that. And then like you have this GPU that does like trillions of flops, and then it sends back like a tiny piece of information, right. And that turns out to be something like, you know, if you can get serverless working with GPU, that just like works really well, right. So I think from that point of view, like serverless always to me felt like a little bit of like a solution looking for a problem. I don't actually like don't think like backend is like the problem that needs to serve it or like not as much. But I look at data and in particular, like things like Gen AI, like model inference, like it's like clearly a good fit. So I think that is, you know, to a large extent explains like why we saw, you know, the initial sort of like killer app for model being model inference, which actually wasn't like necessarily what we're focused on. But that's where we've seen like by far the most usage. Yeah.Swyx [00:20:52]: And this was before you started offering like fine tuning of language models, it was mostly stable diffusion. Yeah.Erik [00:20:59]: Yeah. I mean, like model, like I always built it to be a very general purpose compute platform, like something where you can run everything. And I used to call model like a better Kubernetes for data team for a long time. What we realized was like, yeah, that's like, you know, a year and a half in, like we barely had any users or any revenue. And like we were like, well, maybe we should look at like some use case, trying to think of use case. And that was around the same time stable diffusion came out. And the beauty of model is like you can run almost anything on model, right? Like model inference turned out to be like the place where we found initially, well, like clearly this has like 10x like better agronomics than anything else. But we're also like, you know, going back to my original vision, like we're thinking a lot about, you know, now, okay, now we do inference really well. Like what about training? What about fine tuning? What about, you know, end-to-end lifecycle deployment? What about data pre-processing? What about, you know, I don't know, real-time streaming? What about, you know, large data munging, like there's just data observability. I think there's so many things, like kind of going back to what I said about like redefining the data stack, like starting with the foundation of compute. 
Like one of the exciting things about model is like we've sort of, you know, we've been working on that for three years and it's maturing, but like this is so many things you can do like with just like a better compute primitive and also go up to stack and like do all this other stuff on top of it.Alessio [00:22:09]: How do you think about or rather like I would love to learn more about the underlying infrastructure and like how you make that happen because with fine tuning and training, it's a static memory. Like you exactly know what you're going to load in memory one and it's kind of like a set amount of compute versus inference, just like data is like very bursty. How do you make batches work with a serverless developer experience? You know, like what are like some fun technical challenge you solve to make sure you get max utilization on these GPUs? What we hear from people is like, we have GPUs, but we can really only get like, you know, 30, 40, 50% maybe utilization. What's some of the fun stuff you're working on to get a higher number there?Erik [00:22:48]: Yeah, I think on the inference side, like that's where we like, you know, like from a cost perspective, like utilization perspective, we've seen, you know, like very good numbers and in particular, like it's our ability to start containers and stop containers very quickly. And that means that we can auto scale extremely fast and scale down very quickly, which means like we can always adjust the sort of capacity, the number of GPUs running to the exact traffic volume. And so in many cases, like that actually leads to a sort of interesting thing where like we obviously run our things on like the public cloud, like AWS GCP, we run on Oracle, but in many cases, like users who do inference on those platforms or those clouds, even though we charge a slightly higher price per GPU hour, a lot of users like moving their large scale inference use cases to model, they end up saving a lot of money because we only charge for like with the time the GPU is actually running. And that's a hard problem, right? Like, you know, if you have to constantly adjust the number of machines, if you have to start containers, stop containers, like that's a very hard problem. Starting containers quickly is a very difficult thing. I mentioned we had to build our own file system for this. We also, you know, built our own container scheduler for that. We've implemented recently CPU memory checkpointing so we can take running containers and snapshot the entire CPU, like including registers and everything, and restore it from that point, which means we can restore it from an initialized state. We're looking at GPU checkpointing next, it's like a very interesting thing. So I think with inference stuff, that's where serverless really shines because you can drive, you know, you can push the frontier of latency versus utilization quite substantially, you know, which either ends up being a latency advantage or a cost advantage or both, right? On training, it's probably arguably like less of an advantage doing serverless, frankly, because you know, you can just like spin up a bunch of machines and try to satisfy, like, you know, train as much as you can on each machine. For that area, like we've seen, like, you know, arguably like less usage, like for modal, but there are always like some interesting use case. 
Like we do have a couple of customers, like Ramp, for instance, like they do fine tuning with modal and they basically like one of the patterns they have is like very bursty type fine tuning where they fine tune 100 models in parallel. And that's like a separate thing that modal does really well, right? Like you can, we can start up 100 containers very quickly, run a fine tuning training job on each one of them for that only runs for, I don't know, 10, 20 minutes. And then, you know, you can do hyper parameter tuning in that sense, like just pick the best model and things like that. So there are like interesting training. I think when you get to like training, like very large foundational models, that's a use case we don't support super well, because that's very high IO, you know, you need to have like InfiniBand and all these things. And those are things we haven't supported yet and might take a while to get to that. So that's like probably like an area where like we're relatively weak in. Yeah. Alessio [00:25:12]: Have you cared at all about lower level model optimization? There's other cloud providers that do custom kernels to get better performance or are you just given that you're not just an AI compute company? Yeah. Erik [00:25:24]: I mean, I think like we want to support like a generic, like general workloads in a sense that like we want users to give us a container essentially or a code or code. And then we want to run that. So I think, you know, we benefit from those things in the sense that like we can tell our users, you know, to use those things. But I don't know if we want to like poke into users containers and like do those things automatically. That's sort of, I think a little bit tricky from the outside to do, because we want to be able to take like arbitrary code and execute it. But certainly like, you know, we can tell our users to like use those things. Yeah. Swyx [00:25:53]: I may have betrayed my own biases because I don't really think about modal as for data teams anymore. I think you started, I think you're much more for AI engineers. My favorite anecdotes, which I think, you know, but I don't know if you directly experienced it. I went to the Vercel AI Accelerator, which you supported. And in the Vercel AI Accelerator, a bunch of startups gave like free credits and like signups and talks and all that stuff. The only ones that stuck are the ones that actually appealed to engineers. And the top usage, the top tool used by far was modal. Erik [00:26:24]: That's awesome. Swyx [00:26:25]: For people building with AI apps. Yeah. Erik [00:26:27]: I mean, it might be also like a terminology question, like the AI versus data, right? Like I've, you know, maybe I'm just like old and jaded, but like, I've seen so many like different titles, like for a while it was like, you know, I was a data scientist and a machine learning engineer and then, you know, there was like analytics engineers and there was like an AI engineer, you know? So like, to me, it's like, I just like in my head, that's to me just like, just data, like, or like engineer, you know, like I don't really, so that's why I've been like, you know, just calling it data teams. But like, of course, like, you know, AI is like, you know, like such a massive fraction of our like workloads. Swyx [00:26:59]: It's a different Venn diagram of things you do, right? 
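The bursty fan-out pattern Erik describes here (many short fine-tuning jobs in parallel, then pick the best) maps naturally onto a map-style call. Below is a rough sketch in the spirit of Modal's SDK, with hypothetical hyperparameters and a faked scoring function so the example stays self-contained; it is not a real fine-tuning job and the API details may differ.

```python
import modal

app = modal.App("bursty-finetune")

@app.function(gpu="A10G", timeout=1800)
def finetune(config: dict) -> float:
    # Placeholder: a real version would fine-tune a model with these
    # hyperparameters and return a validation score.
    return -abs(config["lr"] - 3e-5) - 0.01 * abs(config["rank"] - 16)

@app.local_entrypoint()
def main():
    configs = [{"lr": lr, "rank": r} for lr in (1e-5, 3e-5, 1e-4) for r in (8, 16, 32)]
    # .map fans each config out to its own container; the jobs run in parallel
    # for a few minutes each and the fleet scales back to zero afterwards.
    scores = list(finetune.map(configs))
    best = configs[scores.index(max(scores))]
    print("best config:", best)
```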
So the stuff that you're talking about where you need InfiniBand for highly parallel training, that's more of the ML engineer, that's more of the research scientist, and less of the AI engineer, which is more about working at the application layer.Erik [00:27:16]: Yeah. I mean, to be fair, we have a lot of users that are doing stuff that I don't think fits neatly into AI. We have a lot of people using Modal for web scraping, which is kind of nice. You can just fire up a hundred or a thousand containers running Chromium and render a bunch of webpages, and it takes, you know, whatever. Or protein folding, I mean, maybe that's, I don't know, but we have a bunch of users doing that. Or, in the realm of biotech, sequence alignment, or a couple of people using Modal to run large mixed integer programming problems, using Gurobi or things like that. Video processing is another thing that keeps coming up: let's say you have petabytes of video and you want to just transcode it, you can fire up a lot of containers and just run FFmpeg. So there are those things too. That being said, AI is by far our biggest use case, but again, Modal is kind of general purpose in that sense.Swyx [00:28:08]: Yeah. Well, maybe I'll stick to the stable diffusion thing and then we'll move on to the other use cases for AI that you want to highlight. The other big player in my mind is Replicate. Yeah. In this era, they're much more, I guess, custom built for that purpose, whereas you're more general purpose. How do you position yourself with them? Are they just for different audiences, or are you competing head on?Erik [00:28:29]: I think there's a tiny sliver of the Venn diagram where we're competitive, and then 99% of the area we're not competitive. I mean, if you look at front-end engineers, I think that's where they've really found good fit: people who built some cool web app and want some sort of AI capability, and an off the shelf model is perfect for them. For that, use Replicate. That's great. I think where we shine is custom models or custom workflows, running things at very large scale. We need to care about utilization, care about costs. We have much lower prices because we spend a lot more time optimizing our infrastructure, and that's where we're competitive, right? And you look at some of the use cases: Suno is a big user, they're running large scale AI. Oh, we're talking with Mikey.Swyx [00:29:12]: Oh, that's great. Cool.Erik [00:29:14]: In a month. Yeah. So, I mean, they're using Modal for production infrastructure. They have their own custom model, custom code and custom weights for AI generated music, Suno.AI. Those are the types of use cases that we like, things that are very custom, and those are the things that are very hard to run on Replicate, right? And that's fine.
Like I think they focus on a very different part of the stack in that sense.Swyx [00:29:35]: And then the other company pattern that I pattern match you to is Modular. I don't know.Erik [00:29:40]: Because of the names?Swyx [00:29:41]: No, no. Wow. No, but yeah, the name is very similar. I think there's something that might be insightful there from a linguistics point of view. They have Mojo, the sort of Python SDK, and they have the Modular Inference Engine, which is their cloud stack, their compute inference stack. I don't know if anyone's made that comparison to you before, but I see you evolving a little bit in parallel there.Erik [00:30:01]: No, I mean, maybe. Yeah. It's not a company I'm super familiar with, I mean, I know the basics, but I guess they're similar in the sense that they have a sort of big picture vision.Swyx [00:30:12]: Yes. They also want to build very general purpose. Yeah. So they're marketing themselves as: if you want to do off the shelf stuff, go somewhere else; if you want to do custom stuff, we're the best place to do it. Yeah. There is some overlap there. There's not overlap in the sense that you are a closed source platform, people have to host their code on you. That's true. Whereas they're very insistent on not running their own cloud service. They're boxed software. Yeah. They're licensed software.Erik [00:30:37]: I'm sure their VCs are at some point going to force them to reconsider. No, no.Swyx [00:30:40]: Chris is very, very insistent and very convincing. So anyway, I would just make that comparison, let people make the links if they want to. But it's an interesting way to see the cloud market develop from my point of view, because I came up in this field thinking cloud is one thing, and I think your vision is something slightly different, and I see the different takes on it.Erik [00:31:00]: Yeah. And one thing, I've written a bit about it in my blog too, is that I think of us as a second layer of cloud provider, in the sense that I think Snowflake is kind of a good analogy. Snowflake is infrastructure as a service, right? But they actually run on the major clouds, right? And you can analyze this very deeply, but one of the things I always thought about is, why does Snowflake win over Redshift? To me, it's because, in the end, AWS makes all the money anyway, and Snowflake just had the ability to focus on developer experience, or user experience. And to me that really proved that you can build a cloud provider a layer up from the traditional public clouds. And in that layer, that's also where I would put Modal: we're building a cloud provider, we're a multi-tenant environment that runs user code, but we're also building on top of the public cloud. So I think there's a lot of room in that space, and I think it's a very interesting direction.Alessio [00:31:55]: How do you think of that compared to the traditional past history? You know, you had AWS, then you had Heroku, then you had Render, Railway.Erik [00:32:04]: Yeah, I mean, I think those are all great. I think the problem they all faced was the graduation problem, right?
Like, you know, Heroku, I mean, with Heroku there's a counterfactual future of what would have happened if Salesforce didn't buy them, right? That's a separate thing. But I think what Heroku always struggled with was that eventually companies would get big enough that you couldn't really justify running on Heroku, so they would just move to AWS or whatever. And that's something that keeps me up at night too: what does that graduation risk look like for Modal? I always think the only way to build a successful infrastructure company in the long run in the cloud today is you have to appeal to the entire spectrum, right? Or at least you have to capture the enterprise market. But the truly good companies capture the whole spectrum, right? I think of companies like, I don't know, Datadog or Mongo or something, that captured the hobbyists early on, but also have very large enterprise customers. I think that arguably was where, in my opinion, Heroku struggled: how do you keep the customers as they get more and more advanced? I don't know what the solution is, but that's something I would have thought about deeply if I was at Heroku at that time.Alessio [00:33:14]: What's the AI graduation problem? Is it, I need to fine tune the model, I need better economics? Any insights from customer discussions?Erik [00:33:22]: Yeah, I mean, better economics, certainly. Although I would say, even for people who need thousands of GPUs, just because we can drive utilization so much better, there's actually a cost advantage of staying on Modal. But certainly the fact that VCs love throwing money (or at least used to) at companies who need it to buy GPUs, I think that didn't help the problem. And in training, I think there's less software differentiation, so in training there are certainly better economics in buying big clusters. But my hope is it's going to change, right? I think we're still pretty early in the cycle of building AI infrastructure, and I think in the long run a lot of these companies, except maybe the super big ones like Facebook and Google, who are always going to build their own, are better off to some extent buying platforms. And someone's going to have to build those platforms.Swyx [00:34:12]: Yeah. Cool. Let's move on to language models, and specifically that workload, just to flesh it out a little bit. You already said that Ramp is fine tuning 100 models at once simultaneously on Modal. Closer to home, my favorite example is ErikBot. Maybe you want to tell that story.Erik [00:34:30]: Yeah. I mean, it was a prototype thing we built for fun, but it's pretty cool. We basically built this thing that hooks up to Slack. It downloads all the Slack history and fine-tunes a model based on a person, and then you can chat with that. So you can clone yourself and talk to yourself on Slack. I mean, it's a nice demo, and it's fully contained in Modal.
Like there's a Modal app that does everything, right? It integrates with the Slack API, downloads the data, runs the fine-tuning, and then dynamically creates an inference endpoint. And it's all self-contained and, you know, a few hundred lines of code. So I think it kind of demonstrates a lot of the capabilities of Modal.Alessio [00:35:08]: Yeah. On a more personal side, how close did you feel ErikBot was to you?Erik [00:35:13]: It definitely captured the language. Yeah. I mean, I don't know about the content. I always feel this way about AI, and it's gotten better: when you glance at AI-generated text, it's like, yeah, this seems really smart, but then you actually look a little bit deeper and it's like, what does this mean?Swyx [00:35:32]: What is this person saying?Erik [00:35:33]: It's kind of vacuous, right? And that's kind of what I felt talking to my clone version. It says things where the grammar is correct, some of the sentences make a lot of sense, but what are you trying to say? There's no content here. I got that feeling also with ChatGPT in the early versions. Right now it's better, but.Alessio [00:35:51]: That's funny. So I built this thing called Smol Podcaster to automate a lot of our back office work, so to speak. And it's great at transcripts, it's great at doing chapters. And then I was like, okay, how about you come up with a short summary? And it sounds good, but it's not even in the same ballpark as what we'd end up writing. Right. And it's hard to see how it's going to get there.Swyx [00:36:11]: Oh, I have ideas.Erik [00:36:13]: I'm certain it's going to get there, but I agree with you. Right. And I have the same thing. I don't know if you've read AI-generated books. They just kind of seem funny, right? Like something's off, right? You glance at it and it's like, oh, it's kind of cool, it looks correct, but then it's very weird when you actually read them.Swyx [00:36:30]: Yeah. Well, so for what it's worth, I think anyone can join the Modal Slack. Is it open to the public? Yeah, totally.Erik [00:36:35]: If you go to modal.com, there's a button in the footer.Swyx [00:36:38]: Yeah. And then you can talk to ErikBot. And sometimes I really like pinging ErikBot and then you answer afterwards, but then you're like, yeah, mostly correct or whatever. Any other broader lessons, just broadening out from the single use case of fine tuning? What are you seeing people do with fine tuning or just language models on Modal in general? Yeah.Erik [00:36:59]: I mean, I think language models are interesting because so many people get started with APIs, and those are just dominating the space, in particular OpenAI, right? And that's not necessarily a place where we aim to compete. I mean, maybe at some point, but it's just not a core focus for us. And I think, separately, it's sort of a question of whether the economics are there long term. So we tend to focus more on the areas around it, right? Like fine tuning. Another use case we have is a bunch of people, Ramp included, doing batch embeddings on Modal.
So let's say, actually we're writing a blog post on this, we take all of Wikipedia and parallelize embeddings in 15 minutes and produce vectors for each article. Those types of use cases, I think Modal suits really well. I think also a lot of custom inference, like yeah, I love that.Swyx [00:37:43]: Yeah. I think you should give people an idea of the order of magnitude of parallelism, because I think people don't understand how parallel it is. So I think your classic hello world with Modal is some kind of Fibonacci function, right? Yeah, we have a bunch of different ones. Some recursive function. Yeah.Erik [00:37:59]: Yeah. I mean, it's pretty easy in Modal to fan out to at least 100 GPUs in a few seconds. And if you give it a couple of minutes, you can fan out to thousands of GPUs. We run at relatively large scale. And yeah, we've run many thousands of GPUs at certain points when we needed big backfills or some customers had very large compute needs.Swyx [00:38:21]: Yeah. And I mean, that's super useful for a number of things. So one of my early interactions with Modal as well was with smol developer, which is my sort of coding agent. The reason I chose Modal was a number of things. One, I just wanted to try it out, I just had an excuse to try it. Akshay offered to onboard me personally. But the most interesting thing was that you could have that sort of local development experience as it was running on my laptop, but then it would seamlessly translate to a cloud service or a cloud hosted environment. And then it could fan out with concurrency controls. Because the number of times I hit the GPT-3 API at the time was going to be subject to the rate limit, but I wanted to fan out without worrying about that kind of stuff. With Modal, I can just declare that in my config and that's it. Oh, like a concurrency limit?Erik [00:39:07]: Yeah. Yeah.Swyx [00:39:09]: Yeah. There's a lot of control. And that's why this is a pretty good use case for writing this kind of LLM application code inside of an environment that just understands fan out and rate limiting natively. You don't actually have an exposed queue system, but you have it under the hood, that kind of stuff. Totally.Erik [00:39:28]: It's a self-provisioning cloud.Swyx [00:39:30]: So the last part of Modal I wanted to touch on, and obviously feel free, I know you're working on new features, was the sandbox that was introduced last year. And this is something that I think was inspired by Code Interpreter. You can tell me the longer history behind that.Erik [00:39:45]: Yeah. We originally built it for the use case where a bunch of customers were looking into code generation applications, and they came to us and asked, is there a safe way to execute code? And yeah, we spent a lot of time on container security. We use gVisor, for instance, which is a Google product that provides pretty strong isolation of code. So we built a product where you can basically run arbitrary code inside a container and monitor its output, or get it back in a safe way.
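As an aside, the "run arbitrary code safely and get its output back" primitive Erik just described could look something like the following. This is a rough sketch assuming Modal's Sandbox interface (Sandbox.create plus a readable stdout stream); the exact parameter names and the untrusted snippet are assumptions for illustration, not code from the episode.

```python
# Assumed shape of Modal's Sandbox API for running untrusted code in isolation.
import modal

app = modal.App.lookup("sandbox-sketch", create_if_missing=True)
image = modal.Image.debian_slim().pip_install("numpy")

untrusted_code = "import numpy as np; print(np.arange(5).sum())"  # e.g. LLM-generated code

sb = modal.Sandbox.create(
    "python", "-c", untrusted_code,
    app=app,
    image=image,
    timeout=60,
)
sb.wait()                   # block until the process exits
print(sb.stdout.read())     # capture output from the isolated container
print("exit code:", sb.returncode)
```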
I mean, over time it's evolved. I think the long-term direction is actually more interesting, which is Modal as a platform where the core container infrastructure we offer could actually be unbundled from the client SDK and offered to others. We're talking to a couple of other companies that want to run, through their packages, execute jobs on Modal programmatically. So that's actually the direction Sandbox is going. It's turning into more of a platform for platforms, is kind of how I've been thinking about it.Swyx [00:40:45]: Oh boy. Platform. That's the old Kubernetes line.Erik [00:40:48]: Yeah. Yeah. But having that ability to programmatically create containers and execute them, I think, is really cool. And I think it opens up a lot of interesting capabilities that are sort of separate from the core Python SDK in Modal. So I'm really excited about it. It's one of those features that we released, and then we look at what users actually build with it, and people are starting to build kind of crazy things. And then we double down on some of those things when we see potential new product features, and Sandbox, I think in that sense, is kind of in that direction. We've found a lot of interesting use cases in the direction of a platformized container runner.Swyx [00:41:27]: Can you be more specific about what you're doubling down on after seeing users in action?Erik [00:41:32]: I mean, we're working with some companies that, without getting into specifics, need the ability to take their users' code and then launch containers on Modal. And it's not about security necessarily, they just want to use Modal as a back end, right? They may already provide Kubernetes as a back end, Lambda as a back end, and now they want to add Modal as a back end, right? And so they need a way to programmatically define jobs on behalf of their users and execute them. And so, I don't know, that's kind of abstract, but does that make sense? I totally get it.Swyx [00:42:03]: It's sort of one level of recursion, to be the Modal for their customers.Erik [00:42:09]: Exactly.Swyx [00:42:10]: Yeah, exactly. And Cloudflare has done this. Kenton Varda from Cloudflare, who's the tech lead on this thing, called it sort of functions as a service as a service.Erik [00:42:17]: Yeah, that's exactly right. FaaSaaS.Swyx [00:42:21]: FaaSaaS. Yeah, I think that's something any base layer, second layer cloud provider like yourself, compute provider like yourself, should provide. It's a mark of maturity and success that people just trust you to do that. They'd rather build on top of you than compete with you. The more interesting thing for me is, what does it mean to serve a computer, like an LLM developer, rather than a human developer, right? That's what a sandbox is to me: you have to redefine Modal to serve a different, non-human audience.Erik [00:42:51]: Yeah. Yeah, and I think there are some really interesting people building very cool things.Swyx [00:42:55]: Yeah. So I don't have an answer, but I imagine things like, hey, the way you give feedback is different.
Maybe you have to stream errors, log errors differently. I don't really know. Yeah. Obviously, there are safety considerations. Maybe you have an API to restrict access to the web. Yeah. I don't think anyone would use it, but it's there if you want it.Erik [00:43:17]: Yeah.Swyx [00:43:18]: Yeah. Any other sort of design considerations? I have no idea.Erik [00:43:21]: With sandboxes?Swyx [00:43:22]: Yeah. Yeah.Erik [00:43:24]: Open-ended question here. Yeah. I mean, I think the network restrictions make a lot of sense. Long-term, I think there are a lot of interesting use cases where the LLM itself can decide, I want to install these packages and run this thing. And obviously, for a lot of those use cases, you want to have some sort of control so that it doesn't install malicious stuff and steal your secrets and things like that. But I think that's what's exciting about the sandbox primitive: it lets you do that in a relatively safe way.Alessio [00:43:51]: Do you have any thoughts on the inference wars? A lot of providers are just racing to the bottom to get the lowest price per million tokens. Some of them, you know, are just losing money, and the physics of it just don't work out for them to make any money on it. How do you think about your pricing and how much premium you can command, versus using lower prices as kind of a wedge into getting there, especially once you have the model instrumented? What are the tradeoffs, and any thoughts on strategies that work?Erik [00:44:23]: I mean, we focus more on custom models and custom code. And I think in that space there's less competition, and I think we can have a pricing markup, right? People will always compare our prices to the GPU power they can get elsewhere. And so how big can that markup be? We can never charge 10x more, but we can certainly charge a premium. And for that reason we can have pretty good margins. The LLM space is the opposite: the switching cost of LLMs is zero, at least for open source, right? If all you're doing is using some inference endpoint that serves an open source model and some other provider comes along and offers a lower price, you're just going to switch, right? So I don't know, to me that reminds me a lot of all these 15 minute delivery wars, or Uber versus Lyft, and maybe going back even further. I think a lot about the flip side of this, which is actually a positive side: I thought a lot about the fiber optics boom of 98, 99 the other day, and also the overinvestment in GPUs today. In the end, I don't think VCs will have the return they expected on these things, but guess who's going to benefit: the consumers. Someone's reaping the value of this.
And that's, I think, an amazing flip side: we should be very grateful that VCs want to subsidize these things. You go back to fiber optics: there was an extreme overinvestment in fiber optic networks in 98, and no one who did that made money. But consumers got tremendous benefits from all the fiber optic cables that were laid throughout the country in the decades after. I feel something similar about...
The next time you're cracking up at a comedian from Ceará and think "these people need to be studied," rest easy: there is good research on Ceará humor. We talked with Professor Francisco Secundo da Silva Neto, from the Universidade Federal do Ceará. Credits Hosts: Thiago Corrêa and Leticia Dáquer Guest: Prof. Francisco Secundo da Silva Neto Editing: Thiago Corrêa Cover art: Leticia Dáquer Recording date: 17/01/2024 Publication date: 24/01/2024 Things mentioned in or related to the episode: Deu bode no museu: os diversos significados atribuídos à insólita presença do bode Ioiô no Museu do Ceará de 1935 aos dias atuais. Mulheres comediantes cearenses descobrem na internet uma nova modalidade de fazer humor (Diário do Nordeste, 08/11/2020) O novo humor cearense: influenciadores digitais que perpetuam o legado da descontração (Conexão 085, 17/05/2021) Programa de Instalação Da Padaria Espiritual - Ceará Music/audio: Ednardo - Padaria Espiritual (Pseudo Video) Eduardo Dusek - Brega Chique (1984) Falcão - Esculhambação, Sim, Frescura Não (Áudio Oficial) A Balada do Pistoleiro Recommendations: Leticia Dáquer: the book Um Rio Chamado Tempo, Uma Casa Chamada Terra, by Mia Couto Francisco Secundo: the book Eu não sou cachorro não, by Paulo César de Araújo; the book História do Riso e do Escárnio, by Georges Minois; the documentary O Riso dos Outros, by Pedro Arantes (Youtube) Thiago Corrêa: the reality shows Casamento às cegas and Falso Amor, both available on Netflix Plugs Francisco Secundo Instagram: @secundoneto Leticia Dáquer Twitter: @pacamanca Blog: www.pacamanca.com Thiago Corrêa Twitter: @thiago_czz Partnership with Veste Esquerda: Pistolando t-shirts are now available directly on the Veste Esquerda site! And the discount code PISTOLA10 gives 10% off your purchase of our shirt and other very cool lefty shirts! Partnership with Editora Boitempo: buy books through this link so we earn a small commission :) This podcast is produced by Estopim Podcasts. Need help making your own podcast? Come on over and we'll give you a hand. Pistolando links www.pistolando.com contato@pistolando.com Twitter: @PistolandoPod Instagram: @PistolandoPod Support Pistolando on Catarse, on Patreon, and now also on PicPay. If you'd rather send a pix, our key is contato@pistolando.com Cover description: Photo of the singer and comedian Falcão, a white man with graying hair, mustache and goatee, wearing sunglasses. He is wearing a blazer with a loud print of huge yellow sunflowers on a blue background, a giant sunflower on the lapel, many bracelets and rings, a white shirt, and a striped tie that has nothing to do with the blazer's print. He is looking at himself in a small mirror held in his left hand. His right hand is raised as if he were about to fix his hair. The background of the photo is white. At the top left, the Pistolando logo, in black. On the other side of Falcão, the episode number and title, also in black. In the bottom right corner, the Estopim logo, in black.
This episode kicks off with a wonderful message from Mirke the Meek, whose podcast has had me contemplating the format of my show. We then have a potted history of Traveller courtesy of Michael “Chigowiz” Shorten of The Dungeon Master's Handbook and a first-time call from Lex Mandrake of Dank Dungeons (creator of AZAG and 5B). Lex's creations are available on both DriveThruRPG https://preview.drivethrurpg.com/en/publisher/24684/dank-dungeons?keyword=dank%20dungeons and itch https://dank-dungeons.itch.io. I then take a look at some more recent Gus L. goodness in the shape of The Curse of the Ganshogrr https://preview.drivethrurpg.com/en/product/455858/the-curse-of-the-ganshoggr. Honourable mentions: Karl Rodriguez of the GMologist presents…, Marc Miller's Traveller Facsimile Edition and 5th Edition, Mongoose Traveller, GURPs Traveller, Samardan Press's Cepheus, Kevin Crawford's Stars Without Number, Battlestar Galactica, TSR Dungeons & Dragons, WotC, Kelsey Dionne's ShadowDark, Chaosium's Call of Cthulhu, Shadows of Yog-Sothoth, Mike Mason's Pulp Cthulhu, Questingbeast's Glatisant newsletter, Kill Jester's Errant. Music by Timothy J. Drennon. "Warning" by Lieren of Updates From the Middle of Nowhere. Leave me an audio message via https://www.speakpipe.com/KeepOffTheBorderlands. You can email me at spencer.freethrall@gmail.com. Subscribe to my newsletter The Stochasium. You can find me in a bunch of other places here https://freethrall.carrd.co and in actual plays on Grizzly Peaks Radio https://podcasts.apple.com/us/podcast/grizzly-peaks-radio/id1530832637. You can also find me on Discord by searching for FreeThrall/KeepOffTheBorderlands#7623. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit freethrall.substack.com
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sparse Autoencoders Find Highly Interpretable Directions in Language Models, published by Logan Riggs Smith on September 21, 2023 on The AI Alignment Forum. This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models. We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both the residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability. Paper Overview. Sparse Autoencoders & Superposition. To reverse engineer a neural network, we'd like to first break it down into smaller units (features) that can be analysed in isolation. Using individual neurons as these units can be useful, but neurons are often polysemantic, activating for several unrelated types of feature, so just looking at neurons is insufficient. Also, for some types of network activations, like the residual stream of a transformer, there is little reason to expect features to align with the neuron basis, so we don't even have a good place to start. Toy Models of Superposition investigates why polysemanticity might arise and hypothesises that it may result from models learning more distinct features than there are dimensions in the layer, taking advantage of the fact that features are sparse, each one only being active a small proportion of the time. This suggests that we may be able to recover the network's features by finding a set of directions in activation space such that each activation vector can be reconstructed from a sparse linear combination of these directions. We attempt to reconstruct these hypothesised network features by training linear autoencoders on model activation vectors. We use a sparsity penalty on the embedding, and tied weights between the encoder and decoder, training the models on 10M to 50M activation vectors each. For more detail on the methods used, see the paper. Automatic Interpretation. We use the same automatic interpretation technique that OpenAI used to interpret the neurons in GPT2 to analyse our features, as well as alternative methods of decomposition. This was demonstrated in a previous post, but we now extend these results across all 6 layers in Pythia-70M, showing a clear improvement over all baselines in all but the final layers. Case studies later in the paper suggest that the features are still meaningful in these later layers, but that automatic interpretation struggles to perform well. IOI Feature Identification. We are able to use less-than-rank-one ablations to precisely edit activations to restore uncorrupted behaviour on the IOI task. With normal activation patching, patches occur at a module-wide level, while here we perform interventions of the form $x' = \bar{x} + \sum_{i \in F} (c_i - \bar{c}_i) f_i$, where $\bar{x}$ is the embedding of the corrupted datapoint, $F$ is the set of patched features, and $c_i$ and $\bar{c}_i$ are the activations of feature $f_i$ on the clean and corrupted datapoints respectively. We show that our features are better able to precisely reconstruct the data than other activation decomposition methods (like PCA), and moreover that the fine-grainedness of our edits increases with dictionary sparsity.
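For concreteness, the feature-level patch defined by the formula above can be sketched in a few lines of PyTorch. This is an illustrative reimplementation of the equation only, not the paper's code; tensor shapes and the toy values are made up.

```python
# Sketch of the feature-level patch x' = x_bar + sum_{i in F} (c_i - c_bar_i) f_i (illustrative only).
import torch

def patch_features(x_corrupt, c_clean, c_corrupt, dictionary, patched_idx):
    """
    x_corrupt:   [d_model]             corrupted activation (x-bar in the formula)
    c_clean:     [n_features]          feature activations on the clean datapoint (c_i)
    c_corrupt:   [n_features]          feature activations on the corrupted datapoint (c-bar_i)
    dictionary:  [n_features, d_model] decoder directions (f_i)
    patched_idx: the set F of features to patch
    """
    x = x_corrupt.clone()
    for i in patched_idx:
        x += (c_clean[i] - c_corrupt[i]) * dictionary[i]
    return x

# Toy usage with made-up sizes
d_model, n_features = 512, 2048
dictionary = torch.randn(n_features, d_model)
x_corrupt = torch.randn(d_model)
c_clean, c_corrupt = torch.rand(n_features), torch.rand(n_features)
x_patched = patch_features(x_corrupt, c_clean, c_corrupt, dictionary, patched_idx=[3, 17, 404])
```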
Unfortunately, as our autoencoders are not able to perfectly reconstruct the data, they have a positive minimum KL-divergence from the base model, while PCA does not. Dictionary Features are Highly Monosemantic & Causal. (Left) Histogram of activations for a specific dictionary feature: the majority of activations are for apostrophes (in blue); the y-axis is the number of datapoints that activate in that bin. (Right) Histogram of the drop in logits (i.e. how much the LLM predicts a spe...
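Stepping back, the sparse autoencoder setup the summary describes (a linear autoencoder with tied encoder/decoder weights and an L1 sparsity penalty, trained on cached model activations) can also be sketched briefly. This is a hedged illustration, not the paper's code; the sizes, the ReLU placement, and the loss weighting are assumptions.

```python
# Hedged sketch of a tied-weight sparse autoencoder trained on activation vectors.
import torch
import torch.nn as nn

class TiedSparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_features, d_model) * 0.01)  # shared dictionary
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        c = torch.relu(x @ self.W.T + self.b_enc)  # sparse feature activations
        x_hat = c @ self.W + self.b_dec            # reconstruction from the same (tied) weights
        return x_hat, c

sae = TiedSparseAutoencoder(d_model=512, n_features=2048)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # made-up sparsity weighting

activations = torch.randn(64, 512)  # stand-in for cached LLM activation vectors
x_hat, c = sae(activations)
loss = ((x_hat - activations) ** 2).mean() + l1_coeff * c.abs().mean()
loss.backward()
opt.step()
```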
Q & A call-in discussion with a survivor-professional, using an OPEN MIKE forum. We'll feature a survivor-professional co-host who'll field topics brought to the episode by you, the listener. ~~ Tonight the special co-host is Jodie Tedder, the President of IOI, Inc., "The Island of Immunity." "It's a safe and nurturing community of healing for survivors of childhood sexual abuse and other forms of trauma," Jodie says. The words on her BLOG are certainly revealing. "It's not just all self-love like we keep hearing, 'Love yourself before anyone else!' I actually think it's the opposite: you can't really love yourself until you love other people." Jodie goes on, "When I lived at home with my father, I was favored. His favor included being backhanded when I showed the slightest sign of disobedience, rape, sodomy and playing the game Risk." She's outspoken, too. "We don't seem to talk of or hear much about adults who are complicit in sexual abuse (and other forms of abuse as well) – including parents – either one or both. In other forms of crime we have a term for this and we call it accomplice. This person can be charged for a crime." ~~ On these episodes we welcome various co-hosts, survivor-professionals who'll assist in fielding questions and lead a variety of topics suggested by our call-in participants. Their trauma-informed perspectives as survivor-professionals will help them guide discussions on the issues of child abuse, trauma and healthy human sexuality that spring from questions and topics brought to us by our listeners. ~~ Everyone's invited to engage on tonight's show. ~~ Please visit the NAASCA.org web site.
When the creator of the OASIS dies, he leaves an Easter Egg deep in his virtual world that will grant whoever finds it complete control of the OASIS. For Wade Watts, aka Parzival, and his friends The High-Fives, it's a race to be the first to the egg as they battle the challenges of the OASIS plus the evil organization IOI, which is out to control it all for itself. Steven Spielberg directs the adaptation of Ernest Cline's pop culture phenomenon of a book, Ready Player One! We also go all in, but spoiler free, on our thoughts on Mission: Impossible - Dead Reckoning Part One in what we watched this week. We also talk the latest news, and it's time for This Month In Pop Culture History! We also preview next week's film, Robocop 2! Visit us for all episodes & more at www.therebelradiopodcast.com Please leave us a 5-Star review on iTunes! You can also find us on Spotify and iHeartRadio. Follow us on Facebook
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to Think About Activation Patching, published by Neel Nanda on June 4, 2023 on LessWrong. This is an excerpt from my post on attribution patching, which I think is of more general interest, about how to think about the technique of activation patching in mechanistic interpretability, and what it can and cannot teach us. You don't need to know what attribution patching is to read this; check out this section for an overview of activation patching. What Is the Point of Patching? The conceptual framework I use when thinking about patching is to think of a model as an enormous mass of different circuits. On any given input, many circuits will be used. Further, while some circuits are likely very common and important (eg words after full stops start with capitals, or "this text is in English"), likely many are very rare and niche and will not matter on the vast majority of inputs (eg "this is token X in the Etsy terms and conditions footer" - a real thing that GPT-2 Small has neurons for!). For example, on IOI, beyond the actual IOI circuit, there are likely circuits to: detect multi-token words (eg names) and assemble them into a single concept; detect that this is English text; detect that this is natural language text; encode unigram frequencies (under which John is 2-3x as likely as Mary!); copy all names present in the context (if a name is present in the text it's way more likely to occur next than an unseen name!); recognize that this is the second clause of a sentence; recognize that Indirect Object Identification is the right grammatical structure to use; and recognize that the next word should be a name. As a mech interp researcher, this is really annoying! I can get traction on a circuit in isolation, and there's a range of tools with ablations, direct logit attribution, etc to unpick what's going on. And hopefully any given circuit will be clean and sparse, such that I can ignore most of a model's parameters and most activation dimensions, and focus on what's actually going on. But when any given input triggers many different circuits, it's really hard to know what's happening. The core point of patching is to solve this. In IOI, most of the circuits will fire the same way regardless of which name is the indirect object/repeated subject. So by formulating a clean and corrupted input that are as close as possible except for the key detail of this name, we can control for as many of the shared circuits as possible. Then, by patching in activations from one run to another, we do not affect the many shared circuits, which lets us isolate out the circuit we care about. Taking the logit difference (ie the difference between the log prob of the correct and incorrect answer) also helps achieve this, by controlling for the circuits that decide whether to output a name at all. Importantly, patching should be robust to some of our conceptual frameworks being wrong, eg a model not having a linear representation, or circuits mattering according to paths that go through every single layer. Though it's much less informative when a circuit is diffuse across many heads and neurons than when it's sparse. Whenever reasoning about patching, it's valuable to keep this in mind!
And to be clear to yourself about what exactly is the scope of your investigation - which circuits do you want to identify, how well have you controlled for the irrelevant circuits, what aspects of the circuits you do care about have you accidentally controlled for, etc. And even if you want to identify several circuits, it's generally best to try to have several types of clean and corrupted inputs, so that you can isolate out each circuit one at a time (eg in the Python docstring circuit). Patching Clean to Corrupted vs Corrupted to Clean One surprisingly subtle detail is whether you patch a clean activation into the corrupted run, or vice v...
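To make the recipe above concrete, here is a small illustrative sketch (not Neel's code) of clean-to-corrupted activation patching with the TransformerLens library: cache a clean run, patch one attention head's output into the corrupted run, and score the result with the logit difference. The prompts, the choice of layer 9 / head 9, and the metric are arbitrary examples.

```python
# Illustrative clean -> corrupted patching with a logit-difference metric (not Neel's code).
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "When John and Mary went to the store, Mary gave a bottle to"
corrupt_prompt = "When John and Mary went to the store, John gave a bottle to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)
answer, wrong = model.to_single_token(" John"), model.to_single_token(" Mary")

_, clean_cache = model.run_with_cache(clean_tokens)

LAYER, HEAD = 9, 9  # arbitrary choice for the example

def patch_head(z, hook):
    # z has shape [batch, pos, head_index, d_head]; restore one head's clean output.
    z[:, :, HEAD] = clean_cache[hook.name][:, :, HEAD]
    return z

patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(utils.get_act_name("z", LAYER), patch_head)],
)
# Logit difference controls for "output a name at all" circuits, as described above.
logit_diff = patched_logits[0, -1, answer] - patched_logits[0, -1, wrong]
print("patched logit diff (answer - wrong):", float(logit_diff))
```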
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to Think About Activation Patching, published by Neel Nanda on June 4, 2023 on The AI Alignment Forum. This is an excerpt from my post on attribution patching, which I think is of more general interest, about how to think about the technique of activation patching in mechanistic interpretability, and what it can and cannot teach us. You don't need to know what attribution patching is to read this; check out this section for an overview of activation patching. What Is the Point of Patching? The conceptual framework I use when thinking about patching is to think of a model as an enormous mass of different circuits. On any given input, many circuits will be used. Further, while some circuits are likely very common and important (eg words after full stops start with capitals, or "this text is in English"), likely many are very rare and niche and will not matter on the vast majority of inputs (eg "this is token X in the Etsy terms and conditions footer" - a real thing that GPT-2 Small has neurons for!). For example, on IOI, beyond the actual IOI circuit, there are likely circuits to: detect multi-token words (eg names) and assemble them into a single concept; detect that this is English text; detect that this is natural language text; encode unigram frequencies (under which John is 2-3x as likely as Mary!); copy all names present in the context (if a name is present in the text it's way more likely to occur next than an unseen name!); recognize that this is the second clause of a sentence; recognize that Indirect Object Identification is the right grammatical structure to use; and recognize that the next word should be a name. As a mech interp researcher, this is really annoying! I can get traction on a circuit in isolation, and there's a range of tools with ablations, direct logit attribution, etc to unpick what's going on. And hopefully any given circuit will be clean and sparse, such that I can ignore most of a model's parameters and most activation dimensions, and focus on what's actually going on. But when any given input triggers many different circuits, it's really hard to know what's happening. The core point of patching is to solve this. In IOI, most of the circuits will fire the same way regardless of which name is the indirect object/repeated subject. So by formulating a clean and corrupted input that are as close as possible except for the key detail of this name, we can control for as many of the shared circuits as possible. Then, by patching in activations from one run to another, we do not affect the many shared circuits, which lets us isolate out the circuit we care about. Taking the logit difference (ie the difference between the log prob of the correct and incorrect answer) also helps achieve this, by controlling for the circuits that decide whether to output a name at all. Importantly, patching should be robust to some of our conceptual frameworks being wrong, eg a model not having a linear representation, or circuits mattering according to paths that go through every single layer. Though it's much less informative when a circuit is diffuse across many heads and neurons than when it's sparse. Whenever reasoning about patching, it's valuable to keep this in mind!
And to be clear to yourself about what exactly is the scope of your investigation - which circuits do you want to identify, how well have you controlled for the irrelevant circuits, what aspects of the circuits you do care about have you accidentally controlled for, etc. And even if you want to identify several circuits, it's generally best to try to have several types of clean and corrupted inputs, so that you can isolate out each circuit one at a time (eg in the Python docstring circuit). Patching Clean to Corrupted vs Corrupted to Clean One surprisingly subtle detail is whether you patch a clean activation into the corrupted r...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Input Swap Graphs: Discovering the role of neural network components at scale, published by Alexandre Variengien on May 12, 2023 on LessWrong. This post was written as part of the work done at Conjecture. You can try input swap graphs in a Colab notebook or explore the library to replicate the results. Thanks to Beren Millidge and Eric Winsor for useful discussions throughout this project. Thanks to Beren for feedback on a draft of this post. Activation and path patching are techniques employed to manipulate neural network internals. One approach, used in ROME, which we'll refer to as corrupted patching, involves setting the output of certain parts of the network to a corrupted value. Another method, used in the work on indirect object identification and referred to as input swap patching, involves assigning these parts the values they would produce on a different input sample, while the rest of the model operates normally. These experimental techniques have proven effective in gaining structural insights into the information flow within neural networks, providing a better understanding of which components are involved in a given behavior. These structural analyses can be straightforwardly automated using algorithms like Automatic Circuit DisCovery. Patching experiments also enable the discovery of the semantics of components, i.e. the abstract variables a component encodes. They have been used to discover the "a vs an" neuron in gpt2, the token and position signal in S-inhibition heads in the IOI circuit, or Othello board representations. In all these cases, researchers were not only able to say 'this component matters', but also 'it roughly plays this role'. However, these additional results come at a cost: they often required a lot of manual effort to design alternative datasets and come up with an intuition for the likely role of the component in advance. In this post, I introduce input swap graphs (swap graphs for short), a new tool to systematically analyze the results of input swap patching experiments and recover semantic information with as little human effort as possible. In the following, I present: a concise overview of swap graphs, their motivation, and a practical application to Name Movers within the IOI circuit; a simple theoretical framework for examining the problem of localized computation in computational graphs, which serves as a sandbox for rapidly building intuition to create or interpret experiments, such as swap graphs, that can be applied to real-world neural networks; and an expansion of the IOI study in GPT-2 small to models containing several billion parameters (GPT-2 XL, Pythia-2.8B, GPT-Neo-2.7B) using swap graphs. We validated our findings through causal scrubbing and targeted interventions that steer the mechanism out-of-distribution predictably. This both demonstrates the potential for applying swap graphs to larger models and presents valuable data to investigate how a model's internal structure changes with scale while performing the same task. A five-minute overview of swap graphs. Motivation. Deciphering the intermediate activations of neural network components can be a daunting task. The prevalent approach involves breaking down these high-dimensional vectors into subspaces where variations correspond to concepts that are both relevant to the model's computation and comprehensible to humans.
In this work, we adopt an alternative strategy: we limit ourselves to using only input swap patching in order to uncover the roles of neural network components without directly exploiting the structure of their activations. While it is common to combine information about directions in activation space and input swap patching experiments in interpretability research (e.g., as demonstrated in Neel Nanda's work on Othello), our objective here is t...
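As a toy illustration of the kind of experiment swap graphs systematize (this is not the swap graphs library itself), one can loop over pairs of inputs, swap a chosen component's activation from one input's run into another's, and record how much the output moves. The prompts, the patched hook, and the effect metric below are all assumptions made for the example.

```python
# Toy pairwise input-swap experiment (not the swap graphs library itself).
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

# Assumed example prompts; they should tokenize to the same length for the swap to be valid.
prompts = [
    "When John and Mary went to the store, Mary gave a bottle to",
    "When Sarah and Tom went to the park, Tom handed a ball to",
    "When Alice and Bob walked to the office, Bob gave the keys to",
]
tokens = [model.to_tokens(p) for p in prompts]
hook_name = utils.get_act_name("z", 9)  # one layer's attention head outputs, as an example

caches = [model.run_with_cache(t)[1] for t in tokens]
base_logits = [model(t)[0, -1] for t in tokens]

n = len(prompts)
effect = torch.zeros(n, n)  # effect[i, j]: output shift on input i given input j's activation
for i in range(n):
    for j in range(n):
        def swap(z, hook, j=j):
            return caches[j][hook.name]
        logits = model.run_with_hooks(tokens[i], fwd_hooks=[(hook_name, swap)])
        effect[i, j] = (logits[0, -1] - base_logits[i]).abs().max()

# Inputs that can be swapped with little effect likely share whatever this component encodes.
print(effect)
```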
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Input Swap Graphs: Discovering the role of neural network components at scale, published by Alexandre Variengien on May 12, 2023 on The AI Alignment Forum. This post was written as part of the work done at Conjecture. You can try input swap graphs in a Colab notebook or explore the library to replicate the results. Thanks to Beren Millidge and Eric Winsor for useful discussions throughout this project. Thanks to Beren for feedback on a draft of this post. Activation and path patching are techniques employed to manipulate neural network internals. One approach, used in ROME, which we'll refer to as corrupted patching, involves setting the output of certain parts of the network to a corrupted value. Another method, used in the work on indirect object identification and referred to as input swap patching, involves assigning these parts the values they would produce on a different input sample, while the rest of the model operates normally. These experimental techniques have proven effective in gaining structural insights into the information flow within neural networks, providing a better understanding of which components are involved in a given behavior. These structural analyses can be straightforwardly automated using algorithms like Automatic Circuit DisCovery. Patching experiments also enable the discovery of the semantics of components, i.e. the abstract variables a component encodes. They have been used to discover the "a vs an" neuron in gpt2, the token and position signal in S-inhibition heads in the IOI circuit, or Othello board representations. In all these cases, researchers were not only able to say 'this component matters', but also 'it roughly plays this role'. However, these additional results come at a cost: they often required a lot of manual effort to design alternative datasets and come up with an intuition for the likely role of the component in advance. In this post, I introduce input swap graphs (swap graphs for short), a new tool to systematically analyze the results of input swap patching experiments and recover semantic information with as little human effort as possible. In the following, I present: a concise overview of swap graphs, their motivation, and a practical application to Name Movers within the IOI circuit; a simple theoretical framework for examining the problem of localized computation in computational graphs, which serves as a sandbox for rapidly building intuition to create or interpret experiments, such as swap graphs, that can be applied to real-world neural networks; and an expansion of the IOI study in GPT-2 small to models containing several billion parameters (GPT-2 XL, Pythia-2.8B, GPT-Neo-2.7B) using swap graphs. We validated our findings through causal scrubbing and targeted interventions that steer the mechanism out-of-distribution predictably. This both demonstrates the potential for applying swap graphs to larger models and presents valuable data to investigate how a model's internal structure changes with scale while performing the same task. A five-minute overview of swap graphs. Motivation. Deciphering the intermediate activations of neural network components can be a daunting task.
The prevalent approach involves breaking down these high-dimensional vectors into subspaces where variations correspond to concepts that are both relevant to the model's computation and comprehensible to humans. In this work, we adopt an alternative strategy: we limit ourselves to using only input swap patching in order to uncover the roles of neural network components without directly exploiting the structure of their activations. While it is common to combine information about directions in activation space and input swap patching experiments in interpretability research (e.g., as demonstrated in Neel Nanda's work on Othello), our object...
This week: a big Humble Bundle pack for the victims of the earthquake in Turkey/Syria, the Pax Dei announcement, Project Fantasy at IOI, the Elden Ring DLC announcement, Sifu on Steam and Xbox on March 28, Notepad discovers tabs, Orbital - Optical Delusion, the Global Spyware Scandal, and the AMD Ryzen 9 7950X3D reviewed. Better yet, read Torréfaction #248 : Annonces Pax Dei, Project Fantasy chez IOI, Elden Ring DLC et les tests du AMD Ryzen 9 7950X3D with its proper layout on Geekzone. Think of your retinas.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A circuit for Python docstrings in a 4-layer attention-only transformer, published by StefanHex on February 20, 2023 on LessWrong. Produced as part of the SERI ML Alignment Theory Scholars Program under the supervision of Neel Nanda - Winter 2022 Cohort. TL;DR: We found a circuit in a pre-trained 4-layer attention-only transformer language model. The circuit predicts repeated argument names in docstrings of Python functions, and it features: 3 levels of composition; a multi-function head that does different things in different parts of the prompt; and an attention head that derives positional information using the causal attention mask. Epistemic Status: We believe that we have identified most of the core mechanics and information flow of this circuit. However, our circuit only recovers up to half of the model performance, and there are a bunch of leads we didn't follow yet. Introduction. Click here to skip to the results & explanation of this circuit. What are circuits? What do we mean by circuits? A circuit in a neural network is a small subset of model components and model weights that (a) accounts for a large fraction of a certain behavior and (b) corresponds to a human-interpretable algorithm. A focus of the field of mechanistic interpretability is finding and better understanding the phenomena of circuits, and recently the field has focused on circuits in transformer language models. Anthropic found the small and ubiquitous Induction Head circuit in various models, and a team at Redwood found the Indirect Object Identification (IOI) circuit in GPT2-small. How we chose the candidate task. We looked for interesting behaviors in a small, attention-only transformer with 4 layers, from Neel Nanda's open source toy language models. It was trained on natural language and Python code. We scanned the code dataset for examples where the 4-layer model did much better than a similar 3-layer one, inspired by Neel's open problems list. Interestingly, despite the circuit seemingly requiring just 3 levels of composition, only the 4-layer model could do the task. The docstring task. The clearest example we found was in Python docstrings, where it is possible to predict argument names in the docstring: In this randomly generated example, a function has the (randomly generated) arguments load, size, files, and last. The docstring convention here demands that each line start with :param followed by an argument name, and this is very predictable. It turns out that attn-only-4l is capable of this task, predicting the next token (files in the example shown here) correctly in ~75% of cases. Methods: Investigating the circuit. Possible docstring algorithms. There are multiple algorithms which could solve this task, such as "Docstring Induction": Always predict the argument that, in the definition, follows the argument seen in the previous docstring line. I.e. look for param size, check the order in the definition size, files, and predict files accordingly. Line number based: In the Nth line predict the Nth variable from the definition, irrespective of the content of the other lines. I.e. after the 3rd param token, predict the 3rd variable files. Inhibition based: Predict variable names from the definition, but inhibit variable names which occurred twice (similar to the inhibition in the IOI circuit), i.e. predict load, size, files, last, and inhibit the former two.
Add some preference for earlier tokens to prefer files over last. We are quite certain that at least the first two algorithms are implemented to some degree. This is surprising, since one of the two should be sufficient to perform the task; we do not investigate further why this is the case. A brief investigation showed that the implementation of the 2nd algorithm seems less robust and less generalizable than our model's implementation of th...
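For readers who want to see the task concretely, here is a prompt in the shape described above, using the (randomly generated) argument names from the example; the function name and the docstring text are invented for illustration, only the :param structure matters.

```python
# A prompt in the shape described above; the model should continue with " files".
prompt = '''def process(load, size, files, last):
    """Process the data.

    :param load: the load to apply
    :param size: the size of the input
    :param'''

expected_next_token = " files"
```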
Pre-IPO Stock Market Update - Jan 27, 2023
00:35 | Stripe raising $3.0b at a $50b to $60b valuation to meet employee liquidity needs
03:34 | Brex's startup report; expense management trends, ad spend trends, startup geographic location trends
05:29 | Forge Q4 2022 trends analysis; pre-IPO stocks that are $10b+ valuation paired with under 10 yr age is sweet spot for investors, IOI and bid/ask dynamics
07:52 | Large capital raises for the week; QuickNode, Reaction Engines, Edgard & Cooper, Monument, Crowdbotics
08:47 | Pre-IPO stock market performance for the week; +1.30% for pre-IPO stocks vs +2.50% for the S&P 500 and +2.69% for S&P Growth
11:51 | Databricks spotlight; valuation, company description, revenue, investors, revenue model overview
AG Dillon & Co venture capital funds...
- AG Dillon SpaceX Pre-IPO Stock Fund = https://agdillon.com/spacex/
- AG Dillon Pre-IPO Equity Fund (top 15 pre-IPO stocks) = www.agdillon.com/agdillon_vcfund
Financial advisors and institutions may register for the live weekly webinar here: www.agdillon.com/webinar
Subscribe...
Youtube = https://www.youtube.com/channel/UCSpr_9yjBA7dhqnQexSu7LA
Apple Podcasts = https://podcasts.apple.com/us/podcast/this-week-in-pre-ipo-stocks/id1653598601
Spotify Podcasts = https://open.spotify.com/show/2ryF1V6y712AsizaRjImOH
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!, published by StefanHex on January 24, 2023 on LessWrong. Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. What if I told you that in just one weekend you can get up to speed doing practical Mechanistic Interpretability research on Transformers? Surprised? Then this is your tutorial! I'll give you a view into how I research Transformer circuits in practice, show you the tools you need, and explain my thought process along the way. I focus on the practical side to get started with interventions; for more background see point 2 below. Prerequisites: Understanding the Transformer architecture: Know what the residual stream is, how attention layers and MLPs work, and how logits & predictions work. For future sections familiarity with multi-head attention is useful. Here's a link to Neel's glossary, which provides excellent explanations for most terms I might use! If you're not familiar with Transformers you can check out Step 2 (6) on Neel's guide or any of the other explanations online; I recommend Jay Alammar's The Illustrated Transformer and/or Milan Straka's lecture series. Some overview of Mechanistic Interpretability is helpful: See e.g. any of Neel's talks, or look at the results in the IOI paper / walkthrough. Basic Python: Familiarity with arrays (as in NumPy or PyTorch, for indices) is useful; but explicitly no PyTorch knowledge required! No hardware required, a free Google Colab account works fine for this. Here's a notebook with all the code from this tutorial! PS: Here's a little web page where you can run some of these methods online! No trivial inconveniences! Step 0: Setup Open a notebook (e.g. Colab) and install Neel Nanda's TransformerLens (formerly known as EasyTransformer). Step 1: Getting a model to play with That's it, now you've got a GPT2 model to play with! TransformerLens supports most relevant open source transformers. Here's how to run the language model Let's have a look at the internal activations: TransformerLens can give you a dictionary with almost all internal activations you ever care about (referred to as “cache”): Here you will find things like the attention pattern blocks.0.attn.hook_pattern, the residual stream before and after each layer blocks.1.hook_resid_pre, and more! You can also access all the weights & parameters of the model in model.named_parameters(). Here you will find weight matrices & biases of every MLP and Attention layer, as well as the embedding & unembedding matrices. I won't focus on these in this guide but they're great to look at! (Exercise: What can the unembedding biases unembed.b_U tell you about common tokens?) Step 2: Let's start analyzing a behavior! Let's go and find some induction heads! I'll make up an example: Her name was Alex Hart. When Alex, with likely completion Hart.
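The tutorial's inline code cells are lost in the audio version; as a rough stand-in, here is a minimal sketch of Steps 0-2 using the TransformerLens objects the entry names (the hook names and the Alex Hart prompt come from the entry; the rest is illustrative):

```python
# Rough sketch of Steps 0-2 above (see the tutorial's own notebook for the real thing).
# pip install transformer_lens
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # Step 1: a GPT-2 model to play with

prompt = "Her name was Alex Hart. When Alex"
logits, cache = model.run_with_cache(prompt)         # logits plus the activation "cache"

# Next-token prediction at the final position (hopefully " Hart")
next_token = logits[0, -1].argmax()
print(model.tokenizer.decode(next_token.item()))

# Internal activations mentioned in the text
print(cache["blocks.0.attn.hook_pattern"].shape)     # layer-0 attention patterns
print(cache["blocks.1.hook_resid_pre"].shape)        # residual stream before layer 1

# Weights & biases, e.g. the unembedding bias from the exercise
print(dict(model.named_parameters())["unembed.b_U"].shape)
```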
TransformerLens has a little tool to plot a tokenized prompt, model predictions, and associated logits: I find it is useful to spend a few minutes thinking about which information is needed to solve the task: the model needs to (1) realize that the last token, Alex, is a repetition of a previous occurrence, and (2) copy the last name from after the previous Alex occurrence to the last token as its prediction. Method 1: Residual stream patching The number 1 thing I try when I want to reverse engineer a new behavior is to find where in the network the information is “traveling”. In transformers, the model keeps track of all information in the residual stream. Attention heads & MLPs read from the residual stream, perform some computation or information moving, and write their outputs back into the residual stream. I think of this stream as having a couple of “lanes” corresponding to each token position. Over the course of the model...
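The entry cuts off before the patching experiment itself; the sketch below compresses the kind of residual stream patching it describes into a single loop. The clean prompt and the " Hart" answer come from the entry; the corrupted prompt and everything else are illustrative assumptions:

```python
# Compressed, illustrative sketch of residual stream patching with TransformerLens.
# Assumes the clean and corrupted prompts tokenize to the same length.
import torch
from functools import partial
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2")

clean_tokens = model.to_tokens("Her name was Alex Hart. When Alex")
corrupt_tokens = model.to_tokens("Her name was Alex Smith. When Alex")
answer = model.to_tokens(" Hart", prepend_bos=False)[0, 0]   # first token of the answer

_, clean_cache = model.run_with_cache(clean_tokens)

def patch_resid_pre(resid_pre, hook, pos):
    # Restore the clean run's residual stream at one position during the corrupted run
    resid_pre[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid_pre

n_layers, n_pos = model.cfg.n_layers, clean_tokens.shape[1]
recovery = torch.zeros(n_layers, n_pos)
for layer in range(n_layers):
    for pos in range(n_pos):
        hook_name = utils.get_act_name("resid_pre", layer)
        logits = model.run_with_hooks(
            corrupt_tokens,
            fwd_hooks=[(hook_name, partial(patch_resid_pre, pos=pos))],
        )
        recovery[layer, pos] = logits[0, -1, answer]
# Large values mark the (layer, position) cells where the relevant information "travels".
```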
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I have thousands of copies of HPMOR in Russian. How to use them with the most impact?, published by Samin on December 27, 2022 on The Effective Altruism Forum. As a result of a crowdfunding campaign a couple of years ago, I printed 21k copies of HPMOR. 11k of those went to the crowdfunding participants. I need ideas for how to use the ones that are left with the most impact. Over the last few weeks, I've sent 400 copies to winners of IMO, IOI, and other international and Russian olympiads in math, computer science, biology, etc. (and also bought and sent copies of Human Compatible in Russian to about 250 of them who requested it as well). Over the next month, we'll get the media to post about the opportunity for winners of certain olympiads to get free books. I estimate that we can get maybe twice as many (800-1000) impressive students to fill out a form for getting HPMOR and get maybe up to 30k people who follow the same media to read HPMOR online if everything goes well (uncertain estimates, from the previous experience). The theory of change behind sending the books to winners of olympiads is that people with high potential read HPMOR and share it with friends, get some EA-adjacent mindset and values from it, and then get introduced to EA in emails about 80k Hours (which is being translated into Russian) and other EA and cause-specific content, and start participating in the EA community and get more into specific cause areas that interest them. The anecdotal evidence is that most of the Russian EAs, many of whom now work full-time at EA orgs or as independent researchers, got into EA after reading HPMOR and then the LW sequences. It's mostly impossible to give the books for donations to EA charities because Russians can't transfer money to EA charities due to sanctions. Delivering the copies costs around $5-10 in Russia and $40-100 outside Russia. There are some other ideas, but nothing that lets us spend thousands of books effectively, so we need more. So, any ideas on how to spend 8-9k of HPMOR copies in Russian? HPMOR has endorsements from a bunch of Russians important to different audiences. That includes some of the most famous science communicators, computer science professors, literature critics, a person training the Moscow team for math competitions, etc.; currently, there are news media and popular science resources with millions of followers managed by people who've read HPMOR or heard of it in a positive way and who I can talk to. After the war started, some of them were banned, but that doesn't affect the millions of followers they have on, e.g., Telegram. Initially, we planned first to translate and print 80k and then give HPMOR to people together with copies of 80k, but there's now time pressure due to uncertainty over the future legal status of HPMOR in Russia. Recently, a law came into effect that prohibited all sorts of mentions of LGBTQ in books, and it's not obvious whether HPMOR is at risk, so it's better to spend a lot of books now. E.g., there's the Yandex School of Data Analysis, one of the top places for students in Russia to get into machine learning; we hope to be able to get them to give hundreds of copies of HPMOR to students who've completed a machine learning course and give us their emails. 
That might result in more people familiar with the alignment problem in positions where they can help prevent an AI arms race started by Russia. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
It's time to head to the Oasis and tackle Ready Player One with our great friend, Daryl Ewry! We get in the DeLorean and head back a few years to visit Spielberg's sci-fi gem and see how it holds up! But before we get started we each break down our top 5 movies that have a riddle/test/challenge in them. Find out which we picked! Then, we get our X1 Boot Suits on and respawn in the Oasis! So grab your visors, watch out for Kong, and don't forget the microfiber crotch inlays... it's time for Ready Player One on The Movie Defenders podcast! Click here to listen and connect anywhere: https://linktr.ee/moviedefenders 00:00:00 Intro and Announcements 00:20:17 Top 5 Movies w/ Riddles/Tests/Challenges 00:54:01 Ready Player One Discussion Starts 01:39:34 Meet Art3mis 01:59:59 IOI, I-R0k, and the Orb 02:16:51 The Distracted Globe 02:33:38 Sorrento Makes an Offer 02:47:20 Welcome to the Rebellion 03:02:15 IOI Attacks 03:16:20 We're Not Gonna Take It 03:41:58 The Cataclyst 03:50:22 Wade Wins the Contest 03:57:21 Morrow, Lawyers, and the End Special thanks to all of our amazing Patreon supporters! Brett Bowen Barrett Young Brev Tanner Eric Blattberg Kevin Athey Marcin Miduch Jason Chastain Joshua Loy Josh Evans Mark Martin Megan Bush Richard Cree Alexis Borchardt Mark Nattress Bart German Alex Kirby Michael Puckett Ena Haynes Daryl Ewry Sean Masters Randal Silver Attack of the Killer Podcast
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien), published by Neel Nanda on November 7, 2022 on The AI Alignment Forum. New paper walkthrough: Interpretability in the Wild: A Circuit for Indirect Object Identification In GPT-2 Small is a really exciting new mechanistic interpretability paper from Redwood Research. They reverse engineer a 26(!) head circuit in GPT-2 Small, used to solve Indirect Object Identification: the task of understanding that the sentence "After John and Mary went to the shops, John gave a bottle of milk to" should end in Mary, not John. I think this is a really cool paper that illustrates how to rigorously reverse engineer real models, and is maybe the third example of a well-understood circuit in a real model. So I wanted to understand it better by making a walkthrough (and hopefully help other people understand it too!). In this walkthrough I'm joined by three of the authors, Kevin Wang, Arthur Conmy and Alexandre Variengien. In Part 1 we give an overview of the high-level themes in the paper and what we think is most interesting about it. If you're willing to watch an hour of this and want more, in Part 2 we do a deep dive into the technical details of the paper and read through it together, and dig into the details of how the circuit works and the techniques used to discover this. If you're interested in contributing to this kind of work, apply to Redwood's REMIX program! Deadline Nov 13th. We had some technical issues in filming this, and my video doesn't appear - sorry about that! I also tried my hand at some amateur video editing to trim it down - let me know if this was worthwhile, or if it's bad enough that I shouldn't have bothered lol. If you find this useful, you can check out my first walkthrough on A Mathematical Framework for Transformer Circuits. And let me know if there are more interpretability papers you want walkthroughs on! If you want to try exploring this kind of work for yourself, there's a lot of low-hanging fruit left to pluck! Check out my EasyTransformer library and the accompanying codebase to this paper! A good starting point is their notebook giving an overview of the key experiments or my notebook modelling what initial exploration on any task like this could look like. Some brainstormed future directions: 3-letter acronyms (or more!) Converting names to emails. An extension task is e.g. constructing an email from a snippet like the following: Name: Neel Nanda; Email: last name dot first name k @ gmail Grammatical rules Learning that words after full stops are capital letters Verb conjugation Choosing the right pronouns (e.g. he vs she vs it vs they) Whether something is a proper noun or not Detecting sentiment (e.g. predicting whether something will be described as good vs bad) Interpreting memorisation. E.g., there are times when GPT-2 knows surprising facts like people's contact information. How does that happen? Counting objects described in text. E.g.: I picked up an apple, a pear, and an orange. I was holding three fruits. Extensions from Alex Variengien Understanding what's happening in the adversarial examples: most notably S-Inhibition Head attention pattern (hard). (S-Inhibition heads are mentioned in the IOI paper) Understanding how positional signals are encoded (relative distance, something else?)
Bonus points if we have a story that includes the positional embeddings and explains how the difference between positions is computed (if relative is the right framework) by Duplicate Token Heads / Induction Heads. (hard, but less context-dependent) What is the role of MLPs in IOI? (quite broad and hard) What is the role of Duplicate Token Heads outside IOI? Are they used in other Q-compositions with S-Inhibition Heads? Can we describe how their QK cir...
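For reference, the behavior being reverse engineered here is easy to check directly. A minimal sketch (not from the walkthrough itself) of measuring the IOI logit difference on GPT-2 small with TransformerLens, using the example sentence quoted above:

```python
# Minimal sketch: check that GPT-2 small prefers the indirect object ("Mary")
# over the repeated subject ("John") on the sentence quoted above.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "After John and Mary went to the shops, John gave a bottle of milk to"
final_logits = model(prompt)[0, -1]

mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")
print("logit diff (Mary - John):", (final_logits[mary] - final_logits[john]).item())  # expect > 0
```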
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small, published by KevinRoWang on October 28, 2022 on The AI Alignment Forum. To learn more about this work, check out the paper. We assume general familiarity with transformer circuits. Intro: There isn't much interpretability work that explains end-to-end how a model is able to do some task (except for toy models). In this work, we make progress towards this goal by understanding some of the structure of GPT-2 small “in the wild” by studying how it computes a simple natural language task. The task we investigate is what we call indirect object identification (IOI), where sentences like “When John and Mary went to the store, John gave a drink to” should be completed with “Mary.” We discovered the structure of a circuit of 26 attention heads grouped into 7 main classes, the largest end-to-end attempt to reverse engineer how a LM computes a natural behavior (to our knowledge). There is still much missing from our explanation, however, and our explanation doesn't go to the parameter level. Besides discovering the particular circuit shown above, we gained some interesting insights about low-level phenomena arising inside language models. For example, we found attention heads communicating with pointers (sharing the location of a piece of information instead of copying it). We also identified heads compensating for the loss of function of other heads, and heads contributing negatively to the correct next-token prediction. We're excited to see if these discoveries generalize beyond our case study. Since explanations of model behavior can be confused or non-rigorous, we used our knowledge to design adversarial examples. Moreover, we formulate 3 quantitative criteria to test the validity of our circuit. These criteria partially validate our circuit but indicate that there are still gaps in our understanding. This post is a companion post to our paper where we share lessons that we learned from doing this work and describe some of Redwood's interpretability perspectives. We share high-level takeaways and give specific examples from the work to illustrate them. Goals of Interpretability for this Investigation In this kind of mechanistic interpretability work, we tend to use the circuits abstraction. If we think of a model as a computational graph where nodes are terms in its forward pass (neurons, attention heads, etc) and edges are the interactions between those terms (residual connections, attention, etc), a circuit is a subgraph of the model responsible for some behavior. Note that our work is slightly different from Chris Olah's/Anthropic's idea of a circuit in that we investigate this circuit on a specific distribution (instead of the entire distribution of text) and we also don't attain an understanding of the circuit at the parameter-level. Structure: Having all the nodes and edges One kind of valuable interpretability insight is to ensure that we have the correct subgraph, which is one of the main goals in this work. We formulate three main quantitative criteria to measure progress towards this goal. These criteria rely on the idea of "knocking out" or turning off a node from the computational graph by replacing its activation by its mean on a distribution where the IOI task is not present. 
(Note that we now believe that our causal scrubbing algorithm provides a more robust way to validate circuits.) Faithfulness: Intuition: The circuit performs the task as well as the model. Computation: A model with everything but the circuit knocked out should achieve the same performance according to some task benchmark. Completeness: Intuition: The circuit contains all the nodes that are important for the task. (Note that completeness implies faithfulness). Computation: For every subset of nodes in the ci...
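To make the "knocking out" idea above concrete, here is a rough sketch of mean-ablating a single attention head with TransformerLens. The reference prompts stand in for the paper's mean over a distribution where the IOI task is absent, and head 9.9 (reported as a name mover head in the paper) is just an example choice:

```python
# Sketch of "knocking out" one attention head by mean-ablating its output, in the spirit
# of the knockouts described above. The reference prompts are a tiny illustrative stand-in
# for the paper's reference distribution.
import torch
from functools import partial
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2")

ioi_prompt = "After John and Mary went to the shops, John gave a bottle of milk to"
reference_prompts = [
    "The weather in London has been unusually warm this week and",
    "Researchers published a new study on ocean currents that",
]

layer, head = 9, 9  # an example: a name mover head reported in the IOI paper
hook_name = utils.get_act_name("z", layer)  # per-head outputs before the output projection

# Mean of this head's output over the reference prompts (averaged over batch & position)
_, ref_cache = model.run_with_cache(reference_prompts)
mean_z = ref_cache[hook_name][:, :, head, :].mean(dim=(0, 1))

def knock_out(z, hook, head_index, mean_vec):
    z[:, :, head_index, :] = mean_vec  # replace the head's activation everywhere
    return z

clean_logits = model(ioi_prompt)[0, -1]
ablated_logits = model.run_with_hooks(
    ioi_prompt,
    fwd_hooks=[(hook_name, partial(knock_out, head_index=head, mean_vec=mean_z))],
)[0, -1]

mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")
print("clean logit diff:  ", (clean_logits[mary] - clean_logits[john]).item())
print("ablated logit diff:", (ablated_logits[mary] - ablated_logits[john]).item())
```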
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley, published by maxnadeau on October 27, 2022 on The AI Alignment Forum. This winter, Redwood Research is running a coordinated research effort on mechanistic interpretability of transformer models. We're excited about recent advances in mechanistic interpretability and now want to try to scale our interpretability methodology to a larger group doing research in parallel. REMIX participants will work to provide mechanistic explanations of model behaviors, using our causal scrubbing methodology (forthcoming within a week) to formalize and evaluate interpretability hypotheses. We hope to produce many more explanations of model behaviors akin to our recent work investigating behaviors of GPT-2-small, toy language models, and models trained on algorithmic tasks (also forthcoming). We think this work is a particularly promising research direction for mitigating existential risks from advanced AI systems (more in Goals and FAQ). Apply here by November 8th to be a researcher in the program. Apply sooner if you'd like to start early (details below) or receive an earlier response. Some key details: We expect to accept 30-50 participants. We plan to have some researchers arrive early, with some people starting as soon as possible. The majority of researchers will likely participate during the months of December and/or January. We expect researchers to participate for a month minimum, and (all else equal) will prefer applicants who are able to come for longer. We'll pay for housing and travel, and also pay researchers for their time. We'll clarify the payment structure prior to asking people to commit to the program. We're interested in some participants acting as team leaders who would help on-board and provide research advice to other participants. This would involve arriving early to get experience with our tools and research directions and participating for a longer period (~2 months). You can indicate interest in this role in the application. We're excited about applicants with a range of backgrounds; we're not expecting applicants to have prior experience in interpretability research. Applicants should be comfortable working with Python, PyTorch/TensorFlow/Numpy (we'll be using PyTorch), and linear algebra. We're particularly excited about applicants with experience doing empirical science in any field. We'll allocate the first week to practice using our interpretability tools and methodology; the rest will be researching in small groups. See Schedule. Feel free to email programs@rdwrs.com with questions. Goals Research output. We hope this program will produce research that is useful in multiple ways: We'd like stronger and more grounded characterizations of how language models perform a certain class of behaviors. For example, we currently have a variety of findings about how GPT-2-small implements indirect object identification (“IOI”, see next section for more explanation), but aren't yet sure how often they apply to other models or other tasks. We'd know a lot more if we had a larger quantity of this research. For each behavior investigated, we think there's some chance of stumbling across something really interesting. 
Examples of this include induction heads and the “pointer manipulation” result in the IOI paper: not only does the model copy information between attention streams, but it also copies “pointers”, i.e. the position of the residual stream that contains the relevant information. We're interested in learning whether different language models implement the same behaviors in similar ways. We'd like a better sense of how good the current library of interpretability techniques is, and we'd like to get ideas for new techniques. We'd like to have more exam...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley, published by Max Nadeau on October 27, 2022 on The Effective Altruism Forum. This winter, Redwood Research is running a coordinated research effort on mechanistic interpretability of transformer models. We're excited about recent advances in mechanistic interpretability and now want to try to scale our interpretability methodology to a larger group doing research in parallel. REMIX participants will work to provide mechanistic explanations of model behaviors, using our causal scrubbing methodology (forthcoming within a week) to formalize and evaluate interpretability hypotheses. We hope to produce many more explanations of model behaviors akin to our recent work investigating behaviors of GPT-2-small, toy language models, and models trained on algorithmic tasks (also forthcoming). We think this work is a particularly promising research direction for mitigating existential risks from advanced AI systems (more in Goals and FAQ). Apply here by November 8th to be a researcher in the program. Apply sooner if you'd like to start early (details below) or receive an earlier response. Some key details: We expect to accept 30-50 participants. We plan to have some researchers arrive early, with some people starting as soon as possible. The majority of researchers will likely participate during the months of December and/or January. We expect researchers to participate for a month minimum, and (all else equal) will prefer applicants who are able to come for longer. We'll pay for housing and travel, and also pay researchers for their time. We'll clarify the payment structure prior to asking people to commit to the program. We're interested in some participants acting as team leaders who would help on-board and provide research advice to other participants. This would involve arriving early to get experience with our tools and research directions and participating for a longer period (~2 months). You can indicate interest in this role in the application. We're excited about applicants with a range of backgrounds; we're not expecting applicants to have prior experience in interpretability research. Applicants should be comfortable working with Python, PyTorch/TensorFlow/Numpy (we'll be using PyTorch), and linear algebra. We're particularly excited about applicants with experience doing empirical science in any field. We'll allocate the first week to practice using our interpretability tools and methodology; the rest will be researching in small groups. See Schedule. Feel free to email programs@rdwrs.com with questions. Goals Research output. We hope this program will produce research that is useful in multiple ways: We'd like stronger and more grounded characterizations of how language models perform a certain class of behaviors. For example, we currently have a variety of findings about how GPT-2-small implements indirect object identification (“IOI”, see next section for more explanation), but aren't yet sure how often they apply to other models or other tasks. We'd know a lot more if we had a larger quantity of this research. For each behavior investigated, we think there's some chance of stumbling across something really interesting. 
Examples of this include induction heads and the “pointer manipulation” result in the IOI paper: not only does the model copy information between attention streams, but it also copies “pointers”, i.e. the position of the residual stream that contains the relevant information. We're interested in learning whether different language models implement the same behaviors in similar ways. We'd like a better sense of how good the current library of interpretability techniques is, and we'd like to get ideas for new techniques. We'd like to have mo...
Grab your VR goggles and fancy gloves because this week we are heading to the OASIS! It's finally time to fly into the future as we get our gamer on in covering Spielberg's Ready Player One. It's up to Parzival, Artemis and friends to unlock the hidden Easter Egg in the virtual world of the OASIS before the dastardly Sorrento and his IOI goons can get their hands on it. In a world all about an alternate reality, a famous creator has left a clue-ridden path to three virtual keys to his own virtual kingdom. And man, this movie really felt like it came from another world! It delivered a blast of a story, some really fun characters and a virtual environment beyond anyone's wildest imagination. While it may have ruffled some feathers of the fans of the beloved book it's based on, it boasts some fascinating CGI, a banging soundtrack and so many references it'll make even the most hardcore gamers blush. Can Parzival and the High Five obtain the keys, stop Sorrento and tell these bad reviewers it's game over? Tune in to find out! If you enjoyed the episode, please rate and share! It goes such a long way and I am so grateful for all of the support. And hey, if you do leave a review of the show, you will be featured as well! How neat is that? Also, be sure to give my official Ko-Fi page a gander. There you can get access to all kinds of cool perks and prizes from your very own shoutout on the show to choosing which movie I cover next. Link down below! You can also find me filling the world with more #BRBR insanity on Insta @billreadsbadreviews and Twitter @BRBR_Bill. Stay awesome. Support BRBR on Ko-Fi: https://ko-fi.com/billreadsbadreviews
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Help out Redwood Research's interpretability team by finding heuristics implemented by GPT-2 small, published by Haoxing Du on October 12, 2022 on LessWrong. Some of Redwood's current research involves finding specific behaviors that language models exhibit, and then doing interpretability to explain how the model does these behaviors. One example of this is the indirect object identification (IOI) behavior, investigated in a forthcoming paper of ours: given the input When John and Mary went to the store, Mary gave a flower to, the model completes John instead of Mary. Another example is the acronym generation task: given the input In a statement released by the Big Government Agency (, the model completes BGA). We are considering scaling up this line of research a bunch, and that means we need a lot more behaviors to investigate! The ideal tasks that we are looking for have the following properties: The task arises in a subset of the training distribution. Both the IOI and the acronym tasks are abundant in the training corpus of language models. This means that we are less interested in tasks specific to inputs that never appear in the training distribution. The ideal task can even be expressed as a regular expression that can be run on the training corpus to obtain the exact subset, as is the case for acronyms. The IOI task is less ideal in this sense, since it is harder to identify the exact subset of the training distribution that involves IOI. There is a simple heuristic for the task. For IOI, the heuristic is “fill in the name that appeared only once so far”. For acronyms, the heuristic is “string together the first letter of each capitalized word, and then close the parentheses”. Note that the heuristic does not have to apply to every instance of the task in the training distribution, e.g. sometimes an acronym is not formed by simply taking the first letter of each word. The gold standard here is if the heuristic can be implemented in an automated way, e.g. as a Python function, but we would also consider the task if a human is needed to supply the labels. GPT-2 small implements this heuristic. We are focusing on the smallest model in the GPT-2 family right now, which is a 117M parameter model that is frankly not that good at most things. We are moderately interested in tasks that bigger models can do that GPT-2 small can't, but the bar is at GPT-2 small right now. Examples The following is a list of tasks that we have found so far/are aware of. Induction and acronym generation remain the tasks that best meet all of the above desiderata. Induction: If the model has seen . A B . in the input, then it is more likely to output B immediately after A when encountering A again. A common use case of induction is remembering complicated names: T|cha|ik|ovsky . T| -> cha|ik|ovsky Acronym generation IOI Email address generation: Hi! My name is John Smith, and I work at Apple. My email address is -> johnsmith@apple.com Subject-verb agreement: The keys to the cabinet -> are and not is, even though the singular “cabinet” is also present in the prompt. (Credit to Nate Thomas) Incrementing citation numbers: Most metals are excellent conductors [1]. 
However, this is not always the case [ -> 2] (Credit to Lucas Sato) Variations of induction and IOI Extended IOI: identify the right person out of a group with more than two names Inferring someone's last name from name of relative dog_in_house.jpg”> (Credit to Rob Miles) Some examples that we are less excited about include: Closing doc strings, parentheses, etc Close variations of known examples We would love for interested people to contribute ideas! Below are some resources we put together to make the search as easy as possible: You can play with GPT-2 small here in a web app that we built specifically for investigating model b...
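As a quick local illustration of checking one of the listed heuristics (the post itself points readers to Redwood's web app), here is a minimal sketch of prompting GPT-2 small with the acronym example above via TransformerLens:

```python
# Minimal sketch: does GPT-2 small complete the acronym "BGA)" for the prompt above?
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "In a statement released by the Big Government Agency ("
tokens = model.to_tokens(prompt)

# Greedy decoding of a few tokens; we expect something like "BGA)" to appear
out = model.generate(tokens, max_new_tokens=4, do_sample=False)
print(model.to_string(out[0]))
```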
On this episode of The AIE Podcast... When New Content and the Summer of Love Collide We have a Flashpoint Newsflash coming in fast AIE in ESO goes to 11! Guild Wars 2 gets a little more balanced! And, we are talking amongst ourselves All that and more coming up right now... Podcast Audio Raw Video http://youtu.be/ZVlZpbHIxfY Open Welcome to episode #388 of the podcast celebrating you, the Alea Iacta Est gaming community, the die has been podcast. This is Tetsemi: To my left is Mewkow: - (catch phrase here). And to my right is Mkallah: (Hey folks, there are Fig scones in the guild kitchen - and I don't mean the cat). This week we are having a host show! Welcome US!! Ok, we'll be digging into what we've been up to shortly, but first, let's cover this week's news... AIE News Community Mandatory Fun Nights Where the fun is mandatory but the attendance is not. Sunday - WoW Classic 2 pm Eastern Sunday - STO 8:30 pm Eastern Monday - GW2 9:30 pm Eastern Tuesday - SWTOR 9 pm Eastern Tuesday - Wednesday - Thursday - FFXIV (Sprout Raid) 10:00 pm Eastern Friday - ESO 9 pm Eastern Friday - FFXIV (Mount Farming) Various times, usually starting at 8pm EST or later Saturday - LotRO 8:30 pm Eastern Saturday - FFXIV (Maps) 9:30 pm Eastern Saturday - Noob Raid (WoW) 11 pm Eastern Summer of Love SOL 2022 Notes Start Aug 2nd, Tuesday Walk at 9:30pm EDT - WoW = Gusty (Shrine of the Fallen Soldier -> Booty Bay) - WoW Classic = Looci - (Shrine of the Fallen Soldier -> Gates of Orgrimmar) - Lotro = Maellung (will adjust the MFN from Monday to Friday) - Bree above the Bluff - SWTOR = BC - Alderaan - ESO = Redrum/Ken (Panther Fang Temple) - STO = Grebog (Wolf 359 Memorial) - GW = Dejara (Lion's Arch - Memorial Wall) - FFXIV = Abovan (FC House) End Aug 7th Gusty - in charge of the WoW coordination will sort out route and fireworks. Pet battles/giveaways, Transmog runs, Visit the other side (horde/alliance flip) Gusty - D&D discussions - All sessions 4-5 players (make slots for 5?) DP Roberts - Thurs 8/4 8pm-11pm Eastern Palell - Fri 8/5 9pm-12am Eastern Duskmire - Sat 8/6 7pm-10pm Eastern Mementh - Sun 8/7 6pm-9pm Eastern. Titles and other details to come Streaming and Guild Podcast News SWTOR Escape Pod Cast https://www.newoverlords.com/swtor-escape-pod-cast-430-seeking-binocs/ We have a player-made guild event coming next week that uses the macrobinoculars and seeker droids. You may have forgotten about them or never had the chance to start those missions. Behind the Games Podcast The People Are The Studio – Behind The Games chat with Hector Padilla of IO Interactive https://www.newoverlords.com/the-people-are-the-studio-behind-the-games-chat-with-hector-padilla-of-io-interactive/ Many game studios are having a moment right now where they need scale and to learn to be better, more professional companies. IO Interactive has this figured out with leaders like Hector Padilla, a people manager at IO Interactive. IOI is the game studio that makes the Hitman series and soon a new 007 game. Hector focuses on making teams work together while developing all these great games. We talk here about an inside view of IO Interactive and Hitman, how to run a studio like a mature company, how to connect teams and people, and a little about the future of gaming. Thanks Hector and IOI! Working Class Nerds (NSFW) - Episode 158: Multiverse of Fetta Cheese! https://www.buzzsprout.com/143519/10888093 Marcus and Nick discuss Doctor Strange in the Multiverse of Madness with a great friend of the show, Joey Fetta!!
The three of them indulge in their usual ridiculous sidebar conversations, and give their breakdown of the movie! Enjoy! Boards and Swords Podcast BattleTech, Necromunda, Origins 2022 (Special Guest Ken from Geek-Craft!) - Boards & Swords 198 Apparently Backerkit is making its own Kickstarter-like platform? Also Magic: The Gathering is 30 years old?