While people are getting used to the capabilities of AI Large Language Models (LLMs), IoT devices are feeding LSTM models on a massive scale, turning data into information. From warm Brazil, OpenAI red-teamer Hugo shares his example source code so you can get started with AI LSTM models yourself. From the Waterinfodag in Den Bosch, Stijn Kuijpers and Ron Bruijns explain the application side in the water sector. With boots on the ground (or in the ditch), half an hour brings you back up to speed.

Shortcuts:
1:28 Trendwatcher Vincent Everts on AI
3:26 Cmotions Stijn Kuijpers
8:14 HKV Ron Bruijns
15:12 Hugo Gobato Souto on LSTM
27:50 Summary

Links:
https://cmotions.com/
https://www.hkv.nl/
Hugo LinkedIn: https://www.linkedin.com/in/hugo-gobato-souto-669b17161/
Hugo LSTM example code: https://github.com/rrrRbert360/LSTM_example_Hugo_Gobato_Souto
Flood forecasting at Google: https://research.google/blog/using-ai-to-expand-global-access-to-reliable-flood-forecasts/
Photo credits: Cmotions, who captured the moment a camera crew from Marcel van Roosmalen secretly listened in on us.

Join in! Suggest topics or record your questions via the WhatsApp group De IoT Gesprekken: https://chat.whatsapp.com/GnFHYAqWojzKSyF653Ai03
Help out! Rate this episode or leave a message on your podcast platform!
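For a sense of what such a starter example typically looks like, here is a minimal sketch of an LSTM forecaster in Python/PyTorch. It is not the code from Hugo's repository: the synthetic sine-wave "water level" series, the window size, and the layer sizes are all illustrative assumptions.

```python
# Minimal LSTM time-series sketch (illustrative only, not the linked repository's code).
import torch
import torch.nn as nn

# Synthetic "water level" series: a noisy sine wave (assumption for illustration).
t = torch.linspace(0, 100, 1000)
series = torch.sin(0.2 * t) + 0.1 * torch.randn(1000)

# Turn the series into (window -> next value) training pairs.
window = 24
X = torch.stack([series[i:i + window] for i in range(len(series) - window)]).unsqueeze(-1)
y = series[window:].unsqueeze(-1)

class LSTMForecaster(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)          # out: (batch, window, hidden)
        return self.head(out[:, -1])   # predict the next value from the last step

model = LSTMForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print("final training loss:", loss.item())
```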
Welcome to our new mini-series on research capacity strengthening, produced in partnership with the Centre for Capacity Research at the Liverpool School of Tropical Medicine. In this first episode, we explore the design of a Research Capacity Strengthening (RCS) component within an implementation research project, and why it is crucial for sustainable, patient-centred healthcare. Our conversation draws insights from those actively embedding RCS into their work, demonstrating how improving research skills and systems drives real impact in global health.

In this episode:

Dr. Justin Pulford - Reader at the Centre for Capacity Research, Liverpool School of Tropical Medicine.
Dr. Justin Pulford is Deputy Head of the Centre for Capacity Research (CCR) at the Liverpool School of Tropical Medicine (LSTM). A behavioural scientist by training, he has extensive experience developing, implementing, and evaluating research and health systems strengthening initiatives across sub-Saharan Africa and the South Pacific. Dr Pulford also convenes the 'TROP 703: Public Health Programmes, Policies and Strategies' module for LSTM's MPH programme.

Professor Obiageli Nnodu - Co-lead of the PACTS programme, University of Abuja.
Professor Obiageli Nnodu is Professor of Haematology and Blood Transfusion at the University of Abuja, Nigeria, and Director of its Centre of Excellence for Sickle Cell Disease Research and Training. She leads multiple NIH-funded sickle cell projects and chairs Africa's largest SCD network. Professor Nnodu also advises the Nigerian government on non-communicable diseases and serves on WHO AFRO committees dedicated to improving sickle cell care.

Professor Alex Osei-Akoto - Kwame Nkrumah University of Science and Technology, Ghana.
Professor Alex Osei-Akoto is Principal Investigator for PACTS in Ghana. A Professor of Child Health at Kwame Nkrumah University of Science and Technology (KNUST) and Consultant Paediatrician at Komfo Anokye Teaching Hospital, he has focused on sickle cell disease for over two decades. Professor Osei-Akoto led key newborn screening initiatives, advised Ghana's Ministry of Health on SCD, and co-authored numerous publications. He now spearheads PACTS implementation in Ghana, building on his extensive clinical and research leadership in paediatric haematology.

Dr. Catherine Chunda-Liyoka - University of Zambia.
Dr. Catherine Chunda-Liyoka is Head of the Paediatric Haematology Department at Zambia's University Teaching Hospitals–Children's Hospital. She provides specialised care in sickle cell disease, haemophilia, aplastic anaemia, HIV, and TB, while leading multiple research initiatives. As an Honorary Lecturer at the University of Zambia and an Honorary Fellow at LSTM, she mentors health workers nationwide. Dr. Chunda-Liyoka also advises the Zambian Ministry of Health on paediatric haematology and infectious diseases, and plays a key role in major SCD networks, including SPARCO and PACTS, to strengthen national guidelines and clinical practices.

Useful links:
Centre for Capacity Research | LSTM
Patient-centred sickle cell disease management in sub-Saharan Africa (PACTS) | LSTM

Want to hear more podcasts like this?
Follow Connecting Citizens to Science on your usual podcast platform or YouTube to hear more about current research and debates within global health. The podcast cuts across disciplines, including health systems strengthening, gender and intersectionality, tropical diseases (NTDs, TB, Malaria), maternal and child healthcare (antenatal and postnatal care), mental...
Sepp Hochreiter, the inventor of LSTM (Long Short-Term Memory) networks, a foundational technology in AI, joins us. Sepp discusses his journey, the origins of LSTM, and why he believes his latest work, xLSTM, could be the next big thing in AI, particularly for applications like robotics and industrial simulation. He also shares his controversial perspective on Large Language Models (LLMs) and why reasoning is a critical missing piece in current AI systems.

SPONSOR MESSAGES:
***
CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments. Check out their super fast DeepSeek R1 hosting! https://centml.ai/pricing/
Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier focussed on o-series style reasoning and AGI. They are hiring a Chief Engineer and ML engineers. Events in Zurich. Go to https://tufalabs.ai/
***

TRANSCRIPT AND BACKGROUND READING:
https://www.dropbox.com/scl/fi/n1vzm79t3uuss8xyinxzo/SEPPH.pdf?rlkey=fp7gwaopjk17uyvgjxekxrh5v&dl=0

Prof. Sepp Hochreiter
https://www.nx-ai.com/
https://x.com/hochreitersepp
https://scholar.google.at/citations?user=tvUH3WMAAAAJ&hl=en

TOC:
1. LLM Evolution and Reasoning Capabilities
[00:00:00] 1.1 LLM Capabilities and Limitations Debate
[00:03:16] 1.2 Program Generation and Reasoning in AI Systems
[00:06:30] 1.3 Human vs AI Reasoning Comparison
[00:09:59] 1.4 New Research Initiatives and Hybrid Approaches
2. LSTM Technical Architecture
[00:13:18] 2.1 LSTM Development History and Technical Background
[00:20:38] 2.2 LSTM vs RNN Architecture and Computational Complexity
[00:25:10] 2.3 xLSTM Architecture and Flash Attention Comparison
[00:30:51] 2.4 Evolution of Gating Mechanisms from Sigmoid to Exponential
3. Industrial Applications and Neuro-Symbolic AI
[00:40:35] 3.1 Industrial Applications and Fixed Memory Advantages
[00:42:31] 3.2 Neuro-Symbolic Integration and Pi AI Project
[00:46:00] 3.3 Integration of Symbolic and Neural AI Approaches
[00:51:29] 3.4 Evolution of AI Paradigms and System Thinking
[00:54:55] 3.5 AI Reasoning and Human Intelligence Comparison
[00:58:12] 3.6 NXAI Company and Industrial AI Applications

REFS:
[00:00:15] Seminal LSTM paper establishing Hochreiter's expertise (Hochreiter & Schmidhuber) https://direct.mit.edu/neco/article-abstract/9/8/1735/6109/Long-Short-Term-Memory
[00:04:20] Kolmogorov complexity and program composition limitations (Kolmogorov) https://link.springer.com/article/10.1007/BF02478259
[00:07:10] Limitations of LLM mathematical reasoning and symbolic integration (Various Authors) https://www.arxiv.org/pdf/2502.03671
[00:09:05] AlphaGo's Move 37 demonstrating creative AI (Google DeepMind) https://deepmind.google/research/breakthroughs/alphago/
[00:10:15] New AI research lab in Zurich for fundamental LLM research (Benjamin Crouzier) https://tufalabs.ai
[00:19:40] Introduction of xLSTM with exponential gating (Beck, Hochreiter, et al.) https://arxiv.org/abs/2405.04517
[00:22:55] FlashAttention: fast & memory-efficient attention (Tri Dao et al.) https://arxiv.org/abs/2205.14135
[00:31:00] Historical use of sigmoid/tanh activation in 1990s (James A. McCaffrey) https://visualstudiomagazine.com/articles/2015/06/01/alternative-activation-functions.aspx
[00:36:10] Mamba 2 state space model architecture (Albert Gu et al.) https://arxiv.org/abs/2312.00752
[00:46:00] Austria's Pi AI project integrating symbolic & neural AI (Hochreiter et al.) https://www.jku.at/en/institute-of-machine-learning/research/projects/
[00:48:10] Neuro-symbolic integration challenges in language models (Diego Calanzone et al.) https://openreview.net/forum?id=7PGluppo4k
[00:49:30] JKU Linz's historical and neuro-symbolic research (Sepp Hochreiter) https://www.jku.at/en/news-events/news/detail/news/bilaterale-ki-projekt-unter-leitung-der-jku-erhaelt-fwf-cluster-of-excellence/

YT: https://www.youtube.com/watch?v=8u2pW2zZLCs
In this special episode of the Eye on AI podcast, Sepp Hochreiter, the inventor of Long Short-Term Memory (LSTM) networks, joins Craig Smith to discuss the profound impact of LSTMs on artificial intelligence, from language models to real-time robotics. Sepp reflects on the early days of LSTM development, sharing insights into his collaboration with Jürgen Schmidhuber and the challenges they faced in gaining recognition for their groundbreaking work. He explains how LSTMs became the foundation for technologies used by giants like Amazon, Apple, and Google, and how they paved the way for modern advancements like transformers.

Topics include:
- The origin story of LSTMs and their unique architecture.
- Why LSTMs were crucial for sequence data like speech and text.
- The rise of transformers and how they compare to LSTMs.
- Real-time robotics: using LSTMs to build energy-efficient, autonomous systems.
- The next big challenges for AI and robotics in the era of generative AI.

Sepp also shares his optimistic vision for the future of AI, emphasizing the importance of efficient, scalable models and their potential to revolutionize industries from healthcare to autonomous vehicles. Don't miss this deep dive into the history and future of AI, featuring one of its most influential pioneers.

(00:00) Introduction: Meet Sepp Hochreiter
(01:10) The Origins of LSTMs
(02:26) Understanding the Vanishing Gradient Problem
(05:12) Memory Cells and LSTM Architecture
(06:35) Early Applications of LSTMs in Technology
(09:38) How Transformers Differ from LSTMs
(13:38) Exploring XLSTM for Industrial Applications
(15:17) AI for Robotics and Real-Time Systems
(18:55) Expanding LSTM Memory with Hopfield Networks
(21:18) The Road to XLSTM Development
(23:17) Industrial Use Cases of XLSTM
(27:49) AI in Simulation: A New Frontier
(32:26) The Future of LSTMs and Scalability
(35:48) Inference Efficiency and Potential Applications
(39:53) Continuous Learning and Adaptability in AI
(42:59) Training Robots with XLSTM Technology
(44:47) NXAI: Advancing AI in Industry
Applications for the 2025 AI Engineer Summit are up, and you can save the date for AIE Singapore in April and AIE World's Fair 2025 in June. Happy new year, and thanks for 100 great episodes! Please let us know what you want to see/hear for the next 100!

Full YouTube Episode with Slides/Charts
Like and subscribe and hit that bell to get notifs!

Timestamps:
* 00:00 Welcome to the 100th Episode!
* 00:19 Reflecting on the Journey
* 00:47 AI Engineering: The Rise and Impact
* 03:15 Latent Space Live and AI Conferences
* 09:44 The Competitive AI Landscape
* 21:45 Synthetic Data and Future Trends
* 35:53 Creative Writing with AI
* 36:12 Legal and Ethical Issues in AI
* 38:18 The Data War: GPU Poor vs. GPU Rich
* 39:12 The Rise of GPU Ultra Rich
* 40:47 Emerging Trends in AI Models
* 45:31 The Multi-Modality War
* 01:05:31 The Future of AI Benchmarks
* 01:13:17 Pionote and Frontier Models
* 01:13:47 Niche Models and Base Models
* 01:14:30 State Space Models and RWKV
* 01:15:48 Inference Race and Price Wars
* 01:22:16 Major AI Themes of the Year
* 01:22:48 AI Rewind: January to March
* 01:26:42 AI Rewind: April to June
* 01:33:12 AI Rewind: July to September
* 01:34:59 AI Rewind: October to December
* 01:39:53 Year-End Reflections and Predictions

Transcript

[00:00:00] Welcome to the 100th Episode!
[00:00:00] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host swyx for the 100th time today.
[00:00:12] swyx: Yay, um, and we're so glad that, yeah, you know, everyone has, uh, followed us in this journey. How do you feel about it? 100 episodes.
[00:00:19] Alessio: Yeah, I know.
[00:00:19] Reflecting on the Journey
[00:00:19] Alessio: Almost two years that we've been doing this. We've had four different studios. Uh, we've had a lot of changes. You know, we used to do this lightning round. When we first started that we didn't like, and we tried to change the question. The answer
[00:00:32] swyx: was Cursor and Perplexity.
[00:00:34] Alessio: Yeah, I love Midjourney. It's like, do you really not like anything else?
[00:00:38] Alessio: Like what's, what's the unique thing? And I think, yeah, we, we've also had a lot more research driven content. You know, we had like Tri Dao, we had, you know, Jeremy Howard, we had more folks like that.
[00:00:47] AI Engineering: The Rise and Impact
[00:00:47] Alessio: I think we want to do more of that too in the new year, like having, uh, some of the Gemini folks, both on the research and the applied side.
[00:00:54] Alessio: Yeah, but it's been a ton of fun. I think we both started, I wouldn't say as a joke, we were kind of like, Oh, we [00:01:00] should do a podcast. And I think we kind of caught the right wave, obviously. And I think your rise of the AI engineer posts just kind of gave people somewhere to congregate, and then the AI engineer summit.
[00:01:11] Alessio: And that's why when I look at our growth chart, it's kind of like a proxy for like the AI engineering industry as a whole, which is almost like, like, even if we don't do that much, we keep growing just because there's so many more AI engineers. So did you expect that growth or did you expect that would take longer for like the AI engineer thing to kind of like become, you know, everybody talks about it today.
[00:01:32] swyx: So, the sign of that, that we have won is that Gartner puts it at the top of the hype curve right now. So Gartner has called the peak in AI engineering. I did not expect, um, to what level.
I knew that I was correct when I called it because I did like two months of work going into that. But I didn't know, you know, how quickly it could happen, and obviously there's a chance that I could be wrong.
[00:01:52] swyx: But I think, like, most people have come around to that concept. Hacker News hates it, which is a good sign. But there's enough people that have defined it, you know, GitHub, when [00:02:00] they launched GitHub Models, which is the Hugging Face clone, they put AI engineers in the banner, like, above the fold, like, in big, so I think it's like kind of arrived as a meaningful and useful definition.
[00:02:12] swyx: I think people are trying to figure out where the boundaries are. I think that was a lot of the quote unquote drama that happens behind the scenes at the World's Fair in June. Because I think there's a lot of doubt or questions about where ML engineering stops and AI engineering starts. That's a useful debate to be had.
[00:02:29] swyx: In some sense, I actually anticipated that as well. So I intentionally did not put a firm definition there because most of the successful definitions are necessarily underspecified and it's actually useful to have different perspectives and you don't have to specify everything from the outset.
[00:02:45] Alessio: Yeah, I was at um, AWS reInvent and the line to get into like the AI engineering talk, so to speak, which is, you know, applied AI and whatnot was like, there are like hundreds of people just in line to go in.
[00:02:56] Alessio: I think that's kind of what enabled people, right? Which is what [00:03:00] you kind of talked about. It's like, Hey, look, you don't actually need a PhD, just, yeah, just use the model. And then maybe we'll talk about some of the blind spots that you get as an engineer with the earlier posts that we also had on the Substack.
[00:03:11] Alessio: But yeah, it's been a heck of a heck of a two years.
[00:03:14] swyx: Yeah.
[00:03:15] Latent Space Live and AI Conferences
[00:03:15] swyx: You know, I was, I was trying to view the conference as like, so NeurIPS is I think like 16, 17,000 people. And the Latent Space Live event that we held there was 950 signups, I think. The AI world, the ML world is still very much research heavy. And that's as it should be because ML is very much in a research phase.
[00:03:34] swyx: But as we move this entire field into production, I think that ratio inverts into becoming more engineering heavy. So at least I think engineering should be on the same level, even if it's never as prestigious, like it'll always be low status because at the end of the day, you're manipulating APIs or whatever.
[00:03:51] swyx: But yeah, wrapping GPTs, but there's going to be an increasing stack and an art to doing these, these things well. And I, you know, I [00:04:00] think that's what we're focusing on for the podcast, the conference and basically everything I do seems to make sense. And I think we'll, we'll talk about the trends here that apply.
[00:04:09] swyx: It's, it's just very strange. So, like, there's a mix of, like, keeping on top of research while not being a researcher and then putting that research into production. So, like, people always ask me, like, why are you covering NeurIPS?
Like, this is a ML research conference and I'm like, well, yeah, I mean, we're not going to, to like, understand everything or reproduce every single paper, but the stuff that is being found here is going to make it through into production at some point, you hope.
[00:04:32] swyx: And then actually like when I talk to the researchers, they actually get very excited because they're like, oh, you guys are actually caring about how this goes into production and that's what they really really want. The measure of success is previously just peer review, right? Getting 7s and 8s on their um, academic review conferences and stuff like citations is one metric, but money is a better metric.
[00:04:51] Alessio: Money is a better metric. Yeah, and there were about 2200 people on the live stream or something like that. Yeah, yeah. Twenty-two hundred on the live stream. So [00:05:00] I try my best to moderate, but it was a lot spicier in person with Jonathan and, and Dylan than it was in the chat on YouTube.
[00:05:06] swyx: I would say that I actually also created
[00:05:09] swyx: Latent Space Live in order to address flaws that are perceived in academic conferences. This is not NeurIPS specific, it's ICML, NeurIPS. Basically, it's very sort of oriented towards the PhD student, uh, market, job market, right? Like literally all, basically everyone's there to advertise their research and skills and get jobs.
[00:05:28] swyx: And then obviously all the, the companies go there to hire them. And I think that's great for the individual researchers, but for people going there to get info is not great because you have to read between the lines, bring a ton of context in order to understand every single paper. So what is missing is effectively what I ended up doing, which is domain by domain, go through and recap the best of the year.
[00:05:48] swyx: Survey the field. And there are, like NeurIPS had a, uh, I think ICML had a like a position paper track, NeurIPS added a benchmarks, uh, datasets track. These are ways in which to address that [00:06:00] issue. Uh, there's always workshops as well. Every, every conference has, you know, a last day of workshops and stuff that provide more of an overview.
[00:06:06] swyx: But they're not specifically prompted to do so. And I think really, uh, organizing a conference is just about getting good speakers and giving them the correct prompts. And then they will just go and do that thing and they do a very good job of it. So I think Sarah did a fantastic job with the startups prompt.
[00:06:21] swyx: I can't list everybody, but we did best of 2024 in startups, vision, open models, post-transformers, synthetic data, small models, and agents. And then the last one was the, uh, and then we also did a quick one on reasoning with Nathan Lambert. And then the last one, obviously, was the debate that people were very hyped about.
[00:06:39] swyx: It was very awkward. And I'm really, really thankful for Jonathan Frankle, basically, who stepped up to challenge Dylan. Because Dylan was like, yeah, I'll do it. But he was pro scaling. And I think everyone who is like in AI is pro scaling, right? So you need somebody who's ready to publicly say, no, we've hit a wall.
[00:06:57] swyx: So that means you're saying Sam Altman's wrong. [00:07:00] You're saying, um, you know, everyone else is wrong. It helps that this was the day before Ilya went on, went up on stage and then said pre training has hit a wall. And data has hit a wall.
So actually Jonathan ended up winning, and then Ilya supported that statement, and then Noam Brown on the last day further supported that statement as well.
[00:07:17] swyx: So it's kind of interesting that I think the consensus kind of going in was that we're not done scaling, like you should believe in the bitter lesson. And then, four straight days in a row, you had Sepp Hochreiter, who is the creator of the LSTM, along with everyone's favorite OG in AI, which is Juergen Schmidhuber.
[00:07:34] swyx: He said that, um, we're pre-training into a wall, or like, we've run into a different kind of wall. And then we have, you know, Jonathan Frankle, Ilya, and then Noam Brown all saying variations of the same thing, that we have hit some kind of wall in the status quo of what pre-trained, scaling large pre-trained models has looked like, and we need a new thing.
[00:07:54] swyx: And obviously the new thing for people is some, make, either people are calling it inference time compute or test time [00:08:00] compute. I think the collective terminology has been inference time, and I think that makes sense because test time, calling it test, meaning, has a very pre-trained bias, meaning that the only reason for running inference at all is to test your model.
[00:08:11] swyx: That is not true. Right. Yeah. So, so, I quite agree that OpenAI seems to have adopted, or the community seems to have adopted this terminology of ITC instead of TTC. And that, that makes a lot of sense because like now we care about inference, even right down to compute optimality. Like I actually interviewed this author who recovered or reviewed the Chinchilla paper.
[00:08:31] swyx: Chinchilla paper is compute optimal training, but what is not stated in there is it's pre-trained compute optimal training. And once you start caring about inference, compute optimal training, you have a different scaling law. And in a way that we did not know last year.
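As a hedged, back-of-the-envelope illustration of this point (not from the episode, and not any paper's exact analysis): using the Chinchilla parametric loss fit L(N, D) = E + A/N^alpha + B/D^beta, you can ask how many training tokens a model of size N needs to reach a target loss, then add an inference term and watch the lifetime-cost-minimizing model size shrink as expected inference volume grows. The constants below are approximately the published Chinchilla fit, and the 6ND training / 2N-per-token inference FLOP rules are standard approximations; treat everything as a sketch.

```python
# Sketch: how expected inference volume shifts the "compute-optimal" model size.
# Constants are roughly the Chinchilla parametric fit; all numbers are illustrative.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_for_loss(N, target_loss):
    """Training tokens D needed for an N-parameter model to reach target_loss (if reachable)."""
    residual = target_loss - E - A / N**alpha
    if residual <= 0:
        return float("inf")  # this N cannot reach the target loss at any D
    return (B / residual) ** (1 / beta)

def lifetime_flops(N, target_loss, inference_tokens):
    """~6*N*D FLOPs for training plus ~2*N FLOPs per token generated at inference."""
    D = tokens_for_loss(N, target_loss)
    return 6 * N * D + 2 * N * inference_tokens

target = 2.0  # arbitrary target loss for illustration
sizes = [10 ** (x / 10) for x in range(90, 121)]  # 1e9 .. 1e12 parameters
for inference_tokens in (0.0, 1e12, 1e14):  # from "never served" to "heavily served"
    best_N = min(sizes, key=lambda N: lifetime_flops(N, target, inference_tokens))
    print(f"inference tokens {inference_tokens:.0e}: cheapest model ~ {best_N:.2e} params")
```

Run under these assumptions, the optimum stays large when the model is never served and drifts toward smaller models trained on more tokens as the expected inference volume grows, which is the "different scaling law" being described.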
[00:08:45] Alessio: I wonder, because John is, he's also on the side of attention is all you need.
[00:08:49] Alessio: Like he had the bet with Sasha. So I'm curious, like he doesn't believe in scaling, but he thinks the transformer, I wonder if he's still. So, so,
[00:08:56] swyx: so he, obviously everything is nuanced and you know, I told him to play a character [00:09:00] for this debate, right? So he actually does. Yeah. He still, he still believes that we can scale more.
[00:09:04] swyx: Uh, he just assumed the character to be very game for, for playing this debate. So even more kudos to him that he assumed a position that he didn't believe in and still won the debate.
[00:09:16] Alessio: Get rekt, Dylan. Um, do you just want to quickly run through some of these things? Like, uh, Sarah's presentation, just the highlights.
[00:09:24] swyx: Yeah, we can't go through everyone's slides, but I pulled out some things as a factor of, like, stuff that we were going to talk about. And we'll
[00:09:30] Alessio: publish
[00:09:31] swyx: the rest. Yeah, we'll publish on this feed the best of 2024 in those domains. And hopefully people can benefit from the work that our speakers have done.
[00:09:39] swyx: But I think it's, uh, these are just good slides. And I've been, I've been looking for a sort of end of year recaps from, from people.
[00:09:44] The Competitive AI Landscape
[00:09:44] swyx: The field has progressed a lot. You know, I think the max ELO in 2023 on LMSys used to be 1200 for LMSys ELOs. And now everyone is at least at, uh, 1275 in their ELOs, and this is across Gemini, ChatGPT, [00:10:00] Grok,
[00:10:01] swyx: 01.ai, with their Yi-Large model, and Anthropic, of course. It's a very, very competitive race. There are multiple frontier labs all racing, but there is a clear tier zero frontier. And then there's like a tier one. It's like, I wish I had everything else. Tier zero is extremely competitive. It's effectively now a three horse race between Gemini, uh, Anthropic and OpenAI.
[00:10:21] swyx: I would say that people are still holding out a candle for xAI. xAI, I think, for some reason, because their API was very slow to roll out, is not included in these metrics. So it's actually quite hard to put on there. As someone who also does charts, xAI is continually snubbed because they don't work well with the benchmarking people.
[00:10:42] swyx: Yeah, yeah, yeah. It's a little trivia for why xAI always gets ignored. The other thing is market share. So these are slides from Sarah. We have it up on the screen. It has gone from very heavily OpenAI. So we have some numbers and estimates. These are from Ramp. Estimates of OpenAI market share in [00:11:00] December 2023.
[00:11:01] swyx: And this is basically, what is it, GPT being 95 percent of production traffic. And I think if you correlate that with stuff that we asked Harrison Chase on the LangChain episode, it was true. And then Claude 3 launched in the middle of this year. I think Claude 3 launched in March, Claude 3.5 Sonnet was in June ish.
[00:11:23] swyx: And you can start seeing the market share shift towards, uh, towards Anthropic, uh, very, very aggressively. The more recent one is Gemini. So if I scroll down a little bit, this is an even more recent dataset. So Ramp's dataset ends in September 2024. Gemini has basically launched a price war at the low end, uh, with Gemini Flash, uh, being basically free for personal use.
[00:11:44] swyx: Like, I think people don't understand the free tier. It's something like a billion tokens per day. Unless you're trying to abuse it, you cannot really exhaust your free tier on Gemini. They're really trying to get you to use it. They know they're in like third place, um, fourth place, depending how you, how you count.
[00:11:58] swyx: And so they're going after [00:12:00] the lower tier first, and then, you know, maybe the upper tier later, but yeah, Gemini Flash, according to OpenRouter, is now 50 percent of their OpenRouter requests. Obviously, these are the small requests. These are small, cheap requests that are mathematically going to be more.
[00:12:15] swyx: The smart ones obviously are still going to OpenAI. But, you know, it's a very, very big shift in the market. Like basically 2023, 2022, to going into 2024, OpenAI has gone from 95 market share to, yeah, reasonably somewhere between 50 to 75 market share.
[00:12:29] Alessio: Yeah. I'm really curious how Ramp does the attribution to the model?
[00:12:32] Alessio: If it's API, because I think it's all credit card spend. Well, but it's all, the credit card doesn't say maybe. Maybe the, maybe when they do expenses, they upload the PDF, but yeah, the, the Gemini one I think makes sense. I think that was one of my main 2024 takeaways that like, the best small model companies are the large labs, which is not something I would have thought that the open source kind of like long tail would be like the small model.
Like so small model here for Gemini is 8B, [00:13:00] right? Uh, mini. We don't know what the small model size is, but yeah, it's probably in the double digits or maybe single digits, but probably double digits. The open source community has kind of focused on the one to three B size.
[00:13:11] swyx: Mm-hmm. Yeah. Maybe
[00:13:12] swyx: zero, maybe 0.5B, uh, that's Moondream and that is small for you then, then that's great. It makes sense that we, we have a range for small now, which is like, may, maybe one to five B. Yeah. I'll even put that at, at, at the high end. And so this includes Gemma from Gemini as well. But also includes the Apple Foundation models, which I think Apple Foundation is 3B.
[00:13:32] Alessio: Yeah. No, that's great. I mean, I think in the start small just meant cheap. I think today small is actually a more nuanced discussion, you know, that people weren't really having before.
[00:13:43] swyx: Yeah, we can keep going. This is a slide that I slightly disagree with Sarah. She's pointing to the Scale SEAL leaderboard. I think the researchers that I talked with at NeurIPS were kind of positive on this because basically you need private test [00:14:00] sets to prevent contamination.
[00:14:02] swyx: And Scale is one of maybe three or four people this year that has really made an effort in doing a credible private test set leaderboard. Llama 405B does well compared to Gemini and GPT-4o. And I think that's good. I would say that, you know, it's good to have an open model that is that big, that does well on those metrics.
[00:14:23] swyx: But anyone putting 405B in production will tell you, if you scroll down a little bit to the Artificial Analysis numbers, that it is very slow and very expensive to infer. Um, it doesn't even fit on like one node of, uh, of H100s. Cerebras will be happy to tell you they can serve 405B on their super large chips.
[00:14:42] swyx: But, um, you know, if you need to do anything custom to it, you're still kind of constrained. So, is 405B really that relevant? Like, I think most people are basically saying that they only use 405B as a teacher model to distill down to something. Even Meta is doing it. So with Llama 3.3 [00:15:00] launched, they only launched the 70B because they use 405B to distill the 70B.
[00:15:03] swyx: So I don't know if like open source is keeping up. I think the, the open source industrial complex is very invested in telling you that the gap is narrowing. I kind of disagree. I think that the gap is widening with O1. I think there are very, very smart people trying to narrow that gap and they should.
[00:15:22] swyx: I really wish them success, but you cannot use a chart that is nearing 100 in your saturation chart. And look, the distance between open source and closed source is narrowing. Of course it's going to narrow because you're near 100. This is stupid. But in metrics that matter, is open source narrowing?
[00:15:38] swyx: Probably not for O1 for a while. And it's really up to the open source guys to figure out if they can match O1 or not.
[00:15:46] Alessio: I think inference time compute is bad for open source just because, you know, Zuck can donate the flops at training time, but he cannot donate the flops at inference time. So it's really hard to like actually keep up on that axis.
[00:15:59] Alessio: Big, big business [00:16:00] model shift. So I don't know what that means for the GPU clouds. I don't know what that means for the hyperscalers, but obviously the big labs have a lot of advantage.
Because, like, it's not a static artifact that you're putting the compute in. You're kind of doing that still, but then you're putting a lot of compute at inference too.
[00:16:17] swyx: Yeah, yeah, yeah. Um, I mean, Llama 4 will be reasoning oriented. We talked with Thomas Scialom. Um, kudos for getting that episode together. That was really nice. Good, well timed. Actually, I connected with the AI at Meta guy, uh, at NeurIPS, and, um, yeah, we're going to coordinate something for Llama 4. Yeah, yeah,
[00:16:32] Alessio: and our friend, yeah.
[00:16:33] Alessio: Clara Shih just joined to lead the business agent side. So I'm sure we'll have her on in the new year.
[00:16:39] swyx: Yeah. So, um, my comment on, on the business model shift, this is super interesting. Apparently it is wide knowledge that OpenAI wanted more than 6.6 billion dollars for their fundraise. They wanted to raise, you know, higher, and they did not.
[00:16:51] swyx: And what that means is basically like, it's very convenient that we're not getting GPT-5, which would have been a larger pre train. We should have a lot of upfront money. And [00:17:00] instead we're, we're converting fixed costs into variable costs, right. And passing it on effectively to the customer. And it's so much easier to take margin there because you can directly attribute it to like, Oh, you're using this more.
[00:17:12] swyx: Therefore you, you pay more of the cost and I'll just slap a margin in there. So like that lets you control your growth margin and like tie your, your spend, or your sort of inference spend, accordingly. And it's just really interesting to, that this change in the sort of inference paradigm has arrived exactly at the same time that the funding environment for pre training is effectively drying up, kind of.
[00:17:36] swyx: I feel like maybe the VCs are very in tune with research anyway, so like, they would have noticed this, but, um, it's just interesting.
[00:17:43] Alessio: Yeah, and I was looking back at our yearly recap of last year. Yeah. And the big thing was like the Mixtral price fights, you know, and I think now it's almost like there's nowhere to go, like, you know, Gemini Flash is like basically giving it away for free.
[00:17:55] Alessio: So I think this is a good way for the labs to generate more revenue and pass down [00:18:00] some of the compute to the customer. I think they're going to
[00:18:02] swyx: keep going. I think that 2, will come.
[00:18:05] Alessio: Yeah, I know. Totally. I mean, next year, the first thing I'm doing is signing up for Devin. Signing up for the Pro ChatGPT.
[00:18:12] Alessio: Just to try. I just want to see what does it look like to spend a thousand dollars a month on AI?
[00:18:17] swyx: Yes. Yes. I think if your, if your, your job is a, at least AI content creator or VC or, you know, someone who, whose job it is to stay on, stay on top of things, you should already be spending like a thousand dollars a month on, on stuff.
[00:18:28] swyx: And then obviously easy to spend, hard to use. You have to actually use. The good thing is that actually Google lets you do a lot of stuff for free now. So like Deep Research, that they just launched, uses a ton of inference and it's, it's free while it's in preview.
[00:18:45] Alessio: Yeah. They need to put that in Lindy.
[00:18:47] Alessio: I've been using Lindy lately. I've built a bunch of things once we had flows because I liked the new thing. It's pretty good. I even did a phone call assistant. Um, yeah, they just launched Lindy voice.
Yeah, I think once [00:19:00] they get advanced voice mode like capability, today, still like speech to text, you can kind of tell.
[00:19:06] Alessio: Um, but it's good for like reservations and things like that. So I have a meeting prepper thing. And so
[00:19:13] swyx: it's good. Okay. I feel like we've, we've covered a lot of stuff. Uh, I, yeah, I, you know, I think we will go over the individual, uh, talks in a separate episode. Uh, I don't want to take too much time with, uh, this stuff, but that suffice to say that there is a lot of progress in each field.
[00:19:28] swyx: Uh, we covered vision. Basically this is all like the audience voting for what they wanted. And then I just invited the best people I could find in each audience, especially agents. Um, Graham, who I talked to at ICML in Vienna, he is currently still number one. It's very hard to stay on top of SWE-Bench.
[00:19:45] swyx: OpenHands is currently still number one on SWE-Bench Full, which is the hardest one. He had very good thoughts on agents, which I, which I'll highlight for people. Everyone is saying 2025 is the year of agents, just like they said last year. And, uh, but he had [00:20:00] thoughts on like eight parts of what are the frontier problems to solve in agents.
[00:20:03] swyx: And so I'll highlight that talk as well.
[00:20:05] Alessio: Yeah. The number six, which is the "how can agents learn more about the environment," has been super interesting to us as well, just to think through, because, yeah, how do you put an agent in an enterprise where most things in an enterprise have never been public, you know, a lot of the tooling, like the code bases and things like that.
[00:20:23] Alessio: So, yeah, there's no indexing and RAG. Well, yeah, but it's more like, you can't really RAG things that are not documented. But people know them based on how they've been doing it. You know, so I think there's almost this like, you know, oh, institutional knowledge. Yeah, the boring word is kind of like a business process extraction.
[00:20:38] Alessio: Yeah yeah, I see. It's like, how do you actually understand how these things are done? I see. Um, and I think today the, the problem is that, yeah, the agents that most people are building are good at following instruction, but are not as good as like extracting them from you. Um, so I think that will be a big unlock. Just to touch quickly on the Jeff Dean thing.
[00:20:55] Alessio: I thought it was pretty, I mean, we'll link it in the, in the things, but I think the main [00:21:00] focus was like, how do you use ML to optimize the systems instead of just focusing on ML to do something else? Yeah, I think speculative decoding, we had, you know, Eugene from RWKV on the podcast before, like he's doing a lot of that with Featherless AI.
[00:21:12] swyx: Everyone is. I would say it's the norm. I'm a little bit uncomfortable with how much it costs, because it does use more of the GPU per call. But because everyone is so keen on fast inference, then yeah, makes sense.
[00:21:24] Alessio: Exactly. Um, yeah, but we'll link that. Obviously Jeff is great.
[00:21:30] swyx: Jeff is, Jeff's talk was more, it wasn't focused on Gemini.
[00:21:33] swyx: I think people got the wrong impression from my tweet. It's more about how Google approaches ML and uses ML to design systems and then systems feedback into ML.
And I think this ties in with Loubna's talk
[00:21:45] Synthetic Data and Future Trends
[00:21:45] swyx: on synthetic data where it's basically the story of bootstrapping of humans and AI in AI research or AI in production.
[00:21:53] swyx: So her talk was on synthetic data, where like how much synthetic data has grown in 2024 in the pre training side, the post training side, [00:22:00] and the eval side. And I think Jeff then also extended it basically to chips, uh, to chip design. So he'd spend a lot of time talking about AlphaChip. And most of us in the audience are like, we're not working on hardware, man.
[00:22:11] swyx: Like you guys are great. TPU is great. Okay. We'll buy TPUs.
[00:22:14] Alessio: And then there was the earlier talk. Yeah. But, and then we have, uh, I don't know if we're calling them essays. What are we calling these? But
[00:22:23] swyx: for me, it's just like bonus for Latent Space supporters, because I feel like they haven't been getting anything.
[00:22:29] swyx: And then I wanted a more high frequency way to write stuff. Like that one I wrote in an afternoon. I think basically we now have an answer to what Ilya saw. It's one year since, the blip. And we know what he saw in 2014. We know what he saw in 2024. We think we know what he sees in 2024. He gave some hints and then we have vague indications of what he saw in 2023.
[00:22:54] swyx: So that was the, oh, and then 2016 as well, because of this lawsuit with Elon, OpenAI [00:23:00] is publishing emails from Sam's, like, his personal text messages to Shivon Zilis, or whatever. So, like, we have emails from Ilya saying, this is what we're seeing in OpenAI, and this is why we need to scale up GPUs. And I think it's very prescient in 2016 to write that.
[00:23:16] swyx: And so, like, it is exactly, like, basically his insights. It's him and Greg, basically just kind of driving the scaling up of OpenAI, while they're still playing Dota. They're like, no, like, we see the path here.
[00:23:30] Alessio: Yeah, and it's funny, yeah, they even mention, you know, we can only train on 1v1 Dota. We need to train on 5v5, and that takes too many GPUs.
[00:23:37] Alessio: Yeah,
[00:23:37] swyx: and at least for me, I can speak for myself, like, I didn't see the path from Dota to where we are today. I think even, maybe if you ask them, like, they wouldn't necessarily draw a straight line. Yeah,
[00:23:47] Alessio: no, definitely. But I think like that was like the whole idea of almost like the RL and we talked about this with Nathan on his podcast.
[00:23:55] Alessio: It's like with RL, you can get very good at specific things, but then you can't really like generalize as much. And I [00:24:00] think the language models are like the opposite, which is like, you're going to throw all this data at them and scale them up, but then you really need to drive them home on a specific task later on.
[00:24:08] Alessio: And we'll talk about the OpenAI reinforcement fine tuning, um, announcement too, and all of that. But yeah, I think like scale is all you need. That's kind of what Ilya will be remembered for. And I think just maybe to clarify on like the pre training is over thing that people love to tweet. I think the point of the talk was like everybody, we're scaling these chips, we're scaling the compute, but like the second ingredient which is data is not scaling at the same rate.
[00:24:35] Alessio: So it's not necessarily pre training is over. It's kind of like, what got us here won't get us there.
In his email, he predicted like 10x growth every two years or something like that. And I think maybe now it's like, you know, you can 10x the chips again, but
[00:24:49] swyx: I think it's 10x per year. Was it? I don't know.
[00:24:52] Alessio: Exactly. And Moore's law is like 2x. So it's like, you know, much faster than that. And yeah, I like the fossil fuel of AI [00:25:00] analogy. It's kind of like, you know, the little background tokens thing. So the OpenAI reinforcement fine tuning is basically like, instead of fine tuning on data, you fine tune on a reward model.
[00:25:09] Alessio: So it's basically like, instead of being data driven, it's like task driven. And I think people have tasks to do, they don't really have a lot of data. So I'm curious to see how that changes, how many people fine tune, because I think this is what people run into. It's like, oh, you can fine tune Llama. And it's like, okay, where do I get the data
[00:25:27] Alessio: to fine tune it on, you know, so it's great that we're moving the thing. And then I really like he had this chart where like, you know, the brain mass and the body mass thing is basically like mammals that scaled linearly by brain and body size, and then humans kind of like broke off the slope. So it's almost like maybe the mammal slope is like the pre training slope.
[00:25:46] Alessio: And then the post training slope is like the, the human one.
[00:25:49] swyx: Yeah. I wonder what the. I mean, we'll know in 10 years, but I wonder what the y axis is for, for Ilya's SSI. We'll try to get them on.
[00:25:57] Alessio: Ilya, if you're listening, you're [00:26:00] welcome here. Yeah, and then he had, you know, what comes next, like agent, synthetic data, inference, compute, I thought all of that was like that.
[00:26:05] Alessio: I don't
[00:26:05] swyx: think he was dropping any alpha there. Yeah, yeah, yeah.
[00:26:07] Alessio: Yeah. Any other NeurIPS highlights?
[00:26:10] swyx: I think that there was comparatively a lot more work. Oh, by the way, I need to plug that, uh, my friend Yi made this, like, little nice paper. Yeah, that was really
[00:26:20] swyx: nice.
[00:26:20] swyx: Uh, of, uh, of, like, all the, he's, she called it must read papers of 2024.
[00:26:26] swyx: So I laid out some of these at NeurIPS, and it was just gone. Like, everyone just picked it up. Because people are dying for, like, little guidance and visualizations. And so, uh, I thought it was really super nice that we got there.
[00:26:38] Alessio: Should we do a Latent Space book for each year? Uh, I thought about it. For each year we should.
[00:26:42] Alessio: Coffee table book. Yeah. Yeah. Okay. Put it in the will. Hi, Will. By the way, we haven't introduced you. He's our new, you know, general organist, Jamie. You need to
[00:26:52] swyx: pull up more things. One thing I saw that, uh, okay, one fun one, and then one [00:27:00] more general one. So the fun one is this paper on agent collusion. This is a paper on steganography.
[00:27:06] swyx: This is secret collusion among AI agents, multi agent deception via steganography. I tried to go to NeurIPS in order to find these kinds of papers because the real reason, like NeurIPS this year has a lottery system. A lot of people actually even go and don't buy tickets because they just go and attend the side events.
[00:27:22] swyx: And then also the people who go and end up crowding around the most popular papers, which you already know and already read them before you showed up to NeurIPS.
So the only reason you go there is to talk to the paper authors, but there's like something like 10,000 other, all these papers out there that, you know, are just people's work that they, that they did on the air and they failed to get attention for one reason or another.
[00:27:42] swyx: And this was one of them. Uh, it was like all the way at the back. And this is a DeepMind paper that actually focuses on collusion between AI agents, uh, by hiding messages in the text that they generate. Uh, so that's what steganography is. So a very simple example would be the first letter of every word.
[00:27:57] swyx: If you pick that out, you know, and the code sends a [00:28:00] different message than that. But something I've always emphasized is to LLMs, we read left to right. LLMs can read up, down, sideways, you know, in random character order. And it's the same to them as it is to us. So if we were ever to get, you know, self-motivated, unaligned LLMs that are trying to collaborate to take over the planet,
[00:28:19] swyx: this would be how they do it. They spread messages among us in the messages that we generate. And he developed a scaling law for that. So he marked, I'm showing it on screen right now, the emergence of this phenomenon. Basically, for example, for cipher encoding, GPT-2, Llama 2, Mixtral, GPT-3.5 have zero capabilities, and suddenly GPT-4 does.
[00:28:40] swyx: And this is the kind of Jason Wei type emergence properties that people kind of look for. I think what made this paper stand out as well, so he developed the benchmark for steganography collusion, and he also focused on Schelling point collusion, which is very low coordination. For agreeing on an encoding/decoding format, you kind of need to have some [00:29:00] agreement on that.
[00:29:00] swyx: But, but Schelling point means like very, very low or almost no coordination. So for example, if I, if I ask someone, if the only message I give you is meet me in New York and you're not aware of where or when, you would probably meet me at Grand Central Station. That is, Grand Central Station is a Schelling point.
[00:29:16] swyx: And it's probably somewhere, somewhere during the day. That is, the Schelling point of New York is Grand Central. To that extent, Schelling points for steganography are things like the, the, the common decoding methods that we talked about. It will be interesting at some point in the future when we are worried about alignment.
[00:29:30] swyx: It is not interesting today, but it's interesting that DeepMind is already thinking about this.
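To make the toy scheme mentioned above concrete, here is a minimal Python sketch of the "first letter of every word" acrostic. This is just the simple example from the conversation, not the DeepMind paper's actual method or benchmark; the word pool is a made-up illustration.

```python
# Illustrative acrostic steganography: hide a message in the initials of a cover text.
# Not the paper's method; just the toy "first letter of every word" example.

def encode(secret: str, word_pool: dict) -> str:
    """Compose a cover sentence whose word initials spell the secret."""
    return " ".join(word_pool[ch][0] for ch in secret)

def decode(cover_text: str) -> str:
    """Recover the hidden message by reading the first letter of every word."""
    return "".join(word[0] for word in cover_text.split())

# A tiny, hypothetical pool of innocuous-looking words keyed by initial letter.
pool = {
    "h": ["hopefully"], "i": ["interesting"], "d": ["developments"],
    "e": ["emerge"], "n": ["nearby"],
}

cover = encode("hidden", pool)
print(cover)          # hopefully interesting developments developments emerge nearby
print(decode(cover))  # hidden
```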
[00:29:36] Alessio: I think that's like one of the hardest things about NeurIPS. It's like the long tail. I
[00:29:41] swyx: found a pricing guy. I'm going to feature him on the podcast. Basically, this guy from NVIDIA worked out the optimal pricing for language models.
[00:29:51] swyx: It's basically an econometrics paper at NeurIPS, where everyone else is talking about GPUs. And the guy with the GPUs is
[00:29:57] Alessio: talking
[00:29:57] swyx: about economics instead. [00:30:00] That was the sort of fun one. So the focus I saw is that model papers at NeurIPS are kind of dead. No one really presents models anymore. It's just data sets.
[00:30:12] swyx: This is all the grad students are working on. So like there was a data sets track and then I was looking around like, I was like, you don't need a data sets track because every paper is a data sets paper. And so data sets and benchmarks, they're kind of flip sides of the same thing. So yeah. Cool.
Yeah, if you're a grad student, you're GPU poor, you kind of work on that.
[00:30:30] swyx: And then the, the sort of big model that people walk around and pick the ones that they like, and then they use it in their models. And that's, that's kind of how it develops. I, I feel like, um, like, like you didn't last year, you had people like Haotian Liu who worked on LLaVA, which is take Llama and add vision.
[00:30:47] swyx: And then obviously actually I hired him and he added vision to Grok. Now he's the vision Grok guy. This year, I don't think there was any of those.
[00:30:55] Alessio: What were the most popular, like, orals? Last year it was like the [00:31:00] Monarch Mixer, I think, was like the most attended. Yeah, uh, I need to look it up. Yeah, I mean, if nothing comes to mind, that's also kind of like an answer in a way.
[00:31:10] Alessio: But I think last year there was a lot of interest in, like, furthering models and, like, different architectures and all of that.
[00:31:16] swyx: I will say that I felt the orals, oral picks this year were not very good. Either that or maybe it's just a. So that's the highlight of how I have changed in terms of how I view papers.
[00:31:29] swyx: So like, in my estimation, two of the best papers in this year for datasets are DataComp and RefinedWeb or FineWeb. These are two actually industrially used papers, not highlighted for a while. I think DCLM got the spotlight, FineWeb didn't even get the spotlight. So like, it's just that the picks were different.
[00:31:48] swyx: But one thing that does get a lot of play that a lot of people are debating is the role of the schedule. This is the schedule-free optimizer paper from Meta from Aaron Defazio. And this [00:32:00] year in the ML community, there's been a lot of chat about Shampoo, SOAP, all the bathroom amenities for optimizing your learning rates.
[00:32:08] swyx: And, uh, most people at the big labs who I asked about this, um, say that it's cute, but it's not something that matters. I don't know, but it's something that was discussed and very, very popular.
[00:32:19] Alessio: 4 Wars of AI recap maybe, just quickly. Um, where do you want to start? Data?
[00:32:26] swyx: So to remind people, this is the 4 Wars piece that we did as one of our earlier recaps of this year.
[00:32:31] swyx: And the belligerents are on the left, journalists, writers, artists, anyone who owns IP basically, New York Times, Stack Overflow, Reddit, Getty, Sarah Silverman, George RR Martin. Yeah, and I think this year we can add Scarlett Johansson to that side of the fence. So anyone suing OpenAI, basically. I actually wanted to get a snapshot of all the lawsuits.
[00:32:52] swyx: I'm sure some lawyer can do it. That's the data quality war. On the right hand side, we have the synthetic data people, and I think we talked about Loubna's talk, you know, [00:33:00] really showing how much synthetic data has come along this year. I think there was a bit of a fight between scale.ai and the synthetic data community, because scale.ai
[00:33:09] swyx: published a paper saying that synthetic data doesn't work. Surprise, surprise, scale.ai is the leading vendor of non synthetic data.
Only
[00:33:17] Alessio: cage free annotated data is useful.
[00:33:21] swyx: So I think there's some debate going on there, but I don't think it's much debate anymore that at least synthetic data, for the reasons that are addressed in Loubna's talk, makes sense.
[00:33:32] swyx: I don't know if you have any perspectives there.
[00:33:34] Alessio: I think, again, going back to the reinforcement fine tuning, I think that will change a little bit how people think about it. I think today people mostly use synthetic data, yeah, for distillation and kind of like fine tuning a smaller model from like a larger model.
[00:33:46] Alessio: I'm not super aware of how the frontier labs use it outside of like the rephrase-the-web thing that Apple also did. But yeah, I think it'll be useful. I think like whether or not that gets us the big [00:34:00] next step, I think that's maybe like TBD, you know. I think people love talking about data because it's like a GPU poor thing, you know, I think, uh, synthetic data is like something that people can do, you know, so they feel more opinionated about it compared to, yeah, the optimizers stuff, which is like,
[00:34:17] swyx: they don't
[00:34:17] Alessio: really work
[00:34:18] swyx: on.
[00:34:18] swyx: I think that there is an angle to the reasoning synthetic data. So this year, we covered in the paper club, the STaR series of papers. So that's STaR, Q-Star, V-STaR. It basically helps you to synthesize reasoning steps, or at least distill reasoning steps from a verifier. And if you look at the OpenAI RFT API that they released, or that they announced, basically they're asking you to submit graders, or they choose from a preset list of graders.
[00:34:49] swyx: Basically, it feels like a way to create valid synthetic data for them to fine tune their reasoning paths on. Um, so I think that is another angle where it starts to make sense.
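The loop being described, sample candidate reasoning traces, keep only the ones a verifier or grader accepts, and fine-tune on the survivors, can be sketched generically. This is a hedged illustration: `sample_reasoning` and `grader` below are hypothetical stand-ins, not any lab's actual API or the exact procedure from the STaR papers.

```python
# Hedged sketch of grader-filtered synthetic reasoning data (the STaR-style idea above).
# `sample_reasoning` and `grader` are illustrative stand-ins, not a real provider's API.
import random

def sample_reasoning(question: str) -> str:
    """Stand-in for sampling a chain-of-thought plus answer from a model."""
    return f"reasoning for {question!r} -> answer {random.randint(0, 9)}"

def grader(question: str, trace: str, gold_answer: str) -> float:
    """Stand-in grader: 1.0 if the trace ends in the gold answer, else 0.0."""
    return 1.0 if trace.endswith(gold_answer) else 0.0

def build_finetune_set(dataset, samples_per_question=8, threshold=1.0):
    """Keep only traces the grader accepts; these become synthetic training data."""
    kept = []
    for question, gold_answer in dataset:
        for _ in range(samples_per_question):
            trace = sample_reasoning(question)
            if grader(question, trace, gold_answer) >= threshold:
                kept.append({"prompt": question, "completion": trace})
                break  # one accepted trace per question is enough for this sketch
    return kept

data = [("What is 2 + 2?", "4"), ("What is 3 * 3?", "9")]
print(build_finetune_set(data))
```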
And [00:35:00] so like, it's very funny that basically all the data quality wars between, let's say, the music industry or like the newspaper publishing industry or the textbooks industry and the big labs.
[00:35:11] swyx: It's all of the pre training era. And then like the new era, like the reasoning era, like nobody has any problem with all the reasoning, especially because it's all like sort of math and science oriented with, with very reasonable graders. I think the more interesting next step is how does it generalize beyond STEM?
[00:35:27] swyx: We've been using O1 for, and I would say like for summarization and creative writing and instruction following, I think it's underrated. I started using O1 in our intro songs before we killed the intro songs, but it's very good at writing lyrics. You know, I can actually say like, I think one of the O1 pro demos,
[00:35:46] swyx: all of these things that Noam was showing was that, you know, you can write an entire paragraph or three paragraphs without using the letter A, right?
[00:35:53] Creative Writing with AI
[00:35:53] swyx: So like, like literally just anything instead of token, like not even token level, character level manipulation and [00:36:00] counting and instruction following. It's, uh, it's very, very strong.
[00:36:02] swyx: And so no surprises when I ask it to rhyme, uh, and to, to create song lyrics, it's going to do that very much better than in previous models.
So I think it's underrated for creative writing.
[00:36:11] Alessio: Yeah.
[00:36:12] Legal and Ethical Issues in AI
[00:36:12] Alessio: What do you think is the rationale that they're going to have in court when they don't show you the thinking traces of O1, but then they want us to, like, they're getting sued for using other publishers' data, you know, but then on their end, they're like, well, you shouldn't be using my data to then train your model.
[00:36:29] Alessio: So I'm curious to see how that kind of comes. Yeah, I mean, OpenAI has
[00:36:32] swyx: many ways to, to punish people without bringing, taking them to court. Already banned ByteDance for distilling their, their info. And so anyone caught distilling the chain of thought will be just disallowed to continue on, on, on the API.
[00:36:44] swyx: And it's fine. It's no big deal. Like, I don't even think that's an issue at all, just because the chain of thoughts are pretty well hidden. Like you have to work very, very hard to, to get it to leak. And then even when it leaks the chain of thought, you don't know if it's, if it's [00:37:00] The bigger concern is actually that there's not that much IP hiding behind it, that Cosine, which we talked about, we talked to him on Dev Day, can just fine tune 4o
[00:37:13] swyx: to beat o1. Claude Sonnet so far is beating O1 on coding tasks without, at least O1 preview, without being a reasoning model, same for Gemini Pro or Gemini 2.0. So like, how much is reasoning important? How much of a moat is there in this, like, all of these are proprietary sort of training data that they've presumably accomplished.
[00:37:34] swyx: Because even DeepSeek was able to do it. And they had, you know, two months notice to do this, to do R1. So, it's actually unclear how much moat there is. Obviously, you know, if you talk to the Strawberry team, they'll be like, yeah, I mean, we spent the last two years doing this. So, we don't know. And it's going to be interesting because there'll be a lot of noise from people who say they have inference time compute and actually don't because they just have fancy chain of thought.[00:38:00]
[00:38:00] swyx: And then there's other people who actually do have very good chain of thought. And you will not see them on the same level as OpenAI because OpenAI has invested a lot in building up the mythology of their team. Um, which makes sense. Like the real answer is somewhere in between.
[00:38:13] Alessio: Yeah, I think that's kind of like the main data war story developing.
[00:38:18] The Data War: GPU Poor vs. GPU Rich
[00:38:18] Alessio: GPU poor versus GPU rich. Yeah. Where do you think we are? I think there was, again, going back to like the small model thing, there was like a time in which the GPU poor were kind of like the rebel faction working on like these models that were like open and small and cheap. And I think today people don't really care as much about GPUs anymore.
[00:38:37] Alessio: You also see it in the price of the GPUs. Like, you know, that market is kind of like plummeted because there's people don't want to be, they want to be GPU free. They don't even want to be poor. They just want to be, you know, completely without them. Yeah. How do you think about this war? You
[00:38:52] swyx: can tell me about this, but like, I feel like the, the appetite for GPU rich startups, like the, you know, the, the funding plan is we will raise 60 million and [00:39:00] we'll give 50 of that to NVIDIA.
[00:39:01] swyx: That is gone, right?
Like, no one's pitching that. This was literally the plan, the exact plan, of like, I can name like four or five startups, you know, this time last year. So yeah, GPU rich startups gone.[00:39:12] The Rise of GPU Ultra Rich[00:39:12] swyx: But I think the GPU ultra rich, the GPU ultra high net worth, is still going. So now, you know, we had Leopold's essay on the trillion dollar cluster.[00:39:23] swyx: We're not quite there yet. We have multiple labs, you know, xAI very famously, you know, Jensen Huang praising them for being best boy number one in spinning up a 100,000 GPU cluster in like 12 days or something. So likewise at Meta, likewise at OpenAI, likewise at the other labs as well. So the GPU ultra rich are going to keep doing that, because I think partially it's an article of faith now that you just need it.[00:39:46] swyx: Like you don't even know what you're going to use it for. You just need it. And it makes sense, especially if we're going into more researchy territory than we are. So let's say 2020 to 2023 was [00:40:00] "let's scale big models" territory, because we had GPT-3 in 2020 and we were like, okay, we'll go from 175B[00:40:05] swyx: to 1.8T. And that was GPT-3 to GPT-4. Okay, that's done. As far as everyone is concerned, Opus 3.5 is not coming out, GPT-4.5 is not coming out, and Gemini 2, we don't have Pro, whatever. We've hit that wall. Maybe I'll call it the 2 trillion parameter wall. We're not going to 10 trillion. No one thinks it's a good idea, at least from training costs, from the amount of data, or at least the inference.[00:40:36] swyx: Would you pay 10x the price of GPT-4? Probably not. Like, you want something else that is at least more useful. So it makes sense that people are pivoting in terms of their inference paradigm.[00:40:47] Emerging Trends in AI Models[00:40:47] swyx: And so when it's more researchy, then you actually need more just general purpose compute to mess around with, at the exact same time that production deployments of the previous paradigm are still ramping up[00:40:58] swyx: pretty aggressively.[00:40:59] swyx: So [00:41:00] it makes sense that the GPU rich are growing. We have now interviewed both Together and Fireworks and Replicate. We haven't done Anyscale yet. But I think Amazon is maybe kind of a sleeper one, in the sense that at re:Invent I wasn't expecting them to do so well, but they are now a foundation model lab.[00:41:18] swyx: It's kind of interesting. I think, you know, David went over there and started just creating models.[00:41:25] Alessio: Yeah, I mean, that's the power of prepaid contracts. I think a lot of AWS customers, you know, they do these big reserved instance contracts and now they've got to use their money. That's why so many startups[00:41:37] Alessio: get bought through the AWS marketplace, so they can kind of bundle them together and get preferred pricing.[00:41:42] swyx: Okay, so maybe GPU super rich doing very well, GPU middle class dead, and then GPU[00:41:48] Alessio: poor. I mean, my thing is, everybody should just be GPU rich. There shouldn't really be, even the GPU poorest, it's like, does it really make sense to be GPU poor?[00:41:57] Alessio: Like, if you're GPU poor, you should just use the [00:42:00] cloud.
Yes, you know, and I think there might be a future, once we kind of figure out what the size and shape of these models is, where things like the tinybox come to fruition and you can be GPU poor at home. But I think today it's like, why are you working so hard to get these models to run on very small clusters when it's so cheap to run them?[00:42:21] Alessio: Yeah, yeah,[00:42:22] swyx: yeah. I think mostly people think it's cool. People think it's a stepping stone to scaling up. So they aspire to be GPU rich one day and they're working on new methods. Like Nous Research, probably the most deep tech thing they've done this year is DisTrO, or whatever the new name is.[00:42:38] swyx: There's a lot of interest in heterogeneous computing, distributed computing. I tend generally to de-emphasize that historically, but it may be coming to a time where it is starting to be relevant. I don't know. You know, SF Compute launched their compute marketplace this year, and like, who's really using that?[00:42:53] swyx: It's a bunch of small clusters, disparate types of compute, and if you can make that [00:43:00] useful, then that will be very beneficial to the broader community, but maybe still not the source of frontier models. It's just going to be a second tier of compute that is unlocked for people, and that's fine. But yeah, I mean, I think this year I would say a lot more on-device. I now have Apple Intelligence on my phone.[00:43:19] swyx: Doesn't do anything apart from summarize my notifications. But still, not bad. Like, it's multimodal.[00:43:25] Alessio: Yeah, the notification summaries are so-so in my experience.[00:43:29] swyx: Yeah, but they add juice to life. And then Gemini Nano is coming out in Chrome. They're still feature flagged, but you can try it now if you use the alpha.[00:43:40] swyx: And so I think, you know, we're getting the sort of GPU poor version of a lot of these things coming out, and I think it's quite useful. Like Windows as well, rolling out RWKV in sort of every Windows deployment is super cool. And I think the last thing that I never put in this GPU poor war, that I think I should now, [00:44:00] is the number of startups that are GPU poor but still scaling very well, as sort of wrappers on top of either a foundation model lab or a GPU cloud.[00:44:10] swyx: GPU cloud, it would be Suno. Suno, which Ramp rated as one of the top ranked, fastest growing startups of the year. I think the last public number is like zero to 20 million this year in ARR, and Suno runs on Modal. So Suno itself is not GPU rich, but they're just doing the training on Modal, who we've also talked to on the podcast.[00:44:31] swyx: The other one would be Bolt, a straight Claude wrapper. And again, now they've announced 20 million ARR, which is another step up from the 8 million that we put in the title. So yeah, I mean, it's crazy that all these GPU poors are finding a way while the GPU riches are also finding a way. And then the only failures, I kind of call this the GPU smiling curve, where the edges do well, because you're either close to the machines, and you're like [00:45:00] number one on the machines, or you're close to the customers, and you're number one on the customer side.[00:45:03] swyx: And the people who are in the middle, Inflection, Character, didn't do that great.
I think Character did the best of all of them. Like, you have a note in here that we apparently said that Character's price tag was[00:45:15] Alessio: 1B.[00:45:15] swyx: Did I say that?[00:45:16] Alessio: Yeah. You said Google should just buy them for 1B. I thought it was a crazy number.[00:45:20] Alessio: Then they paid 2.7 billion. I mean, for like,[00:45:22] swyx: yeah.[00:45:22] Alessio: What do you pay for Noam? Like, I don't know what the deal was like. Maybe the starting price was 1B. I mean, whatever it was, it worked out for everybody involved.[00:45:31] The Multi-Modality War[00:45:31] Alessio: Multimodality war. And this one, we never had text to video in the first version, which now is the hottest.[00:45:37] swyx: Yeah, I would say it's a subset of image, but yes.[00:45:40] Alessio: Yeah, well, but I think at the time it wasn't really something people were doing, and now we had Veo 2 just come out yesterday. Sora was released last month, last week. I've not tried Sora, because the day that I tried, it wasn't, yeah. I[00:45:54] swyx: think it's generally available now, you can go to Sora.com[00:45:56] swyx: and try it. Yeah, they had[00:45:58] Alessio: the outage. Which I [00:46:00] think also played a part into it. Small things. Yeah. What's the other model that you posted today that was on Replicate? Video 01 Live?[00:46:08] swyx: Yeah. Very, very nondescript name, but it is from MiniMax, which I think is a Chinese lab. The Chinese labs do surprisingly well at the video models.[00:46:20] swyx: I'm not sure it's actually Chinese. I don't know. Hold me to that. Yep. China. It's good. Yeah, the Chinese love video. What can I say? They have a lot of training data for video. Or a more relaxed regulatory environment.[00:46:37] Alessio: Well, sure, in some way. Yeah, I don't think there's much else there. I think, you know, on the image side, I think it's still open.[00:46:45] Alessio: Yeah, I mean,[00:46:46] swyx: ElevenLabs is now a unicorn. So basically, what is the multimodality war? The multimodality war is: do you specialize in a single modality, right? Or do you have a God model that does all the modalities? So this is [00:47:00] definitely still going, in the sense of ElevenLabs, you know, now a unicorn, Pika Labs doing well, they launched Pika 2.0[00:47:06] swyx: recently, HeyGen I think has reached 100 million ARR, Assembly I don't know, but they have billboards all over the place, so I assume they're doing very, very well. So these are all specialist models, specialist models and specialist startups. And then there's the big labs who are doing the sort of all-in-one play.[00:47:24] swyx: And then here I would highlight Gemini 2 for having native image output. Have you seen the demos? Yeah, it's hard to keep up. Literally they launched this last week, and a shout out to Paige Bailey, who came to the Latent Space event to demo on the day of launch. And she wasn't prepared. She was just like, I'm just going to show you.[00:47:43] swyx: So they have voice. They have, you know, obviously image input, and then they obviously can code gen and all that. But the new one that OpenAI and Meta both have but haven't launched yet is image output. So you can literally, I think their demo video was that you put in an image of a [00:48:00] car, and you ask for minor modifications to that car.[00:48:02] swyx: They can generate you that modification exactly as you asked.
So there's no need for the Stable Diffusion or ComfyUI workflow of, like, mask here and then infill there and inpaint there and all that stuff. This is small model nonsense. Big model people are like, huh, we've got you, it's all in the transformer.[00:48:21] swyx: This is the multimodality war, which is: do you bet on the God model, or do you string together a whole bunch of small models like a chump? Yeah,[00:48:29] Alessio: I don't know, man. Yeah, that would be interesting. I mean, obviously I use Midjourney for all of our thumbnails. They've been doing a ton on the product, I would say.[00:48:38] Alessio: They launched a new Midjourney editor thing. They've been doing a ton. Because I think, yeah, maybe, you know, people say the Black Forest models are better than Midjourney on a pixel by pixel basis. But I think when you put it together, have you tried[00:48:53] swyx: the same prompts on Black Forest?[00:48:55] Alessio: Yes. But the problem is just, you know, on Black Forest it generates one image, and then you've got to [00:49:00] regenerate. You don't have all these UI things. Like what I do, no, but it's a time issue, you know, it's like Mid[00:49:06] swyx: journey. Call the API four times.[00:49:08] Alessio: No, but then there's no, like, variations.[00:49:10] Alessio: Like the good thing about Midjourney is you just go in there and you're cooking. There's a lot of stuff that just makes it really easy. And I think people underestimate that. Like, it's not really a skill issue, because I'm paying Midjourney, so it's a Black Forest skill issue, because I'm not paying them, you know?[00:49:24] Alessio: Yeah,[00:49:25] swyx: so, okay, so this is a UX thing, right? Like, you understand that, at least, we think that Black Forest should be able to do all that stuff. I will also shout out that Recraft has come out on top of the image arena that Artificial Analysis has done, and has apparently displaced Flux. Is this still true?[00:49:41] swyx: So, Artificial Analysis is now a company. I highlighted them I think in one of the early AI Newses of the year. And they have launched a whole bunch of arenas. So they're trying to take on LM Arena, Anastasios and crew. And they have an image arena. Oh yeah, Recraft v3 is now beating Flux 1.1. Which is very surprising, [00:50:00] because Flux and Black Forest Labs are the old Stable Diffusion crew who left Stability after the management issues.[00:50:06] swyx: So Recraft has come from nowhere to be the top image model. Very, very strange. I would also highlight that Grok has now launched Aurora. It's very interesting dynamics between Grok and Black Forest Labs, because Grok's images were originally launched in partnership with Black Forest Labs, as a thin wrapper.[00:50:24] swyx: And then Grok was like, no, we'll make our own. And so they've made their own. I don't know, there are no APIs or benchmarks about it. They just announced it. So yeah, that's the multimodality war. I would say that so far the small model, the dedicated model people are winning, because they are just focused on their tasks.[00:50:42] swyx: But the big model people are always catching up. And the moment I saw the Gemini 2 demo of image editing, where I can put in an image and just request it and it does it, that's how AI should work. Not a whole bunch of complicated steps. So it really is something.
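As a rough illustration of the "call the API four times" point: with a hosted Flux endpoint you can approximate Midjourney-style variations by simply requesting several generations of the same prompt with different seeds. This sketch uses the Replicate Python client; the model slug and the exact output format are assumptions, so check the provider's docs before relying on it.

```python
# Sketch: emulate "variations" by calling a hosted Flux model several times.
# Assumes REPLICATE_API_TOKEN is set; model slug and output shape may differ.
import replicate

prompt = "podcast thumbnail, two hosts, neon NeurIPS venue, bold typography"

variations = []
for seed in range(4):  # four independent generations, Midjourney-grid style
    output = replicate.run(
        "black-forest-labs/flux-1.1-pro",   # assumed slug; substitute your provider's
        input={"prompt": prompt, "seed": seed},
    )
    variations.append(output)

for i, out in enumerate(variations):
    print(f"variation {i}: {out}")
```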
And I think one frontier that we haven't [00:51:00] seen this year, like obviously video has done very well, and it will continue to grow.[00:51:03] swyx: You know, we only have Sora Turbo today, but at some point we'll get full Sora. Or at least the Hollywood labs will get full Sora. We haven't seen video to audio, or video synced to audio. And so the researchers that I talked to are already starting to talk about that as the next frontier. But there's still maybe like five more years of video left to actually be solved.[00:51:23] swyx: I would say that Gemini's approach, compared to OpenAI, Gemini's, or DeepMind's, approach to video seems a lot more fully fledged than OpenAI's. Because if you look at the ICML recap that I published, that so far nobody has listened to... people have listened to it, it's just a definitely different audience.[00:51:43] swyx: It's only seven hours long. Why are people not listening? It's like everything's in there. So, DeepMind is working on Genie. They also launched Genie 2 and VideoPoet. So they have maybe four years' advantage on world modeling that OpenAI does not have. Because OpenAI basically only started [00:52:00] diffusion transformers last year, you know, when they hired Bill Peebles.[00:52:03] swyx: So DeepMind has a bit of advantage here, I would say. The reason that Veo 2, well, one, they cherry-pick their videos, so obviously it looks better than Sora, but the reason I would believe that Veo 2, when it's fully launched, will do very well is because they have all this background work in video that they've done for years.[00:52:22] swyx: Like last year's NeurIPS, I was already interviewing some of their video people. I forget their model name, but for people who are dedicated fans, they can go to NeurIPS 2023 and see that paper.[00:52:32] Alessio: And then last but not least, the LLM OS. We renamed it to RAG Ops, formerly known as[00:52:39] swyx: the RAG Ops War. I put the latest chart on the Braintrust episode.[00:52:43] swyx: I think I'm going to separate these essays from the episode notes. So the reason I used to do that, by the way, is because I wanted to show up on Hacker News. I wanted the podcast to show up on Hacker News. So I always put an essay inside of there, because Hacker News people like to read and not listen.[00:52:58] Alessio: So episode essays,[00:52:59] swyx: I remember [00:53:00] publishing them separately. You say LangChain and LlamaIndex are still growing.[00:53:03] Alessio: Yeah, so I looked at the PyPI stats, you know. I don't care about stars. On PyPI you see, do you want to share your screen? Yes. I prefer to look at actual downloads, not at stars on GitHub. So if you look, you know, LangChain is still growing.[00:53:20] Alessio: These are the last six months. LlamaIndex is still growing. What I've basically seen is that things that, one, obviously have a commercial product, so there are people buying this and sticking with it, versus kind of hopping in between things, versus, you know, for example, CrewAI, not really growing as much.[00:53:38] Alessio: The stars are growing, if you look on GitHub, but the usage is kind of flat. In the last six months, have they done some[00:53:4
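For the download numbers Alessio is pulling up, pypistats.org exposes a small JSON API, so a quick sanity check looks roughly like this (endpoint shape and field names are taken from its public docs, but treat them as assumptions):

```python
# Sketch: compare recent PyPI downloads for a few agent/RAG frameworks.
# Uses the public pypistats.org JSON API; response fields may change over time.
import requests

for package in ["langchain", "llama-index", "crewai"]:
    resp = requests.get(f"https://pypistats.org/api/packages/{package}/recent", timeout=10)
    data = resp.json().get("data", {})
    print(f"{package}: {data.get('last_month')} downloads in the last month")
```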
Happy holidays! We'll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all all our LS supporters who helped fund the gorgeous venue and A/V production!For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver.Of perennial interest, particularly at academic conferences, is scaled-up architecture research as people hunt for the next Attention Is All You Need. We have many names for them: “efficient models”, “retentive networks”, “subquadratic attention” or “linear attention” but some of them don't even have any lineage with attention - one of the best papers of this NeurIPS was Sepp Hochreiter's xLSTM, which has a particularly poetic significance as one of the creators of the LSTM returning to update and challenge the OG language model architecture:So, for lack of a better term, we decided to call this segment “the State of Post-Transformers” and fortunately everyone rolled with it.We are fortunate to have two powerful friends of the pod to give us an update here:* Together AI: with CEO Vipul Ved Prakash and CTO Ce Zhang joining us to talk about how they are building Together together as a quote unquote full stack AI startup, from the lowest level kernel and systems programming to the highest level mathematical abstractions driving new model architectures and inference algorithms, with notable industry contributions from RedPajama v2, Flash Attention 3, Mamba 2, Mixture of Agents, BASED, Sequoia, Evo, Dragonfly, Dan Fu's ThunderKittens and many more research projects this year* Recursal AI: with CEO Eugene Cheah who has helped lead the independent RWKV project while also running Featherless AI. This year, the team has shipped RWKV v5, codenamed Eagle, to 1.5 billion Windows 10 and Windows 11 machines worldwide, to support Microsoft's on-device, energy-usage-sensitive Windows Copilot usecases, and has launched the first updates on RWKV v6, codenamed Finch and GoldFinch. On the morning of Latent Space Live, they also announced QRWKV6, a Qwen 32B model modified with RWKV linear attention layers. We were looking to host a debate between our speakers, but given that both of them were working on post-transformers alternativesFull Talk on YoutubePlease like and subscribe!LinksAll the models and papers they picked:* Earlier Cited Work* Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention* Hungry hungry hippos: Towards language modeling with state space models* Hyena hierarchy: Towards larger convolutional language models* Mamba: Linear-Time Sequence Modeling with Selective State Spaces* S4: Efficiently Modeling Long Sequences with Structured State Spaces* Just Read Twice (Arora et al)* Recurrent large language models that compete with Transformers in language modeling perplexity are emerging at a rapid rate (e.g., Mamba, RWKV). Excitingly, these architectures use a constant amount of memory during inference. 
However, due to the limited memory, recurrent LMs cannot recall and use all the information in long contexts leading to brittle in-context learning (ICL) quality. A key challenge for efficient LMs is selecting what information to store versus discard. In this work, we observe the order in which information is shown to the LM impacts the selection difficulty. * To formalize this, we show that the hardness of information recall reduces to the hardness of a problem called set disjointness (SD), a quintessential problem in communication complexity that requires a streaming algorithm (e.g., recurrent model) to decide whether inputted sets are disjoint. We empirically and theoretically show that the recurrent memory required to solve SD changes with set order, i.e., whether the smaller set appears first in-context. * Our analysis suggests, to mitigate the reliance on data order, we can put information in the right order in-context or process prompts non-causally. Towards that end, we propose: (1) JRT-Prompt, where context gets repeated multiple times in the prompt, effectively showing the model all data orders. This gives 11.0±1.3 points of improvement, averaged across 16 recurrent LMs and the 6 ICL tasks, with 11.9× higher throughput than FlashAttention-2 for generation prefill (length 32k, batch size 16, NVidia H100). We then propose (2) JRT-RNN, which uses non-causal prefix-linear-attention to process prompts and provides 99% of Transformer quality at 360M params., 30B tokens and 96% at 1.3B params., 50B tokens on average across the tasks, with 19.2× higher throughput for prefill than FA2.* Jamba: A 52B Hybrid Transformer-Mamba Language Model* We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. * Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. * This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU.* Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. * We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.* SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers* We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. 
Core designs include: * (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. * (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. * (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. * (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. * As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. * RWKV: Reinventing RNNs for the Transformer Era* Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. * We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.* Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintains constant computational and memory complexity during inference. * We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.* LoLCATs: On Low-Rank Linearizing of Large Language Models* Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. * We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitudes less memory and compute. * We base these steps on two findings. * First, we can replace an LLM's softmax attentions with closely-approximating linear attentions, simply by training the linear attentions to match their softmax counterparts with an output MSE loss ("attention transfer").* Then, this enables adjusting for approximation errors and recovering LLM quality simply with low-rank adaptation (LoRA). * LoLCATs significantly improves linearizing quality, training efficiency, and scalability. 
We significantly reduce the linearizing quality gap and produce state-of-the-art subquadratic LLMs from Llama 3 8B and Mistral 7B v0.1, leading to 20+ points of improvement on 5-shot MMLU. * Furthermore, LoLCATs does so with only 0.2% of past methods' model parameters and 0.4% of their training tokens. * Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50x larger than prior work). * When compared with prior approaches under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between linearized and original Llama 3.1 70B and 405B LLMs by 77.8% and 78.1% on 5-shot MMLU.Timestamps* [00:02:27] Intros* [00:03:16] Why Scale Context Lengths? or work on Efficient Models* [00:06:07] The Story of SSMs* [00:09:33] Idea 1: Approximation -> Principled Modeling* [00:12:14] Idea 3: Selection* [00:15:07] Just Read Twice* [00:16:51] Idea 4: Test Time Compute* [00:17:32] Idea 2: Hardware & Kernel Support* [00:19:49] RWKV vs SSMs* [00:24:24] RWKV Arch* [00:26:15] QRWKV6 launch* [00:30:00] What's next* [00:33:21] Hot Takes - does anyone really need long context?Transcript[00:00:00] AI Charlie: We're back at Latent Space Live, our first mini conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co-host. As a special treat this week, we're recapping the best of 2024 going domain by domain. We sent out a survey to the over 900 of you who told us what you wanted, and then invited the best speakers in the Latent Space Network to cover each field.[00:00:24] AI Charlie: 200 of you joined us in person throughout the day, with over 2200 watching live online. Thanks! Our next keynote covers the state of transformer-alternative architectures, with a special joint presentation with Dan Fu of Together AI and Eugene Cheah of Recursal AI and Featherless AI. We've featured both Together and Recursal on the pod before, with CEO Vipul Ved Prakash introducing them[00:00:49] AI Charlie: and CTO Ce Zhang joining us to talk about how they are building Together together as a quote unquote full stack AI startup, from the lowest level kernel and systems [00:01:00] programming to the highest level mathematical abstractions driving new model architectures and inference algorithms, with notable industry contributions from RedPajama v2, Flash Attention 3, Mamba 2, Mixture of Agents,[00:01:15] AI Charlie: BASED, Sequoia, Evo, Dragonfly, Dan Fu's ThunderKittens, and many more research projects this year. As for Recursal and Featherless, we were the first podcast to feature RWKV last year, and this year the team has shipped RWKV v5, codenamed Eagle, to 1.5 billion Windows 10 and Windows 11 machines worldwide to support Microsoft's on-device, energy-usage-sensitive Windows Copilot use cases, and has launched the first updates on RWKV v6, codenamed Finch and GoldFinch.[00:01:53] AI Charlie: On the morning of Latent Space Live, they also announced QRWKV6, a Qwen 32B model [00:02:00] modified with RWKV linear attention layers. Eugene has also written the single most popular guest post on the Latent Space blog this year. Yes, we do take guest posts, on what he has discovered about the H100 GPU inference NeoCloud market since the successful launch of Featherless AI this year.[00:02:20] AI Charlie: As always, don't forget to check the show notes for the YouTube link to their talk as well as their slides. Watch out and take care.[00:02:27] Intros[00:02:27] Dan Fu: Yeah, so thanks so much for having us.
So this is going to be a little bit of a two part presentation. My name is Dan. I'm at Together AI, and I'll be joining UCSD as faculty in about a year. And Eugene, you want to introduce yourself?[00:02:46] Eugene Cheah: Eugene, I lead the RWKV team, and I'm CEO of Featherless, and we both work on this new post-transformer architecture space.[00:02:55] Dan Fu: Yeah, so today we're really excited to talk to you a little bit [00:03:00] about that. So first I'm going to give a broad overview of kind of the last few years of progress in post-transformer architectures. And then afterwards Eugene will tell us a little bit about the latest and greatest frontier models in this space.[00:03:16] Why Scale Context Lengths? or work on Efficient Models[00:03:16] Dan Fu: So, the story starts with scaling. So this is probably a figure, or something like this, that you've seen very recently. Over the last five to six years, we've seen models really scale up in parameter size, and that's brought with it a bunch of new capabilities, like the ability to talk to you and tell you sometimes how to use your Colab screens.[00:03:35] Dan Fu: But another place where we've seen scaling, especially recently, is scaling in context length. So this can mean having more text inputs for your models, but it can also mean things like taking a lot of visual token inputs, image inputs, to your models, or generating lots of outputs. And one thing that's been really exciting over the last few months or so is that we're seeing scaling not only during training time, but also [00:04:00] during test time.[00:04:00] Dan Fu: So this is the iconic image from the OpenAI o1 release. Not only are we starting to scale train time compute, but we're also starting to scale test time compute. Now if you're familiar with our attention and our transformer architectures today, this graph on the right might look a little bit scary.[00:04:19] Dan Fu: And one of the reasons is that the implications are a little bit interesting. So what does it mean if we want to continue having smarter and smarter models? Do we just need to start building bigger and bigger data centers, spending more flops? Is this little DALL-E 3 "we need more flops, guys" going to be the future of all of AI?[00:04:39] Dan Fu: Or is there a better way, another path forward? Maybe we can get the same capabilities that we've gotten used to, but for a lot less compute, a lot less flops. And one of the things that we're going to talk about today is specifically looking at that core attention operator in some of these models.[00:04:57] Dan Fu: And the reason is that, so this is just some [00:05:00] basic, you know, scaling curves, but attention has compute that scales quadratically in the context length. So that means that if you're doing something like test time compute and you want to spend a bunch of tokens thinking about what comes next, the longer that goes, the more tokens you spend on that, and that compute grows quadratically.[00:05:19] Dan Fu: One of the questions that we're interested in is, can we take that basic sequence model, that basic sequence primitive at the bottom, and get it to scale better? Can we scale in, let's say, n to the 3 halves or n log n? So in the first part of the talk, we just went over the introduction.
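As a back-of-the-envelope illustration of why that quadratic term matters once you start spending lots of test-time tokens, here is the raw growth of n squared against n log n (pure arithmetic, no claims about any specific model):

```python
# Relative cost of the n x n attention score matrix versus an n log n alternative,
# ignoring constants, just to show how fast the gap opens up with context length.
import math

for n in [4_096, 32_768, 262_144, 1_048_576]:
    quadratic = n * n
    nlogn = n * math.log2(n)
    print(f"n={n:>9,}  n^2 / (n log n) ≈ {quadratic / nlogn:,.0f}x")
```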
What I'm gonna do over the next few slides is just talk about some of the key advances and ideas that have shown over the past few years since maybe early 2020 to, to now that shown promise that this might actually be possible.[00:05:48] Dan Fu: That you can actually get potentially the same quality that we want while scale, while scaling better. So to do that, we're and, and basically the, the story that we're gonna look is we're gonna start to see [00:06:00] how. So this is a basic graph of just the past couple years of progress of perplexity where that blue line, that dotted blue line, is attention.[00:06:07] The Story of SSMs[00:06:07] Dan Fu: It's your basic transformer, full dense attention. And then the dots coming down are some of the methods that you'll see in this presentation today. We're going to turn the clock back all the way to 2020. So this, this, this question of can we make attention subquadratic? Basically, as soon as we said attention is all you need, People started asking this question.[00:06:28] Dan Fu: So we have this quadratic attention operator. Can we do better? I'll briefly talk about why attention is quadratic. And the basic thing that happens, if you're not familiar, is that you have these inputs, these keys and queries. And what you do in this attention matrix, this S matrix over here, is that you're using, you're comparing every token in your input to every other token.[00:06:49] Dan Fu: So when I try to do something like upload a whole book to Gemini, what happens beyond the Maybe not Gemini, because we don't necessarily know what architecture is. But let's say we upload it to LLAMA, what happens beyond [00:07:00] the scenes, behind the scenes, is that it's going to take every single word in that book and compare it to every other word.[00:07:05] Dan Fu: And this has been a really, it's, it's led to some pretty impressive things. But it's kind of a brute forcing of the way that you would try to interpret a interpret something. And what attention does in particular is the, and then what attention, sorry, don't want to. Okay, no, no laser pointer. What, what attention does afterwards is that instead of always operating in this quadratic thing, it takes a row wise softmax over this matrix, and then multiplies it by this values matrix.[00:07:32] Dan Fu: So, one of the key points to notice is that the output size is always going to be the same as the inputs, at least in standard self attention. So one of the first things that folks tried to do around 2020 is this thing called linear attention, which is just, just noticing that if we take out this softmax from here, if we take out this non linearity in the middle of the attention operation, and then if you compute the keys and the values operation first, you actually never hit this quadratic bottleneck.[00:07:57] Dan Fu: So that, that's potentially a way [00:08:00] to get a lot more computationally efficient. And there are various ways to do this by basically using feature maps or try to approximate this overall attention computation. But some of this work sort of started to hit a wall in 2020. And the basic challenges were, were two.[00:08:16] Dan Fu: So one was quality. It was back then, it was kind of hard to, to get good quality with these linear attention operators. The other one was actually hardware efficiency. So these, this feature map that was just shown by a simplify simplify here. 
Actually ends up being quite computationally expensive if you just implement it naively.[00:08:34] Dan Fu: So you started having these operators where not only are you not really sure if they have the same quality, but they're also actually just wall clock slower. So you kind of end up getting the worst of both worlds. So that kind of sets the stage for four years ago. Keep this in mind, because linear attention is actually going to come back in a few years once we have a better understanding. But one of the works that started kicking off this [00:09:00] mini revolution in post-transformer architectures was this idea called state space models. So here the seminal work is the S4 work in 2022. And this piece of work really brought together a few ideas from some long running research lines of work. The first one, and this is really one of the keys to closing the gap in quality, was just using things that, if you talk to an electrical engineer off the street, they might know off the back of their hand.[00:09:33] Idea 1: Approximation -> Principled Modeling[00:09:33] Dan Fu: But taking some of those properties of how we model dynamical systems in signal processing, and then using those ideas to model the inputs, the text tokens, in, for example, a transformer-like next token prediction architecture. So some of those early state space model papers were looking at this relatively simple recurrent update model that comes from maybe chapter one of a signal processing class. But then using [00:10:00] some principled theory about how you should do that recurrent update, in order to really get the most that you can out of your hidden state, out of your sequence. So that was one key idea for quality. And when this was eventually realized, you started to see a bunch of benchmarks that had been pretty sticky for a few years, things like Long Range Arena, some long sequence evaluation benchmarks, there was stuff in time series analysis. You started to see the quality tick up in meaningful ways. But the other key thing that's so influential about these state space models is that they also had a key idea about how you can compute these things efficiently. So if you go back to your machine learning 101 class where you learned about RNNs, one thing that you may have learned is that they don't parallelize as well as attention, because if you just run them naively, you have to do this kind of sequential update to process new tokens, [00:11:00] whereas in attention, you can process all the tokens in parallel at one time. One of the key insights behind the S4 paper was that these recurrent models, you could take them and you could also formulate them as a convolution. And in particular, with a convolution, instead of using a PyTorch conv1d operation, you can compute that with the FFT. And that would give you n log n compute in the sequence length n, with an operator that was relatively well optimized for modern hardware. So those are really, I'd say, the two key ideas in 2022 that started allowing these breakthroughs to happen in these non-transformer architectures.
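The recurrence-as-convolution trick Dan just described can be shown in a few lines: a toy scalar state space model, evaluated once as a sequential recurrence and once as an FFT-based convolution with kernel k[t] = C * A^t * B. This is a sketch of the idea only, not the actual S4 parameterization.

```python
# Toy: a 1-D linear state space recurrence equals a convolution with kernel
# k[t] = C * A^t * B, which FFT-based convolution evaluates in O(n log n).
import numpy as np

n = 1024
A, B, C = 0.9, 1.0, 0.5           # scalar SSM parameters (toy values)
x = np.random.randn(n)

# Recurrent view: O(n) strictly sequential steps.
h, y_rec = 0.0, np.zeros(n)
for t in range(n):
    h = A * h + B * x[t]
    y_rec[t] = C * h

# Convolutional view: build the kernel once, then convolve via FFT.
k = C * (A ** np.arange(n)) * B
fft_len = 2 * n                    # zero-pad to avoid circular wrap-around
y_conv = np.fft.irfft(np.fft.rfft(x, fft_len) * np.fft.rfft(k, fft_len), fft_len)[:n]

assert np.allclose(y_rec, y_conv, atol=1e-8)
```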
So, these ideas about how to principally model sorry, how to model the recurrent updates of a mo of, of a sequence in a principled way, and also these key ideas in how you can compute it efficiently by turning it into a convolution and then scaling it up with the FFT.[00:11:53] Dan Fu: Along those same lines, so afterwards we started putting out some work on specialized kernels, so just [00:12:00] like we have flash attention for transformers, we also have works like flash fft conf, and if you look at these lines of work oftentimes when, whenever you see a new architecture, you see a new primitive one of the, one of the table stakes now is, do you have an efficient kernel so that you can actually get wall clock speed up?[00:12:14] Idea 3: Selection[00:12:14] Dan Fu: So by 2022, We are starting to have these models that had promising quality primitives, but and, and also promising wall clocks. So you could actually see regimes where they were better than transformers in meaningful ways. That being said, there were, there's still sometimes a quality gap, particularly for language modeling.[00:12:33] Dan Fu: And because languages, It's so core to what we do in sequence modeling these days the, the next, the next key idea that I'm going to talk about is this idea of selection mechanisms. And this is basically an idea of, so you have this recurrent state that you're keeping around that just summarizes everything that, that came before.[00:12:50] Dan Fu: And to get a good sequence model, one of the things that you really need to be able to do is have the model learn what's the best way to pick out pieces from that recurrent [00:13:00] state. So one of the, one of the major ideas here in a line of work called H3, Hungry Hungry Hippos, and also these hyena models were One way you can do this is by just adding some simple element wise gates.[00:13:13] Dan Fu: So versions of these ideas have been around for decades. If you squint at the LSTM paper you, you can probably find, find this gating mechanism. But turns out you can take those old ideas, add them into these new. state space models, and then you can see quality start to pick up. If you've heard of the Mamba model, this also takes the selection to the next level by actually making some changes in that fundamental recurrent state space.[00:13:40] Dan Fu: So, it's not only just this gating that happens around the SSM layer, but also you can actually make The ABCD matrices of your state space model, you can make them data dependent, which will allow you to even better select out different pieces from your hidden state depending on what you're seeing. I'll also point out if you look at the [00:14:00] bottom right of this figure, there's this little triangle with a GPU SRAM, GPU HBM, and this, this is just continuing that trend of when you have a new architecture you, you, you also release it with a kernel to, to, to show that it is hardware efficient, that it, that it can be hardware efficient on modern hardware.[00:14:17] Dan Fu: The, the, one of the next cool things that happened is once we had this understanding of these are the basic pieces, these are the basic principles behind some of the sequence models linear attention actually started to come back. 
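And since linear attention is about to re-enter the story, here is the basic associativity trick from earlier in the talk as a sketch: dropping the softmax lets you contract keys and values first, so you never materialize the n x n score matrix (single head, no feature map, purely illustrative).

```python
# Sketch of the linear-attention reordering: softmax(QK^T)V must form an (n, n)
# matrix, costing O(n^2 d); without the softmax, Q(K^T V) costs only O(n d^2).
import numpy as np

n, d = 2_048, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))

# Quadratic path: materialise the n x n score matrix.
scores = Q @ K.T                 # (n, n)
out_quadratic = scores @ V       # O(n^2 d)

# Linear path: associativity lets us contract over the sequence dimension first.
kv = K.T @ V                     # (d, d), independent of n
out_linear = Q @ kv              # O(n d^2)

assert np.allclose(out_quadratic, out_linear, atol=1e-6)
```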
So in earlier this year, there was a model called BASED the, from Simran Arora and, and some other folks, that combined a more principled version of linear attention that basically the, the, the, the two second summary is that it used a Taylor approximation of the softmax attention, combined that with a simple sliding window attention and was starting to able, starting to be able to expand the Pareto frontier of how much data can you recall from your sequence, versus how small is your recurrent state size.[00:14:58] Dan Fu: So those orange dots [00:15:00] are, at the top there, are just showing smaller sequences that can recall more memory.[00:15:07] Just Read Twice[00:15:07] Dan Fu: And the last major idea I think that has been influential in this line of work and is very relatively late breaking just a few months ago, is just the basic idea that when you have these models that are fundamentally more efficient in the sequence length, you maybe don't want to prompt them or use them in exactly the same way.[00:15:26] Dan Fu: So this was a really cool paper called Just Read Twice, also from Simran. That basically said, hey, all these efficient models can process tokens so much more efficiently than transformers that they can sometimes have unfair advantages compared to a simple transformer token. So, or sorry, a simple transformer model.[00:15:44] Dan Fu: So take, for example the standard, the standard use case of you have some long document, you're going to pass it in as input, and then you're going to ask some question about it. One problem you might imagine for a recurrent model where you have a fixed state size is, let's say that [00:16:00] you're. Article is very long, and you're trying to ask about some really niche thing.[00:16:04] Dan Fu: You can imagine it might be hard for the model to know ahead of time what information to put into the hidden state. But these, these, these models are so much more efficient that you can do something really stupid, like, you can just put the document write down the document, write down the question, write down the document again, and then write down the question again, and then this time, the second time that you go over that document, you know exactly what to look for.[00:16:25] Dan Fu: And the cool thing about this is, so this is, And this this results in better quality, especially on these recall intensive tasks. But the other interesting thing is it really takes advantage of the more efficient architectures that, that we're having here. So one of the other, I think, influential ideas in this line of work is if you change the fundamental compute capabilities of your model and the way that it scales, you can actually start to query it at test time differently.[00:16:51] Idea 4: Test Time Compute[00:16:51] Dan Fu: And this actually, of course, goes back to those slides on test time compute. So while everybody's looking at, say, test time compute for big transformer models, [00:17:00] I think potentially a really interesting research question is, how can you take those and how does it change with this new next generation of models?[00:17:09] Dan Fu: So the, I'll just briefly summarize what some of those key ideas were and then talk and then show you briefly kind of what the state of the art is today. 
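Before moving on, the Just Read Twice trick from a moment ago is literally as simple as it sounds; a sketch of the prompt construction (the template wording here is mine, not the paper's):

```python
# Sketch of JRT-style prompting: repeat the context so a fixed-state recurrent
# model sees the document again *after* it already knows what the question is.
def jrt_prompt(document: str, question: str, repeats: int = 2) -> str:
    block = f"Document:\n{document}\n\nQuestion: {question}\n\n"
    return block * repeats + "Answer:"

prompt = jrt_prompt(
    document="(a very long report about reservoir levels...)",
    question="Which reservoir dropped below 40% capacity in August?",
)
# send `prompt` to your recurrent LM of choice
```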
So, so the four key ideas are instead of just doing a simple linear attention approximation, instead take ideas that we know from other fields like signal processing, do a more principled approach to your modeling of the sequence.[00:17:32] Idea 2: Hardware & Kernel Support[00:17:32] Dan Fu: Another key idea throughout all these lines of work is you really want. Hardware and kernel support from day one. So, so even if your model is theoretically more efficient if somebody goes and runs it and it's two times slower one of the things that, that we've learned is that if, if you're in that situation, it's, it's just gonna be dead on arrival.[00:17:49] Dan Fu: So you want to be designing your architectures one of the key, key machine learning ideas that has been important for the quality is just making sure that you encode different ways that you can [00:18:00] select from your hidden state and, and really focus on that as a key decider of quality. And finally, I think one of the, the, the emerging new, new things for, for this line of work and something that's quite interesting is, What are the right test time paradigms for these models?[00:18:15] Dan Fu: How do they change relative to relative to what you might do for a standard transformer? I'll briefly end this section. So I've labeled this slide where we are yesterday because Eugene is going to talk about some new models that he released literally this morning. But as of yesterday, some of the really cool results out of the, these efficient alternative models were so AI2 trained this hybrid MOE called Jamba.[00:18:40] Dan Fu: That, that, that seems, that is currently the state of the art for these non transformer architectures. There's this NVIDIA and MIT put out this new diffusion model called SANA recently that one of their key key observations is that you can take a standard diffusion transformer diffusion model, replace the layers with linear [00:19:00] attention, and then that lets you scale to much larger much larger images, much, much Much larger sequences more efficiently.[00:19:07] Dan Fu: And and one thing that I don't think anybody would have called when a few years ago is that one of those gated SSM, gated states based models ended up on the cover of Science because a great group of folks went and trained some DNA models. So that's Michael Polley, Eric Yuen from from Stanford and the Arc Institute.[00:19:26] Dan Fu: So it's, we're really at an exciting time in 2024 where these non transformer, post transformer architectures are showing promise across a wide range. Across a wide range of, of modalities, of applications, and, and of tasks. And with that, I'll pass it on to Eugene, who can tell you a little bit about the latest and greatest with RWKV.[00:19:49] RWKV vs SSMs[00:19:49] Eugene Cheah: So, that's useful? Yeah. You're talking to here. Oh, I'm talking to here. Okay. So, yeah, two streams. Yeah. So, I think one common questions that we tend to get asked, right, is what's the difference between [00:20:00] RWKV and state space? So I think one of the key things to really understand, right the difference between the two groups, right, is that we are actually more like an open source, random internet meets academia kind of situation.[00:20:11] Eugene Cheah: Like, most of us never wrote any paper, but we, we basically look at RNNs and linear intention when intention is all you need came out, and then we decided to like, hey there is a quadratic scaling problem. Why don't we try fixing that instead? 
So we end up developing our own branch, but we end up sharing ideas back and forth.[00:20:30] Eugene Cheah: And we do all this actively in Discord, GitHub, etc. This was so bad for a few years, right, that basically the average group's h-index was so close to zero that EleutherAI actually came in and helped us write our first paper. Great, now our h-index is three, apparently. But the thing is, a lot of these experiments led to results, and essentially we took the same ideas from linear attention, [00:21:00] and we built on it.[00:21:01] Eugene Cheah: So, to take a step back into how RWKV handles its own attention mechanic and achieves the same goals of, like, O(n) compute, respectively, and in focus of our overall goal to make AI accessible to everyone, regardless of language, nation, or compute, that's our goal. We actually train our models primarily on over a hundred languages, which is another topic altogether.[00:21:23] Eugene Cheah: And our goal is to train to even 200 languages to cover all languages in the world. But at the same time, we work on this architecture to lower the compute cost, so that people can run it on Raspberry Pis and on anything. So, how did RWKV break the dependency of LSTM token flow? Because I think to understand the architecture, right, it's probably easier to understand it from the RNN lens.[00:21:46] Eugene Cheah: Because that's where we built on. Whereas state space kind of, like, tried to start anew and took lessons from that, so there's a little bit of divergence there. And, AKA, this is our version of linear attention. So to take a step back: [00:22:00] all foundation models, be it transformers or non-transformers, at a very high level, right,[00:22:05] Eugene Cheah: pump in the tokens, I mean text, turn things into embeddings, and go through a lot of layers. They generate a lot of states, whether that's the QKV cache or RNN states or RWKV states, and output an embedding, and they are not the same thing. And we just take more layers and more embeddings, and somehow that magically works.[00:22:23] Eugene Cheah: So, if you remember your ancient RNN lessons, which we call best learning these days, the general idea is that you have the embedding information flowing all the way up, and you take that information and you flow it back down, and then you process it as part of your LSTM layers.[00:22:41] Eugene Cheah: So, this is how it generally works. Karpathy is quoted saying that RNNs are actually unreasonably effective. The problem is this is not scalable. To start doing work on the second token, you need to wait for the first token. And likewise for the third token and fourth token, yada yada.[00:22:55] Eugene Cheah: That is CPU land, not GPU land. So you [00:23:00] can have an H100 and you can't even use 1 percent of it. So that's kind of why RNNs didn't really take off in the direction that we wanted, like, billions of parameters, when it comes to training. So, what did RWKV version 0 do? Boom. We just did the dumbest, lamest thing.[00:23:13] Eugene Cheah: Sorry, this is the bottleneck for RNNs. We did the dumb thing of removing that line. And it kind of worked. It trained. It sucked, but it kind of worked. Then we were like, hey, no one cared because the loss was crap, but how do we improve that?
And that's essentially where we moved forward, because if you see this kind of flow, right, you can actually get your GPU saturated quickly, where it essentially cascades respectively.[00:23:41] Eugene Cheah: So I'm just waiting for this to loop again. So it's like, once you get your first layer and your first token computed, you start to cascade your compute all the way until you're like, hey, I'm using 100 percent of the GPU. So we worked on it, and we started going along the principle that, as long as we keep this general architecture [00:24:00] where we can cascade and be highly efficient with our architecture, nothing is sacred in our architecture.[00:24:06] Eugene Cheah: And we have done some crazy ideas. In fact, if you ask me to explain some things in the paper, right, officially in the paper I'll say we had this idea and we wrote it this way. The reality is someone came with the code, we tested it, it worked, and then we rationalized later.[00:24:24] RWKV Arch[00:24:24] Eugene Cheah: So the general idea behind RWKV is that we have two major blocks: we call them time mix and channel mix.[00:24:30] Eugene Cheah: And time mix generally handles long term memory states, where essentially we apply the matrix multiplication and SiLU activation functions to processing an input embedding and an output embedding. I'm oversimplifying it, because this calculation has changed every version, and we have, like, version 7 right now.[00:24:50] Eugene Cheah: ChannelMix is similar to BASED in the sense that it does shorter term attention, where it just looks at the sister token, or the token before it, because [00:25:00] there's a shift in the token shift matrix. I don't really want to go too much into the papers themselves, because we do have three papers on this.[00:25:09] Eugene Cheah: Basically: RWKV, RNN for the Transformer Era; Eagle and Finch, RWKV with matrix-valued states, which is the updated version 5 and version 6; and GoldFinch, which is our hybrid model, respectively. We are already writing the paper for version 7, RWKV-7, named Goose. Our architectures are named after birds.[00:25:30] Eugene Cheah: And I'm going to cover as well QRWKV, and Mamba 100k, and RWKV, and where did that lead to? Great! Because we are all GPU poor, and to be clear, most of this research is done on only a handful of H100s, which one Google researcher told me was, like, his experiment budget for a single researcher.[00:25:48] Eugene Cheah: So our entire organization has less compute than a single researcher at Google. So one of the things that we explored was, how do we convert transformer models instead? Because [00:26:00] someone already paid that billion dollars, million dollars of training, so why don't we take advantage of those weights?[00:26:05] Eugene Cheah: And I believe Together AI worked on LoLCATs for the Llama side of things, and we took some ideas from there as well, and we essentially did that for RWKV.[00:26:15] QRWKV6 launch[00:26:15] Eugene Cheah: And that led to QRWKV6, which we just dropped today, a 32B Instruct preview model, where we took the Qwen 32B Instruct model, froze the feedforward layer, removed the QKV attention layer, and replaced it with RWKV linear layers.[00:26:32] Eugene Cheah: So to be clear, this means we do not have the RWKV channel mix layer, we only have the time mix layer.
But once we do that, we train the RWKV layers. What's important is that the feedforward layer needs to be frozen, so the new attention can be learned. And then we unfreeze the feedforward layer and train all the layers together with a custom learning rate schedule, so that they can learn how to work together.[00:26:54] Eugene Cheah: The end result, surprisingly, and to be honest to the frustration of the RWKV [00:27:00] MoE team, which ended up releasing their model on the same day, was that, with just a few hours of training on two nodes, we managed to get it to be on par, kind of, with the original Qwen 32B model. So, in fact, the first run completely confused us, and I was telling Daniel Goldstein (Smerky), who kind of leads most of our research coordination: when you pitched me this idea, you told me at best you'll get the same level of performance.[00:27:26] Eugene Cheah: You didn't tell me the challenge score and Winogrande score would shoot up. I don't know what's happening there. But it did. The MMLU score dropping, that was expected. Because if you think about it, when we were training all the layers, right, we were essentially, like, Frankensteining this thing, and we did brain damage to the feedforward network layer too with the new RWKV layers.[00:27:47] Eugene Cheah: But, 76%, hey, somehow it's retained, and we can probably further train this. We didn't even spend more than 3 days training this, so there's a lot more that can be done, hence the preview. This brings up [00:28:00] a big question, because we are already now in the process of converting the 72B. This is actually extremely compute-efficient for testing our attention mechanic.[00:28:10] Eugene Cheah: It's like, it becomes a shortcut. We are already planning to do our version 7 and our hybrid architecture with it, because we don't need to train from scratch, and we get a really good model out of it. And the other thing that is uncomfortable to say, because we are doing the 72B right now, is that if this scales correctly to 128k context length, and I'm not even talking about a million, just 128k, the majority of enterprise workload today is just on 70B at under 32k context length.[00:28:41] Eugene Cheah: That means if this works and the benchmarks match, it means we can replace the vast majority of current AI workloads, unless you want super long context. And then, sorry, can someone give us more GPUs? Because we do need the VRAM for super long context, sadly. So yeah, that's what we are working on, and essentially [00:29:00] we are excited to just push this further.[00:29:02] Eugene Cheah: And this conversion process, to be clear, I don't think is going to be exclusive to RWKV. It probably will work for Mamba as well, I don't see why not. And we will probably see more ideas, or more experiments, or more hybrids. Yeah, like, one of the weirdest things that I wanted to say outright, and I confirmed this with the Black Mamba team and the Jamba team, because we did the GoldFinch hybrid model, is that none of us understand why a hard hybrid, with a state-based model, be it RWKV or state space, and a transformer, performs better than the baseline of both.[00:29:28] Eugene Cheah: It's like, when you train one, you expect, and then you replace, you expect the same results. That's our pitch. That's our claim. But somehow when we jam both together, it outperforms both.
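A hedged sketch of the two-stage conversion recipe described here, in PyTorch-style pseudocode; `model.layers`, `make_time_mix` and `train_steps` are placeholders standing in for the real Qwen modules, RWKV time-mix block and training loop, none of which are shown in the talk.

```python
# Hypothetical sketch of the "freeze, swap attention, then unfreeze" conversion.
import torch

def convert_and_finetune(model, make_time_mix, train_steps):
    # Stage 1: freeze everything, then swap each attention block for an RWKV
    # time-mix block and train only the new blocks (the frozen feedforward
    # forces the new "attention" to produce representations the rest of the
    # network already expects).
    for p in model.parameters():
        p.requires_grad = False
    for layer in model.layers:                  # placeholder attribute
        layer.self_attn = make_time_mix(layer)  # replace QKV attention
        for p in layer.self_attn.parameters():
            p.requires_grad = True
    train_steps(model, lr=1e-4)                 # placeholder training loop

    # Stage 2: unfreeze the feedforward layers and train all layers together
    # with a gentler learning rate schedule so old and new blocks cooperate.
    for p in model.parameters():
        p.requires_grad = True
    train_steps(model, lr=1e-5)
    return model
```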
And that's one area where, like, we only have four experiments, plus four teams, so a lot more needs to be done.[00:29:51] Eugene Cheah: But these are things that excite me, essentially, because that is potentially where we can move ahead. Which brings us to what comes next.[00:30:00] What's next[00:30:00] [00:30:00][00:30:00] Dan Fu: So, this part is kind of just where we'll talk a little bit about stuff that we're excited about. Maybe have some wild speculation on what's coming next.[00:30:12] Dan Fu: And, of course, this is also the part that will be more open to questions. So, a couple things that I'm excited about: one is continued hardware-model co-design for these models. So one of the things that we've put out recently is this library called ThunderKittens. It's a CUDA library.[00:30:29] Dan Fu: And one of the things that we found frustrating is that every time we built one of these new architectures, and I'm sure you had the exact same experience, we'd have to go and spend two months in CUDA land, like, writing these new efficient things. And if we decided to change one thing in PyTorch, like, one line of PyTorch code is like a week of CUDA code at least.[00:30:47] Dan Fu: So one of our goals with a library like ThunderKittens was, we just broke down what are the key principles, what are the key hardware things, what are the key compute pieces that you get from the hardware. So for example, on the [00:31:00] H100, everything really revolves around a warp group matrix multiply operation.[00:31:06] Dan Fu: So you really want your operation to be able to split into relatively small matrix-matrix multiply operations. So like multiplying two 64 by 64 matrices, for example. And so if you know that ahead of time when you're designing your model, that probably gives you, you know, some information about how you set the state sizes, how you set the update, how you set the update function.[00:31:27] Dan Fu: So with ThunderKittens we basically built a whole library just around this basic idea that all your basic compute primitives should not be a float, but should be a matrix, and everything should just be matrix compute. And we've been using that to try to both re-implement some existing architectures, and also start to design some new ones that are really designed with this tensor core primitive in mind. Another thing that at least I'm excited about is, over the last four or five years, we've really been looking at language models as the next thing. But if you've been paying [00:32:00] attention to Twitter, there's been a bunch of new next generation models that are coming out.[00:32:04] Dan Fu: So there are video generation models that can run in real time, that are controlled by your mouse and your keyboard, that I'm told, if you play with them, you know, only have a few seconds of memory. Can we take that model, can we give it a very long context length so that you could actually maybe generate an entire game state at a time?[00:32:25] Dan Fu: What does that look like for the model? You're certainly not going to do a giant quadratic attention computation to try to run that. Maybe use some of these new models, or some of these new video generation models that came out. So Sora came out, I don't know, two days ago now.
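Dan's point that the basic compute primitive should be a small matrix rather than a scalar can be illustrated with a plain NumPy tiling sketch; this is only the intuition, not the ThunderKittens API, which is a CUDA/C++ library.

```python
# Intuition only: express a large matmul as many small 64x64 tile matmuls,
# mirroring the warp-group MMA shapes the hardware is built around.
import numpy as np

def tiled_matmul(A, B, tile=64):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2 and n % tile == 0 and k % tile == 0 and m % tile == 0
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # The unit of work is a small matrix-matrix multiply, not a scalar op.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.randn(128, 128)
B = np.random.randn(128, 128)
assert np.allclose(tiled_matmul(A, B), A @ B)
```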
But with super long queue times and super long generation times.[00:32:43] Dan Fu: So that's probably a quadratic attention operation at the bottom of it. What if we could remove that and get the same quality, but a lot faster generation time? Or some of the demos that we saw from Paige earlier today. You know, if I have a super long conversation with my [00:33:00] Gemini bot, what if I wanted to remember everything that it's seen in the last week?[00:33:06] Dan Fu: I mean, maybe you don't, for personal reasons, but what if I did, you know? What does that mean for the architecture? And I think, you know, that's certainly something I'm pretty excited about. I'm sure you're excited about it too. So, I think we were supposed to have some hot takes, but I honestly don't remember what our hot takes were.[00:33:21] Hot Takes - does anyone really need long context?[00:33:21] Eugene Cheah: Yeah, including the next slide. Hot takes, yes, these are our[00:33:25] Dan Fu: hot takes.[00:33:25] Eugene Cheah: I think the big one on Twitter that we saw, that we shared, was the question: is RAG relevant? In the case of, like, the future of, like, state based models?[00:33:38] Dan Fu: Let's see, I haven't played too much with RAG. But when I have, I'll say I found it was a little bit challenging to do research on, because we had this experience over and over again, where you could have an embedding model of any quality, so you could have a really, really bad embedding model, or you could have a really, really [00:34:00] good one, by any measure of good.[00:34:03] Dan Fu: And for the final RAG application, it kind of didn't matter. That's what I'll say about RAG while I'm being recorded. I know it doesn't actually answer the question, but[00:34:13] Eugene Cheah: Yeah, so I think a lot of folks are, like, extremely excited about the idea of RWKV or state space models potentially having infinite context.[00:34:21] Eugene Cheah: But I think the reality is that when we say infinite context, we just mean a different kind of infinite context, or, as previously covered, you need to test the model differently. So, think of it more along the lines of the human. Like, I don't remember what I ate for breakfast yesterday.[00:34:37] Eugene Cheah: Yeah, that's the statement that I'll say. And we humans are not quadratic transformers. If we were, if let's say we increased our brain size for every second we live, we would have exploded by the time we are 5 years old or something like that. And I think basically, fundamentally for us, regardless of whether it's RWKV, state space, xLSTM, [00:35:00] etc., our general idea is that instead of that expanding state, that increase in computational cost, what if we have a fixed state size?[00:35:08] Eugene Cheah: And information theory dictates that that fixed state size will have a limit. Just how big of a limit is a question. Like, RWKV is running at 40 megabytes for its state. Its future version might run into 400 megabytes. That is, like, millions of tokens if you're talking about, mathematically, the maximum possibility.[00:35:29] Eugene Cheah: It's just that I guess we were all more inefficient about it, so maybe we hit 100,000. And that's kind of like the work we are doing, trying to push it and maximize it. And that's where the models will start differing, because it will choose to forget things, it will choose to remember things.
And that's why I think that there might be some element of RAG, but it may not be the same RAG.[00:35:49] Eugene Cheah: It may be that the model learns things, and it's like, hmm, I can't remember that article, let me do a database search. Just like us humans, when we can't remember an article in the company, we do a search on Notion. [00:36:00][00:36:00] Dan Fu: I think something that would be really interesting is if you could have facts that are, so right now, the one intuition about language models is that all those parameters are there just to store random facts about the world.[00:36:14] Dan Fu: And this intuition comes from the observation that if you take a really small language model, it can do things like talk to you, or kind of has, like, the style of conversation, it can learn that, but where it will usually fall over compared to a much larger one is it'll just be a lot less factual about things that it knows or that it can do.[00:36:32] Dan Fu: But that points to all those weights that we're spending, all that SGD that we're spending to train these models, just being used to store facts. And we have things like databases that are pretty good at storing facts. So I think one thing that would be really interesting is if we could actually have some sort of outside data store that a language model can look at, that maybe, you know, has some sort of gradient descent in it. That would be quite interesting.[00:36:58] Dan Fu: And then maybe you could edit it, delete [00:37:00] facts, you know, change who's president, so that it doesn't get lost.[00:37:04] Vibhu: Can we open up Q&A and hot takes for the audience? I have a hot take Q&A. Do these scale? When will there be a 405B state space model? RAG exists, no one does long context, who's throwing in 2 million token questions? Hot takes?[00:37:24] Dan Fu: The question of who's throwing in 2 million token questions, I think, is a really good question. So actually, I was going to offer that as a hot take. I mean, my hot take was going to be that long context doesn't matter. I know I just gave a whole talk about it, but, you know, what's the point of doing research if you can't, you know, play both sides?[00:37:40] Dan Fu: But I think, so I think for both of us, the reason that we first got into this was just from the first-principles question of: there's this quadratic thing, clearly intelligence doesn't need to be quadratic, what is going on, can we understand it better? You know, since then it's kind of turned into a race, which has [00:38:00] been exciting to watch, like, how much context you can take in.[00:38:03] Dan Fu: But I think it's right. Nobody is actually putting a two million context prompt into these models. And, you know, if they are, maybe we can go, you know, design a better model to do that particular thing. Yeah, what do you think about that? So you've also been working on this. Do you think long context matters?[00:38:19] Eugene Cheah: So I'm going to burn a bit. How many of you remember the news of Google Gemini supporting 3 million context, right? Raise your hand.[00:38:28] Vibhu: Yeah, 2 million.[00:38:29] Eugene Cheah: Oh, it's 2 million.[00:38:31] Eugene Cheah: Yeah, how many of you actually tried that? See?[00:38:34] Vibhu: I use it a lot. You? You work for MindsTV.
I use it a lot.[00:38:41] Eugene Cheah: So, some people have used it, and I think that might be where my opinion starts to differ, because I think the big labs may have a bigger role in this. Because, like, even for RWKV, even when we train long context, the reason why I say VRAM is a problem is that when we need to backprop [00:39:00] against the states, we actually need to maintain the states in between the tokens across the token length.[00:39:05] Eugene Cheah: So that means we need to actually roll out the whole 1 million context if we are actually training at 1 million. Which is the same for transformers, actually, but it just means we don't magically reduce the VRAM consumption at training time. So that is one of the VRAM bottlenecks, and I'm neither OpenAI nor Google, so donate GPUs if you have too many of them.[00:39:27] Eugene Cheah: But then, putting it back to another paradigm, right, I think O1-style reasoning might actually be pushing that direction downwards. In my opinion, and this is my partial hot take: let's say you have a super big model, and let's say you have a 70B model that may take double the tokens but gets the same result.[00:39:51] Eugene Cheah: Strictly speaking, the 70B, and this is true for transformer or non-transformer, right, will take less resources than that 400B [00:40:00] model, even if it did double the amount of thinking. And if that's the case, and we are all still trying to figure this out, maybe the direction for us is really getting the sub-200B models to be as fast and efficient as possible.[00:40:11] Eugene Cheah: With a very efficient architecture, which some folks happen to be working on, to just reason it out over larger and larger context.[00:40:20] Question: Yeah. One thing I'm super interested in is models that can watch forever. Obviously you cannot train something on infinite context length. How are y'all thinking about that, where you run on a much longer context length than is possible to train on?[00:40:38] Dan Fu: Yeah, it's a great question. I think you guys probably had tweets along these lines, too. When we first started doing these things, because these are all recurrent models, in theory you could just run them forever. You could just run it forever, and at the very least it won't, like, error out or crash on you.[00:40:57] Dan Fu: There's another question of whether it can actually [00:41:00] use what it's seen in that infinite context. And I think there, one place where the research on architectures probably ran faster than the other research is actually the benchmarks for long context. So you turn it on forever, you want to do everything or watch everything. What is it that you actually wanted to do? Can we actually build some benchmarks for that? Then measure what's happening. And then ask the question: can the models do it? Is there something else that they need? Yeah, I think that if I were to turn back the clock to 2022, that's probably one of the things I would have done differently, which would have been to actually get some long context benchmarks out at the same time as we started pushing context length on all these models.[00:41:41] Eugene Cheah: I will also say: the use case. So like, I think we both agree that there's no infinite memory, and the model needs to be able to learn and decide.
I think what we have observed, and I think this also fits the state space models, is that one of the key advantages of this alternate attention mechanic that is not based on token position is that the model doesn't suddenly become crazy when you go past the [00:42:00] 8k training context length, or a million context length.[00:42:03] Eugene Cheah: It's actually still stable. It's still able to run, it's still able to rationalize. It just starts forgetting things. But some of these things are still there in latent memory. Some of these things are still somewhat there. That's the whole point of why reading twice works. Things like that. And one of the biggest pushes in this direction is that I think both state space and RWKV have separate papers by other researchers where they use this architecture for time series data.[00:42:26] Eugene Cheah: Weather modeling. So, you are not asking what was the weather five days ago. You're asking what's the weather tomorrow, based on the infinite length that we have, as long as this Earth and the computer keep running. And they found that it is, like, better than existing transformer or existing architectures at modeling this weather data.[00:42:47] Eugene Cheah: Controlled for the param size and stuff. I'm quite sure there are people with larger models. So there are things where, in this case, right, there are future applications, if your question is just what's next and not what's 10 years ago.[00:42:59] Dan Fu: Thanks so [00:43:00] much for having us. Get full access to Latent Space at www.latent.space/subscribe
Spleenlab is a highly specialized AI software company founded with the idea of redefining safety and AI. Since April 2018, the company has been primarily engaged in the development and distribution of safe machine learning algorithms for semi- and fully autonomous mobility, especially the flight of unmanned aerial vehicles (UAVs), helicopters and air taxis, as well as road vehicles and beyond. The groundbreaking fusion of different sensors, such as camera, laser and radar, by means of machine learning is the core business of the company. The resulting SLAM (Simultaneous Localization and Mapping) enables completely new applications and products for any kind of autonomous mobility.
Hey everyone, in this episode I interview Hugo Abonizio, LLM Engineer at Maritaca AI, the only Brazilian LLM company, which has just launched Sabia 3, an LLM focused on Portuguese with results between GPT-4o and Claude 3.5 Sonnet. In the episode we talk about Maritaca AI, Sabia 3 and LLMs in general. It was a great conversation and I recommend everyone listen and share it with friends! LSTM vs transformers: figure 7 of the OpenAI scaling laws paper: https://arxiv.org/pdf/2001.08361 The Bitter Lesson, on methods that use more data/compute beating methods formulated by humans: http://www.incompleteideas.net/IncIdeas/BitterLesson.html Video explaining ColBERT/PaliBERT: https://x.com/victorialslocum/status/1848689063315841039 Sabia 3 paper: https://arxiv.org/pdf/2410.12049 WhatsApp group link: https://chat.whatsapp.com/GNLhf8aCurbHQc9ayX5oCP Podcast Instagram: https://www.instagram.com/podcast.lifewithai Podcast LinkedIn: https://www.linkedin.com/company/life-with-ai
Jürgen Schmidhuber, the father of generative AI shares his groundbreaking work in deep learning and artificial intelligence. In this exclusive interview, he discusses the history of AI, some of his contributions to the field, and his vision for the future of intelligent machines. Schmidhuber offers unique insights into the exponential growth of technology and the potential impact of AI on humanity and the universe. YT version: https://youtu.be/DP454c1K_vQ MLST is sponsored by Brave: The Brave Search API covers over 20 billion webpages, built from scratch without Big Tech biases or the recent extortionate price hikes on search API access. Perfect for AI model training and retrieval augmentated generation. Try it now - get 2,000 free queries monthly at http://brave.com/api. TOC 00:00:00 Intro 00:03:38 Reasoning 00:13:09 Potential AI Breakthroughs Reducing Computation Needs 00:20:39 Memorization vs. Generalization in AI 00:25:19 Approach to the ARC Challenge 00:29:10 Perceptions of Chat GPT and AGI 00:58:45 Abstract Principles of Jurgen's Approach 01:04:17 Analogical Reasoning and Compression 01:05:48 Breakthroughs in 1991: the P, the G, and the T in ChatGPT and Generative AI 01:15:50 Use of LSTM in Language Models by Tech Giants 01:21:08 Neural Network Aspect Ratio Theory 01:26:53 Reinforcement Learning Without Explicit Teachers Refs: ★ "Annotated History of Modern AI and Deep Learning" (2022 survey by Schmidhuber): ★ Chain Rule For Backward Credit Assignment (Leibniz, 1676) ★ First Neural Net / Linear Regression / Shallow Learning (Gauss & Legendre, circa 1800) ★ First 20th Century Pioneer of Practical AI (Quevedo, 1914) ★ First Recurrent NN (RNN) Architecture (Lenz, Ising, 1920-1925) ★ AI Theory: Fundamental Limitations of Computation and Computation-Based AI (Gödel, 1931-34) ★ Unpublished ideas about evolving RNNs (Turing, 1948) ★ Multilayer Feedforward NN Without Deep Learning (Rosenblatt, 1958) ★ First Published Learning RNNs (Amari and others, ~1972) ★ First Deep Learning (Ivakhnenko & Lapa, 1965) ★ Deep Learning by Stochastic Gradient Descent (Amari, 1967-68) ★ ReLUs (Fukushima, 1969) ★ Backpropagation (Linnainmaa, 1970); precursor (Kelley, 1960) ★ Backpropagation for NNs (Werbos, 1982) ★ First Deep Convolutional NN (Fukushima, 1979); later combined with Backprop (Waibel 1987, Zhang 1988). ★ Metalearning or Learning to Learn (Schmidhuber, 1987) ★ Generative Adversarial Networks / Artificial Curiosity / NN Online Planners (Schmidhuber, Feb 1990; see the G in Generative AI and ChatGPT) ★ NNs Learn to Generate Subgoals and Work on Command (Schmidhuber, April 1990) ★ NNs Learn to Program NNs: Unnormalized Linear Transformer (Schmidhuber, March 1991; see the T in ChatGPT) ★ Deep Learning by Self-Supervised Pre-Training. 
Distilling NNs (Schmidhuber, April 1991; see the P in ChatGPT) ★ Experiments with Pre-Training; Analysis of Vanishing/Exploding Gradients, Roots of Long Short-Term Memory / Highway Nets / ResNets (Hochreiter, June 1991, further developed 1999-2015 with other students of Schmidhuber) ★ LSTM journal paper (1997, most cited AI paper of the 20th century) ★ xLSTM (Hochreiter, 2024) ★ Reinforcement Learning Prompt Engineer for Abstract Reasoning and Planning (Schmidhuber 2015) ★ Mindstorms in Natural Language-Based Societies of Mind (2023 paper by Schmidhuber's team) https://arxiv.org/abs/2305.17066 ★ Bremermann's physical limit of computation (1982) EXTERNAL LINKS CogX 2018 - Professor Juergen Schmidhuber https://www.youtube.com/watch?v=17shdT9-wuA Discovering Neural Nets with Low Kolmogorov Complexity and High Generalization Capability (Neural Networks, 1997) https://sferics.idsia.ch/pub/juergen/loconet.pdf The paradox at the heart of mathematics: Gödel's Incompleteness Theorem - Marcus du Sautoy https://www.youtube.com/watch?v=I4pQbo5MQOs (Refs truncated, full version on YT VD)
We still have something for the summer break! One prompt and the song is ready. This will change the music industry. Peter is a big Prince fan, so we came up with something for the summer. Kudos to Tristan Behrens and Suno.AI. Try it out. We were fascinated by it.
The latest installment of the Shueisha series, "Beyond art!? The astronomical photos taken by the James Webb Space Telescope are just too beautiful!", is now live! https://yomitai.jp/series/astro/18-sasaki/ My first book, "A Super Introduction: Your First AI & Data Science", has been published! The YouTube channel is also being updated! Black hole special playlist here! Playlist for semi-regular "super space businessman" Hiro here! Moon playlist here! Official Space Brothers collaboration playlist here! Listener mail corner here! Our other channel, "The Data Analyst Next Door", is here! Spotify / Apple Podcast Instagram (ryo_astro) Personal homepage here! Twitter (_ryo_astro) note Jingle by Moriguchi-san from Wakuwaku Radio Source: https://academic.oup.com/pasj/article-abstract/75/3/691/7147651
We have finally arrived at the model underlying ChatGPT. We will trace the transition from networks with memory (LSTM) to the transformer architecture used everywhere today. Key concepts: word embeddings and the attention mechanism. Topics: (00:04) Snowball Earth Hypothesis (00:55) Recap: Feedback (02:15) Short-term memory (04:30) LSTM (05:57) Token (07:40) Matrix (08:50) Word Embeddings (14:20) Fularsız Entellik has really lost it (16:50) Encoder Decoder (18:30) Attention Mechanism (19:50) Transformer Architecture (23:45) Parallelism (26:20) ChatGPT vs GPT (28:12) Patreon Thanks. Sources: Video lecture: MIT 6.S191 (2023): Recurrent Neural Networks, Transformers, and Attention. Video: Let's build GPT: from scratch, in code, spelled out. ------- Presented by Podbee ------- This podcast contains advertising for Amazon Prime Video ("Breathtaking entertainment from Amazon. The Boys Season 4, Fallout and Trust Me, all free with Prime. Prime Video membership required. Visit amazon.com.tr for terms and conditions.") and for Hiwell (free introductory sessions with Hiwell's clinical psychologists, and 10% off therapy sessions with code pod10). See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.
In this episode, we discuss the challenges faced by the health workforce in fragile settings such as conflict zones and areas hit by political and economic crises. Our guests share their first-hand experiences and insights on how health systems and workers strive to provide care under extreme conditions. This episode is part of the six-part mini-series "Stories of Resilience: Local Lives and Health Systems," brought to you by ReBUILD for Resilience.During this episode, there are repeated references to war and conflict with first-hand experiences of illness and trauma which you may find distressing.Chapters00:00 Welcome to Connecting Citizens to Science00:42 Introduction to the Health Workforce in Fragile Settings01:20 Insights from Dr. Nasher Al-Aghbari in Yemen01:46 Global Challenges with Dr. Jim Campbell02:32 Welcoming Remarks by Dr. Kim Ozano02:53 Dr. Wesam Mansour on Health Workers in Gaza04:08 Dr. Nasher's Experiences in Yemen07:39 The Role of WHO with Dr. Jim Campbell10:53 System vs. Individual Resilience12:46 International Support and Building Resilience15:18 Emergency Response vs. Long-Term Efforts17:21 Health Worker Migration Issues21:55 Key Advice for Strengthening Resilience24:48 Concluding Remarks and Call to ActionIn this episode:Dr Wesam Mansour, Post-Doctoral Research Associate, Liverpool School of Tropical Medicine, UK.Wesam is a physician, a Paediatric and Neonatology Specialist and a Fellow at the International Society for Quality in Healthcare (ISQua), with a Ph.D. in Health Policy and Management. At Liverpool School of Tropical Medicine, Wesam is a research associate at the Department of International Public Health. For the ReBUILD programme, she coordinates the work with ReBUILD's country partners, particularly those in Lebanon.Dr Nasher Al-Aghbari - Head of Paediatric Department, Al-Thawra General Modern Teaching Hospital, Sana'a, Yemen.Nasher Al-Aghbari is a paediatric consultant in the Paediatric Department at Al-Thawra General Teaching Modern Hospital in Sana'a, Yemen. He is the Head of the Paediatric Emergency Department. He is also a member of the Teaching Panel in the Arab Board Membership. He has undertaken research for the past 15 years. As part of his Masters degree and PhD at LSTM, Dr. Al-Aghbari worked on blood diseases in children in Africa and Yemen and pulmonary childhood tuberculosis in Yemen. Mr Jim Campbell - Director of the Health Workforce Department, World Health Organization, Geneva.Jim Campbell is the Director of the Health Workforce Department at the World Health Organisation. He oversees the development and implementation of global public goods, evidence and tools to inform investments in the education, employment and retention of the health and care workforce in pursuit of global health security, universal health coverage and the Sustainable Development Goals. Useful links:Research on gender and close-to-community providers of health care – providing policy guidance in a pandemic - Rebuild ConsortiumRebuild ConsortiumHow to attract and retain health workers in rural areas of fragile states? Findings from a labour market survey in Guinea - Rebuild ConsortiumRebuild Consortium“Posting policies don't change because there is peace or war” - Rebuild ConsortiumRebuild Consortium
In this episode of the IoT For All Podcast, Serhii Skoromets, AI Lead and Consultant at ElifTech, joins Ryan Chacon to discuss AIoT, LLMs and generative AI, and the use of drones in warfare. The conversation also covers the application of AI in healthcare and agriculture, smart assistants, digital twins, AI security challenges, digital transformation, and human trust in technology. Serhii Skoromets has extensive experience in the field of AI and has been experimenting with algorithms and developing AI models since the early days of AI awareness. His expertise lies in translating cutting-edge AI advancements into tangible business value. He applies a wide range of technologies such as LLMs, LangChain, LSTM, NN, NLP, and NLU to targeted use cases across various industries including fintech, logistics, wealth management, ecommerce, and edtech. ElifTech is a trusted global technology services company that designs, builds, and implements effective software solutions to supply the needs of businesses in AI, IoT, fintech, ecommerce, healthcare, logistics, edtech, and other industries. They're a team that transforms business ideas into valuable high-tech products that connect people, systems, devices, and data. Going beyond just being another service provider, they accompany clients on their journeys from startups to thriving enterprises. Discover more about AI and IoT at https://www.iotforall.com More about ElifTech: https://www.eliftech.com Connect with Serhii: https://www.linkedin.com/in/serhii-skoromets/ Our sponsor: https://www.com4.no/ Free IoT Test Kit: https://www.com4.no/en/free-iot-test-kit (00:00) Sponsor (00:23) Intro (00:32) Serhii Skoromets and ElifTech (01:43) The role of LLMs and generative AI (02:57) Consumer trust in AI (04:49) Smart assistants (06:43) AI and IoT: digital twins and smart homes (09:29) Smart devices, IoT data, and security (10:39) Drones: AI and IoT in warfare (14:09) AI and IoT in agriculture (16:18) Identifying real AI solutions (17:23) Human trust in technology (18:46) Digital transformation and conclusion SUBSCRIBE TO THE CHANNEL: https://bit.ly/2NlcEwm Join Our Newsletter: https://www.iotforall.com/iot-newsletter Follow Us on Social: https://linktr.ee/iot4all Check out the IoT For All Media Network: https://www.iotforall.com/podcast-overview
In his diploma thesis at the beginning of the nineties, Sepp Hochreiter conceived a small AI revolution. His learning algorithm hid behind the acronym LSTM, which stands for Long Short-Term Memory. It gave AI something like a usable memory - since then, computers have been able to handle sequences much better. When he spread and further developed the idea together with Jürgen Schmidhuber, at first not even experts recognized the power it held. Ultimately, LSTM lifted language processing to a new level. Every tech company from Google to Microsoft used it, and by 2017 it was, so to speak, in every smartphone. Then researchers at Google published a paper introducing the foundation for the large language models that are so successful today and that became famous, and a mass phenomenon, through ChatGPT. But these AI systems also have their limits and fail at tasks that are easy for the brain. Now Hochreiter, who today teaches and does research in Linz, Austria, has come up with the next potentially groundbreaking idea of his own: he recently presented how he has further developed LSTM - under the acronym xLSTM, which stands for Extended Long Short-Term Memory. He claims to have reduced the previous weaknesses while improving the advantages over the currently popular transformer models. How does it work? What is behind it? How far along is his company NXAI? We talk about all of this in this episode.
Generative AI in 2016? Get an early look at how the AI/ML industry was back when data science was evolving into what we see today! Future directions in Generative #AI. This episode features Martin Musiol, founder of GenerativeAI.net, as he discusses the evolution and impact of generative AI on various industries. The conversation spans Martin's career journey, from his early days in data science to the founding of GenerativeAI.net and its mission to educate and implement generative AI solutions. Key topics include the differences between traditional AI/ML and generative AI, practical applications, challenges and opportunities in the field, and the future of artificial general intelligence (AGI). GenerativeAI.net was founded in 2018 and focuses on educating and consulting on generative AI technologies. The company offers online courses, a popular newsletter, and consultancy services to help businesses implement AI solutions effectively. Technology and Technology Partners Mentioned: GenerativeAI.net, GANs, Transformer Models, IBM, Infosys Consulting, AI, ML, Data Science, NLP, LSTM, BERT Model, OpenAI, GPT, AGI, Cognitive Services, AI Consultancy, Fleeky AI, RAG System ☑️ Web: https://www.generativeai.net ☑️ Support the Channel: https://ko-fi.com/gtwgt ☑️ Be on #GTwGT: Contact via Twitter @GTwGTPodcast or visit https://www.gtwgt.com ☑️ Subscribe to YouTube: https://www.youtube.com/@GTwGTPodcast?sub_confirmation=1 ☑️ Subscribe to Spotify: https://open.spotify.com/show/5Y1Fgl4DgGpFd5Z4dHulVX • https://gtwgt.com Web • https://twitter.com/GTwGTPodcast Twitter • https://podcasts.apple.com/us/podcast/id1519439787?mt=2&ls=1 Apple Podcasts ☑️ Music: https://www.bensound.com
In this episode we talk about time series, and in the main part Prof. Dr. Marco Huber from IPA and Marc Zöller from GFT explain how their Timeseries AutoML tool works. In the news part we offer an overview of approaches, from statistical (ARIMA) through ML (Random Forest, gradient boosting, ...) and neural networks (LSTM) to AutoML and, lately, Transformers. Thanks for listening. We welcome suggestions for topics, criticism and a few stars on Apple, Spotify and Co. We thank our partner **SIEMENS** https://www.siemens.de/de/ xLSTM Github ([more](https://lnkd.in/eG3HWJrs)) auto-sktime ([more](https://github.com/Ennosigaeon/auto-sktime)) Our guest [Marco Huber ](https://www.linkedin.com/in/marco-huber-78a1a151/) Our guest [Marc Zöller ](https://www.linkedin.com/in/marc-zoeller/) #machinelearning #ai #aimodel #industrialautomation #manufacturing #automation #genai #datascience #mlops #llm #IndustrialAI #artificialintelligence #sklearn
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Visualizing neural network planning, published by Nevan Wichers on May 9, 2024 on The AI Alignment Forum. TLDR We develop a technique to try and detect if a NN is doing planning internally. We apply the decoder to the intermediate representations of the network to see if it's representing the states it's planning through internally. We successfully reveal intermediate states in a simple Game of Life model, but find no evidence of planning in an AlphaZero chess model. We think the idea won't work in its current state for real world NNs because they use higher-level, abstract representations for planning that our current technique cannot decode. Please comment if you have ideas that may work for detecting more abstract ways the NN could be planning. Idea and motivation To make safe ML, it's important to know if the network is performing mesa optimization, and if so, what optimization process it's using. In this post, I'll focus on a particular form of mesa optimization: internal planning. This involves the model searching through possible future states and selecting the ones that best satisfy an internal goal. If the network is doing internal planning, then it's important the goal it's planning for is aligned with human values. An interpretability technique which could identify what states it's searching through would be very useful for safety. If the NN is doing planning it might represent the states it's considering in that plan. For example, if predicting the next move in chess, it may represent possible moves it's considering in its hidden representations. We assume that NN is given the representation of the environment as input and that the first layer of the NN encodes the information into a hidden representation. Then the network has hidden layers and finally a decoder to compute the final output. The encoder and decoder are trained as an autoencoder, so the decoder can reconstruct the environment state from the encoder output. Language models are an example of this where the encoder is the embedding lookup. Our hypothesis is that the NN may use the same representation format for states it's considering in its plan as it does for the encoder's output. Our idea is to apply the decoder to the hidden representations at different layers to decode them. If our hypothesis is correct, this will recover the states it considers in its plan. This is similar to the Logit Lens for LLMs, but we're applying it here to investigate mesa-optimization. A potential pitfall is that the NN uses a slightly different representation for the states it considers during planning than for the encoder output. In this case, the decoder won't be able to reconstruct the environment state it's considering very well. To overcome this, we train the decoder to output realistic looking environment states given the hidden representations by training it like the generator in a GAN. Note that the decoder isn't trained on ground truth environment states, because we don't know which states the NN is considering in its plan. Game of Life proof of concept (code) We consider an NN trained to predict the number of living cells after the Nth time step of the Game of Life (GoL). We chose the GoL because it has simple rules, and the NN will probably have to predict the intermediate states to get the final cell count. 
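A minimal sketch of the probing idea the post describes, assuming a model that exposes its encoder, hidden layers, and decoder (the module names here are hypothetical): the trained decoder is simply applied to every intermediate representation to see whether it reconstructs recognizable environment states.

```python
# Hypothetical sketch: decode intermediate hidden states to look for planned states.
import torch

def decode_intermediate_states(encoder, hidden_layers, decoder, observation):
    # If the network internally represents states it is "planning through" in the
    # same format the encoder produces, the decoded outputs should resemble
    # plausible environment states rather than noise.
    decoded = []
    h = encoder(observation)
    for layer in hidden_layers:
        h = layer(h)
        with torch.no_grad():
            decoded.append(decoder(h))  # the same decoder the autoencoder was trained with
    return decoded

# Toy usage with stand-in modules.
enc = torch.nn.Linear(10, 32)
layers = [torch.nn.Linear(32, 32) for _ in range(3)]
dec = torch.nn.Linear(32, 10)
states = decode_intermediate_states(enc, layers, dec, torch.randn(1, 10))
print([s.shape for s in states])
```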
This NN won't do planning, but it may represent the intermediate states of the GoL in its hidden states. We use an LSTM architecture with an encoder to encode the initial GoL state, and a "count cells NN" to output the number of living cells after the final LSTM output. Note that training the NN to predict the number of alive cells at the final state makes this more difficult for our method than training the network to predict the final state since it's less obvious that the network will predict t...
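For reference, the Game of Life update rule that such a model would have to learn implicitly is small enough to write down directly; this NumPy sketch is ours, not the post's code.

```python
# Standard Game of Life step, shown only to illustrate what intermediate
# states the GoL model would have to represent internally.
import numpy as np

def gol_step(grid):
    # Count the 8 neighbours of each cell with wrap-around (toroidal) boundaries.
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell is alive next step if it has 3 neighbours, or 2 neighbours and is already alive.
    return ((neighbours == 3) | ((neighbours == 2) & (grid == 1))).astype(np.int8)

grid = (np.random.rand(16, 16) < 0.3).astype(np.int8)
for _ in range(4):  # N time steps
    grid = gol_step(grid)
print("alive cells after 4 steps:", int(grid.sum()))
```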
Hochreiter's xLSTM paper was eagerly awaited. Now it is here. Quote: "We have enhanced LSTM to xLSTM by exponential gating with memory mixing and a new memory structure. xLSTM models perform favorably on language modeling when compared to state-of-the-art methods like Transformers and State Space Models. The scaling laws indicate that larger xLSTM models will be serious competitors to current Large Language Models that are built with the Transformer technology. xLSTM has the potential to considerably impact other deep learning fields like Reinforcement Learning, Time Series Prediction, or the modeling of physical systems." Industrial applications in particular are eyeing xLSTM. The next step is the commercial distribution of the architecture. We met Timo Gessmann, CTO of Schunk, at the Hannover Messe. He uses a GenAI approach to accelerate labeling. The Industrial AI Podcast reports weekly on the latest developments in AI and machine learning for the engineering, robotics, automotive, process and automation industries. The podcast features industrial users, scientists, vendors and startups in the field of Industrial AI and machine learning. The podcast is hosted by Peter Seeberg, Industrial AI consultant and Robert Weber, tech journalist. Their mission: Demystify Industrial AI and machine learning, inspire industrial users. The hosts: Peter Seeberg is an Industrial AI and machine learning expert for the manufacturing industry. He worked over 25 years in IT (Intel) and 10 years in Automation. He co-initiated the Industrial Data Intelligence Startup (Softing) where he was responsible for managing machine learning projects in industrial environments. Today he advises companies when it comes to Industrial AI and machine learning. Together with Robert Weber, journalist for industrial topics, he discusses AI and ML applications, standards, and education topics, make or buy decisions as well as regulation for AI in manufacturing.
Did you know that the concept of deep learning goes way back to the 1950s? However, it is only in recent years that this technology has created a tremendous amount of buzz (and for good reason!). A subset of machine learning, deep learning is inspired by the structure of the human brain, making it fascinating to learn about. In this episode, Lois Houston and Nikita Abraham interview Senior Principal OCI Instructor Hemant Gahankari about deep learning concepts, including how Convolution Neural Networks work, and help you get your deep learning basics right. Oracle MyLearn: https://mylearn.oracle.com/ou/learning-path/become-an-oci-ai-foundations-associate-2023/127177 Oracle University Learning Community: https://education.oracle.com/ou-community LinkedIn: https://www.linkedin.com/showcase/oracle-university/ X (formerly Twitter): https://twitter.com/Oracle_Edu Special thanks to Arijit Ghosh, David Wright, Himanshu Raj, and the OU Studio Team for helping us create this episode. -------------------------------------------------------- Episode Transcript: 00:00 The world of artificial intelligence is vast and everchanging. And with all the buzz around it lately, we figured it was the perfect time to revisit our AI Made Easy series. Join us over the next few weeks as we chat about all things AI, helping you to discover its endless possibilities. Ready to dive in? Let's go! 00:33 Welcome to the Oracle University Podcast, the first stop on your cloud journey. During this series of informative podcasts, we'll bring you foundational training on the most popular Oracle technologies. Let's get started! 00:47 Lois: Hello and welcome to the Oracle University Podcast. I'm Lois Houston, Director of Innovation Programs with Oracle University, and with me is Nikita Abraham, Principal Technical Editor. Nikita: Hi everyone! Lois: Today, we're going to focus on the basics of deep learning with our Senior Principal OCI Instructor, Hemant Gahankari. Nikita: Hi Hemant! Thanks for being with us today. So, to get started, what is deep learning? 01:14 Hemant: Hi Niki and hi Lois. So, deep learning is a subset of machine learning that focuses on training Artificial Neural Networks, abbreviated as ANN, to solve a task at hand. Say, for example, image classification. A very important quality of the ANN is that it can process raw data like pixels of an image and extract patterns from it. These patterns are treated as features to predict the outcomes. Let us say we have a set of handwritten images of digits 0 to 9. As we know, everyone writes the digits in a slightly different way. So how do we train a machine to identify a handwritten digit? For this, we use ANN. ANN accepts image pixels as inputs, extracts patterns like edges and curves and so on, and correlates these patterns to predict an outcome. That is what digit does the image has in this case. 02:17 Lois: Ok, so what you're saying is given a bunch of pixels, ANN is able to process pixel data, learn an internal representation of the data, and predict outcomes. That's so cool! So, why do we need deep learning? Hemant: We need to specify features while we train machine learning algorithm. With deep learning, features are automatically extracted from the data. Internal representation of features and their combinations is built to predict outcomes by deep learning algorithms. This may not be feasible manually. Deep learning algorithms can make use of parallel computations. For this, usually data is split into small batches and processed parallelly. 
So these algorithms can process large amount of data in a short time to learn the features and their combinations. This leads to scalability and performance. In short, deep learning complements machine learning algorithms for complex data for which features cannot be described easily. 03:21 Nikita: What can you tell us about the origins of deep learning? Hemant: Some of the deep learning concepts like artificial neuron, perceptron, and multilayer perceptron existed as early as 1950s. One of the most important concept of using backpropagation for training ANN came in 1980s. In 1990s, convolutional neural networks were also introduced for image analysis tasks. Starting 2000, GPUs were introduced. And 2010 onwards, GPUs became cheaper and widely available. This fueled the widespread adoption of deep learning uses like computer vision, natural language processing, speech recognition, text translation, and so on. In 2012, major networks like AlexNet and Deep-Q Network were built. 2016 onward, generative use cases of the deep learning also started to come up. Today, we have widely adopted deep learning for a variety of use cases, including large language models and many other types of generative models. 04:32 Lois: Hemant, what are various applications of deep learning algorithms? Hemant: Deep learning algorithms are targeted at a variety of data and applications. For data, we have images, videos, text, and audio. For images, applications can be image classification, object detection, and segmentation. For textual data, applications are to translate the text or detect a sentiment of a text. For audio, the applications can be music generation, speech to text, and so on. 05:05 Lois: It's important that we select the right deep learning algorithm based on the data and application, right? So how do we do that? Hemant: For image tasks like image classification, object detection, image segmentation, or facial recognition, CNN is a suitable architecture. For text, we have a choice of the latest transformers or LSTM or even RNN. For generative tasks like text summarization or question answering, transformers is a good choice. For generating images, text to image generation, transformers, GANs, or diffusion models are available choices. 05:45 Nikita: Let's dive a little deeper into Artificial Neural Networks. Can you tell us more about them, Hemant? Hemant: Artificial Neural Networks are inspired by the human brain. They are made up of interconnected nodes called as neurons. Nikita: And how are inputs processed by a neuron? Hemant: In ANN, we assign weights to the connection between neurons. Weighted inputs are added up. And if the sum crosses a specified threshold, the neuron is fired. And the outputs of a layer of neuron become an input to another layer. 06:16 Lois: Hemant, tell us about the building blocks of ANN so we understand this better. Hemant: So first building block is layers. We have input layer, output layer, and multiple hidden layers. The input layer and output layer are mandatory. And the hidden layers are optional. The layers consist of neurons. Neurons are computational units, which accept an input and produce an output. Weights determine the strength of connection between neurons. So the connections could be between input and a neuron, or it could be between a neuron and another neuron. Activation functions work on the weighted sum of inputs to the neuron and produce an output. Additional input to the neuron that allows a certain degree of flexibility is called as a bias. 
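As a minimal illustration of the weighted sum, bias, and activation Hemant describes (a sketch for intuition, not part of the course material):

```python
# One artificial neuron: weighted sum of inputs, plus bias, through an activation.
import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(weights, inputs) + bias   # weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation outputs a value between 0 and 1

x = np.array([0.5, 0.1, 0.9])            # inputs from the previous layer
w = np.array([0.4, -0.2, 0.7])           # connection weights
print(neuron(x, w, bias=0.1))            # output passed on to the next layer
```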
07:05 Nikita: I think we've got the components of ANN straight but maybe you should give us an example. You mentioned this example earlier…of needing to train ANN to recognize handwritten digits from images. How would we go about that? Hemant: For that, we have to collect a large number of digit images, and we need to train ANN using these images. So, in this case, the images consist of 28 by 28 pixels, which act as input layer. For the output, we have 10 neurons which represent digits 0 to 9. And we have multiple hidden layers. So, for example, we have two hidden layers which are consisting of 16 neurons each. The hidden layers are responsible for capturing the internal representation of the raw images. And the output layer is responsible for producing the desired outcomes. So, in this case, the desired outcome is the prediction of whether the digit is 0 or 1 or up to digit 9. So how do we train this particular ANN? So the first thing we use is the backpropagation algorithm. During training, we show an image to the ANN. Let's say it is an image of digit 2. So we expect output neuron for digit 2 to fire. But in real, let's say output neuron of a digit 6 fired. 08:28 Lois: So, then, what do we do? Hemant: We know that there is an error. So we correct an error. We adjust the weights of the connection between neurons based on a calculation, which we call as backpropagation algorithm. By showing thousands of images and adjusting the weights iteratively, ANN is able to predict correct outcomes for most of the input images. This process of adjusting weights through backpropagation is called as model training. 09:01 Do you have an idea for a new course or learning opportunity? We'd love to hear it! Visit the Oracle University Learning Community and share your thoughts with us on the Idea Incubator. Your suggestion could find a place in future development projects! Visit mylearn.oracle.com to get started. 09:22 Nikita: Welcome back! Let's move on to CNN. Hemant, what is a Convolutional Neural Network? Hemant: CNN is a type of deep learning model specifically designed for processing and analyzing grid-like data, such as images and videos. In the ANN, the input image is converted to a single dimensional array and given as an input to the network. But that does not work well with the image data because image data is inherently two dimensional. CNN works better with the two dimensional data. The role of the CNN is to reduce the images into a form, which is easier to process and without losing features, which are critical for getting a good prediction. 10:10 Lois: A CNN has different layers, right? Could you tell us a bit about them? Hemant: The first one is input layer. Input layer is followed by feature extraction layers, which is a combination and repetition of convolutional layer with ReLu activation and a pooling layer. And this is followed by a classification layer. These are the fully connected output layers, where the classification occurs as output classes. The class with the highest probability is the predicted class. And finally, we have the dropout layer. This layer is a regularization technique used to prevent overfitting in the network. 10:51 Nikita: And what are the top applications of CNN? Hemant: One of the most widely used applications of CNNs is image classification. For example, classifying whether an image contains a specific object, say cat or a dog. CNNs are also used for object detection tasks. The goal here is to draw bounding boxes around objects in an image. 
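Putting this exchange together, here is an illustrative PyTorch sketch in the spirit of the 28-by-28 digit example and the CNN layer stack just described (convolution with ReLU, pooling, dropout, and a fully connected classification layer), with one backpropagation step; it is a sketch, not the course's code.

```python
# Small CNN for 28x28 digit images: feature extraction (conv + ReLU + pooling),
# dropout for regularization, and a fully connected classification layer.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),  # feature extraction
    nn.MaxPool2d(2),                                         # 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                         # 14x14 -> 7x7
    nn.Flatten(),
    nn.Dropout(0.25),                                        # regularization
    nn.Linear(32 * 7 * 7, 10),                               # 10 output classes (digits 0-9)
)

# One training step with backpropagation: compare the prediction to the true digit,
# then adjust the weights in proportion to their contribution to the error.
images = torch.randn(8, 1, 28, 28)       # a toy batch standing in for digit images
labels = torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()                           # backpropagation computes the weight gradients
torch.optim.SGD(model.parameters(), lr=0.1).step()
```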
CNNs can perform pixel-level segmentation, where each pixel in the image is labeled to represent different objects or regions. CNNs are employed for face recognition tasks as well, identifying and verifying individuals based on facial features. CNNs are widely used in medical image analysis, helping with tasks like tumor detection, diagnosis, and classification of various medical conditions. CNNs play an important role in the development of self-driving cars, helping them to recognize and understand the road traffic signs, pedestrians, and other vehicles. 12:02 Nikita: Hemant, let's talk about sequence models. What are they and what are they used for? Hemant: Sequence models are used to solve problems, where the input data is in the form of sequences. The sequences are ordered lists of data points or events. The goal in sequence models is to find patterns and dependencies within the data and make predictions, classifications, or even generate new sequences. 12:31 Lois: Can you give us some examples of sequence models? Hemant: Some common examples of the sequence models are in natural language processing, deep learning models are used for tasks, such as machine translation, sentiment analysis, or text generation. In speech recognition, deep learning models are used to convert a recorded audio into text. Deep learning models can generate new music or create original compositions. Even sequences of hand gestures are interpreted by deep learning models for applications like sign language recognition. In fields like finance or weather prediction, time series data is used to predict future values. 13:15 Nikita: Which deep learning models can be used to work with sequence data? Hemant: Recurrent Neural Networks, abbreviated as RNNs, are a class of neural network architectures specifically designed to handle sequential data. Unlike traditional feedforward neural network, RNNs have a feedback loop that allows information to persist across different timesteps. The key features of RNNs is their ability to maintain an internal state often referred to as a hidden state or memory, which is updated as the network processes each element in the input sequence. The hidden state is used as input to the network for the next time step, allowing the model to capture dependencies and patterns in the data that are spread across time. 14:07 Nikita: Are there various types of RNNs? Hemant: There are different types of RNN architectures based on application. One of them is one to one. This is like feed forward neural network and is not suited for sequential data. A one to many model produces multiple output values for one input value. Music generation or sequence generation are some applications using this architecture. A many to one model produces one output value after receiving multiple input values. Example is sentiments analysis based on the review. Many to many model produces multiple output values for multiple input values. Examples are machine translation and named entity recognition. RNN does not perform well when it comes to capturing long-term dependencies. This is due to the vanishing gradients problem, which is overcome by using LSTM model. 15:11 Lois: Another acronym. What is LSTM, Hemant? Hemant: Long Short-Term memory, abbreviated as LSTM, works by using a specialized memory cell and a gating mechanism to capture long term dependencies in the sequential data. 
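Before the LSTM details, the plain recurrent update Hemant described just above (a hidden state carried forward and updated at every timestep) can be written in a few lines; this is an illustrative sketch only.

```python
# A plain RNN cell: the hidden state is updated at each timestep and fed back in,
# which is how dependencies across time are captured (and where vanishing
# gradients eventually become a problem for long sequences).
import numpy as np

def rnn(sequence, W_x, W_h, b):
    h = np.zeros(W_h.shape[0])                # initial hidden state ("memory")
    for x_t in sequence:                      # process elements in order
        h = np.tanh(W_x @ x_t + W_h @ h + b)  # new state depends on input AND old state
    return h                                  # many-to-one: one output after the sequence

seq = [np.random.randn(4) for _ in range(10)]  # e.g. 10 timesteps of 4 features
h_final = rnn(seq, W_x=np.random.randn(8, 4), W_h=np.random.randn(8, 8), b=np.zeros(8))
print(h_final.shape)  # (8,)
```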
The key idea behind LSTM is to selectively remember or forget information over time, enabling the model to maintain relevant information over long sequences, which helps overcome the vanishing gradients problem. 15:45 Nikita: Can you take us, step-by-step, through the working of LSTM? Hemant: At each timestep, the LSTM takes an input vector representing the current data point in the sequence. The LSTM also receives the previous hidden state and cell state. These represent what the LSTM has remembered and forgotten up to the current point in the sequence. The core of the LSTM lies in its gating mechanisms, which include three gates: the input gate, the forget gate, and the output gate. These gates are like the filters that control the flow of information within the LSTM cell. The input gate decides what new information from the current input should be added to the memory cell. The forget gate determines what information in the current memory cell should be discarded or forgotten. The output gate regulates how much of the current memory cell should be exposed as the output of the current time step. 16:52 Lois: Thank you, Hemant, for joining us in this episode of the Oracle University Podcast. I learned so much today. If you want to learn more about deep learning, visit mylearn.oracle.com and search for the Oracle Cloud Infrastructure AI Foundations course. And remember, the AI Foundations course and certification are free. So why not get started now? Nikita: Right, Lois. In our next episode, we will discuss generative AI and language learning models. Until then, this is Nikita Abraham… Lois: And Lois Houston signing off! 17:26 That's all for this episode of the Oracle University Podcast. If you enjoyed listening, please click Subscribe to get all the latest episodes. We'd also love it if you would take a moment to rate and review us on your podcast app. See you again on the next episode of the Oracle University Podcast.
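To make the three gates Hemant walks through concrete, here is a from-scratch sketch of a single LSTM step (illustrative only; real implementations fuse these operations for speed):

```python
# One LSTM timestep: input, forget, and output gates controlling a memory cell.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold parameters for the i (input), f (forget), o (output) gates
    # and the g (candidate) update, keyed by name.
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # what new info to add
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # what to forget
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # how much cell state to expose
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate memory content
    c = f * c_prev + i * g                                 # updated cell state
    h = o * np.tanh(c)                                     # updated hidden state
    return h, c

dim_x, dim_h = 4, 8
W = {k: np.random.randn(dim_h, dim_x) for k in "ifog"}
U = {k: np.random.randn(dim_h, dim_h) for k in "ifog"}
b = {k: np.zeros(dim_h) for k in "ifog"}
h, c = lstm_step(np.random.randn(dim_x), np.zeros(dim_h), np.zeros(dim_h), W, U, b)
print(h.shape, c.shape)
```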
Our next 2 big events are AI UX and the World's Fair. Join and apply to speak/sponsor!Due to timing issues we didn't have an interview episode to share with you this week, but not to worry, we have more than enough “weekend special” content in the backlog for you to get your Latent Space fix, whether you like thinking about the big picture, or learning more about the pod behind the scenes, or talking Groq and GPUs, or AI Leadership, or Personal AI. Enjoy!AI BreakdownThe indefatigable NLW had us back on his show for an update on the Four Wars, covering Sora, Suno, and the reshaped GPT-4 Class Landscape:and a longer segment on AI Engineering trends covering the future LLM landscape (Llama 3, GPT-5, Gemini 2, Claude 4), Open Source Models (Mistral, Grok), Apple and Meta's AI strategy, new chips (Groq, MatX) and the general movement from baby AGIs to vertical Agents:Thursday Nights in AIWe're also including swyx's interview with Josh Albrecht and Ali Rohde to reintroduce swyx and Latent Space to a general audience, and engage in some spicy Q&A:Dylan Patel on GroqWe hosted a private event with Dylan Patel of SemiAnalysis (our last pod here):Not all of it could be released so we just talked about our Groq estimates:Milind Naphade - Capital OneIn relation to conversations at NeurIPS and Nvidia GTC and upcoming at World's Fair, we also enjoyed chatting with Milind Naphade about his AI Leadership work at IBM, Cisco, Nvidia, and now leading the AI Foundations org at Capital One. We covered:* Milind's learnings from ~25 years in machine learning * His first paper citation was 24 years ago* Lessons from working with Jensen Huang for 6 years and being CTO of Metropolis * Thoughts on relevant AI research* GTC takeaways and what makes NVIDIA specialIf you'd like to work on building solutions rather than platform (as Milind put it), his Applied AI Research team at Capital One is hiring, which falls under the Capital One Tech team.Personal AI MeetupIt all started with a meme:Within days of each other, BEE, FRIEND, EmilyAI, Compass, Nox and LangFriend were all launching personal AI wearables and assistants. So we decided to put together a the world's first Personal AI meetup featuring creators and enthusiasts of wearables. The full video is live now, with full show notes within.Timestamps* [00:01:13] AI Breakdown Part 1* [00:02:20] Four Wars* [00:13:45] Sora* [00:15:12] Suno* [00:16:34] The GPT-4 Class Landscape* [00:17:03] Data War: Reddit x Google* [00:21:53] Gemini 1.5 vs Claude 3* [00:26:58] AI Breakdown Part 2* [00:27:33] Next Frontiers: Llama 3, GPT-5, Gemini 2, Claude 4* [00:31:11] Open Source Models - Mistral, Grok* [00:34:13] Apple MM1* [00:37:33] Meta's $800b AI rebrand* [00:39:20] AI Engineer landscape - from baby AGIs to vertical Agents* [00:47:28] Adept episode - Screen Multimodality* [00:48:54] Top Model Research from January Recap* [00:53:08] AI Wearables* [00:57:26] Groq vs Nvidia month - GPU Chip War* [01:00:31] Disagreements* [01:02:08] Summer 2024 Predictions* [01:04:18] Thursday Nights in AI - swyx* [01:33:34] Dylan Patel - Semianalysis + Latent Space Live Show* [01:34:58] GroqTranscript[00:00:00] swyx: Welcome to the Latent Space Podcast Weekend Edition. This is Charlie, your AI co host. Swyx and Alessio are off for the week, making more great content. We have exciting interviews coming up with Elicit, Chroma, Instructor, and our upcoming series on NSFW, Not Safe for Work AI. 
In today's episode, we're collating some of Swyx and Alessio's recent appearances, all in one place for you to find.[00:00:32] swyx: In part one, we have our first crossover pod of the year. In our listener survey, several folks asked for more thoughts from our two hosts. In 2023, Swyx and Alessio did crossover interviews with other great podcasts like the AI Breakdown, Practical AI, Cognitive Revolution, Thursday Eye, and Chinatalk, all of which you can find in the Latentspace About page.[00:00:56] swyx: NLW of the AI Breakdown asked us back to do a special on the 4Wars framework and the AI engineer scene. We love AI Breakdown as one of the best examples Daily podcasts to keep up on AI news, so we were especially excited to be back on Watch out and take[00:01:12] NLW: care[00:01:13] AI Breakdown Part 1[00:01:13] NLW: today on the AI breakdown. Part one of my conversation with Alessio and Swix from Latent Space.[00:01:19] NLW: All right, fellas, welcome back to the AI Breakdown. How are you doing? I'm good. Very good. With the last, the last time we did this show, we were like, oh yeah, let's do check ins like monthly about all the things that are going on and then. Of course, six months later, and, you know, the, the, the world has changed in a thousand ways.[00:01:36] NLW: It's just, it's too busy to even, to even think about podcasting sometimes. But I, I'm super excited to, to be chatting with you again. I think there's, there's a lot to, to catch up on, just to tap in, I think in the, you know, in the beginning of 2024. And, and so, you know, we're gonna talk today about just kind of a, a, a broad sense of where things are in some of the key battles in the AI space.[00:01:55] NLW: And then the, you know, one of the big things that I, that I'm really excited to have you guys on here for us to talk about where, sort of what patterns you're seeing and what people are actually trying to build, you know, where, where developers are spending their, their time and energy and, and, and any sort of, you know, trend trends there, but maybe let's start I guess by checking in on a framework that you guys actually introduced, which I've loved and I've cribbed a couple of times now, which is this sort of four wars of the, of the AI stack.[00:02:20] Four Wars[00:02:20] NLW: Because first, since I have you here, I'd love, I'd love to hear sort of like where that started gelling. And then and then maybe we can get into, I think a couple of them that are you know, particularly interesting, you know, in the, in light of[00:02:30] swyx: some recent news. Yeah, so maybe I'll take this one. So the four wars is a framework that I came up around trying to recap all of 2023.[00:02:38] swyx: I tried to write sort of monthly recap pieces. And I was trying to figure out like what makes one piece of news last longer than another or more significant than another. And I think it's basically always around battlegrounds. Wars are fought around limited resources. And I think probably the, you know, the most limited resource is talent, but the talent expresses itself in a number of areas.[00:03:01] swyx: And so I kind of focus on those, those areas at first. So the four wars that we cover are the data wars, the GPU rich, poor war, the multi modal war, And the RAG and Ops War. And I think you actually did a dedicated episode to that, so thanks for covering that. Yeah, yeah.[00:03:18] NLW: Not only did I do a dedicated episode, I actually used that.[00:03:22] NLW: I can't remember if I told you guys. 
I did give you big shoutouts. But I used it as a framework for a presentation at Intel's big AI event that they hold each year, where they have all their folks who are working on AI internally. And it totally resonated. That's amazing. Yeah, so, so, what got me thinking about it again is specifically this inflection news that we recently had, this sort of, you know, basically, I can't imagine that anyone who's listening wouldn't have thought about it, but, you know, inflection is a one of the big contenders, right?[00:03:53] NLW: I think probably most folks would have put them, you know, just a half step behind the anthropics and open AIs of the world in terms of labs, but it's a company that raised 1. 3 billion last year, less than a year ago. Reed Hoffman's a co founder Mustafa Suleyman, who's a co founder of DeepMind, you know, so it's like, this is not a a small startup, let's say, at least in terms of perception.[00:04:13] NLW: And then we get the news that basically most of the team, it appears, is heading over to Microsoft and they're bringing in a new CEO. And you know, I'm interested in, in, in kind of your take on how much that reflects, like hold aside, I guess, you know, all the other things that it might be about, how much it reflects this sort of the, the stark.[00:04:32] NLW: Brutal reality of competing in the frontier model space right now. And, you know, just the access to compute.[00:04:38] Alessio: There are a lot of things to say. So first of all, there's always somebody who's more GPU rich than you. So inflection is GPU rich by startup standard. I think about 22, 000 H100s, but obviously that pales compared to the, to Microsoft.[00:04:55] Alessio: The other thing is that this is probably good news, maybe for the startups. It's like being GPU rich, it's not enough. You know, like I think they were building something pretty interesting in, in pi of their own model of their own kind of experience. But at the end of the day, you're the interface that people consume as end users.[00:05:13] Alessio: It's really similar to a lot of the others. So and we'll tell, talk about GPT four and cloud tree and all this stuff. GPU poor, doing something. That the GPU rich are not interested in, you know we just had our AI center of excellence at Decibel and one of the AI leads at one of the big companies was like, Oh, we just saved 10 million and we use these models to do a translation, you know, and that's it.[00:05:39] Alessio: It's not, it's not a GI, it's just translation. So I think like the inflection part is maybe. A calling and a waking to a lot of startups then say, Hey, you know, trying to get as much capital as possible, try and get as many GPUs as possible. Good. But at the end of the day, it doesn't build a business, you know, and maybe what inflection I don't, I don't, again, I don't know the reasons behind the inflection choice, but if you say, I don't want to build my own company that has 1.[00:06:05] Alessio: 3 billion and I want to go do it at Microsoft, it's probably not a resources problem. It's more of strategic decisions that you're making as a company. So yeah, that was kind of my. I take on it.[00:06:15] swyx: Yeah, and I guess on my end, two things actually happened yesterday. 
It was a little bit quieter news, but Stability AI had some pretty major departures as well.[00:06:25] swyx: And you may not be considering it, but Stability is actually also a GPU rich company in the sense that they were the first new startup in this AI wave to brag about how many GPUs that they have. And you should join them. And you know, Imadis is definitely a GPU trader in some sense from his hedge fund days.[00:06:43] swyx: So Robin Rhombach and like the most of the Stable Diffusion 3 people left Stability yesterday as well. So yesterday was kind of like a big news day for the GPU rich companies, both Inflection and Stability having sort of wind taken out of their sails. I think, yes, it's a data point in the favor of Like, just because you have the GPUs doesn't mean you can, you automatically win.[00:07:03] swyx: And I think, you know, kind of I'll echo what Alessio says there. But in general also, like, I wonder if this is like the start of a major consolidation wave, just in terms of, you know, I think that there was a lot of funding last year and, you know, the business models have not been, you know, All of these things worked out very well.[00:07:19] swyx: Even inflection couldn't do it. And so I think maybe that's the start of a small consolidation wave. I don't think that's like a sign of AI winter. I keep looking for AI winter coming. I think this is kind of like a brief cold front. Yeah,[00:07:34] NLW: it's super interesting. So I think a bunch of A bunch of stuff here.[00:07:38] NLW: One is, I think, to both of your points, there, in some ways, there, there had already been this very clear demarcation between these two sides where, like, the GPU pores, to use the terminology, like, just weren't trying to compete on the same level, right? You know, the vast majority of people who have started something over the last year, year and a half, call it, were racing in a different direction.[00:07:59] NLW: They're trying to find some edge somewhere else. They're trying to build something different. If they're, if they're really trying to innovate, it's in different areas. And so it's really just this very small handful of companies that are in this like very, you know, it's like the coheres and jaspers of the world that like this sort of, you know, that are that are just sort of a little bit less resourced than, you know, than the other set that I think that this potentially even applies to, you know, everyone else that could clearly demarcate it into these two, two sides.[00:08:26] NLW: And there's only a small handful kind of sitting uncomfortably in the middle, perhaps. Let's, let's come back to the idea of, of the sort of AI winter or, you know, a cold front or anything like that. So this is something that I, I spent a lot of time kind of thinking about and noticing. And my perception is that The vast majority of the folks who are trying to call for sort of, you know, a trough of disillusionment or, you know, a shifting of the phase to that are people who either, A, just don't like AI for some other reason there's plenty of that, you know, people who are saying, You Look, they're doing way worse than they ever thought.[00:09:03] NLW: You know, there's a lot of sort of confirmation bias kind of thing going on. Or two, media that just needs a different narrative, right? Because they're sort of sick of, you know, telling the same story. 
Same thing happened last summer, when every every outlet jumped on the chat GPT at its first down month story to try to really like kind of hammer this idea that that the hype was too much.[00:09:24] NLW: Meanwhile, you have, you know, just ridiculous levels of investment from enterprises, you know, coming in. You have, you know, huge, huge volumes of, you know, individual behavior change happening. But I do think that there's nothing incoherent sort of to your point, Swyx, about that and the consolidation period.[00:09:42] NLW: Like, you know, if you look right now, for example, there are, I don't know, probably 25 or 30 credible, like, build your own chatbot. platforms that, you know, a lot of which have, you know, raised funding. There's no universe in which all of those are successful across, you know, even with a, even, even with a total addressable market of every enterprise in the world, you know, you're just inevitably going to see some amount of consolidation.[00:10:08] NLW: Same with, you know, image generators. There are, if you look at A16Z's top 50 consumer AI apps, just based on, you know, web traffic or whatever, they're still like I don't know, a half. Dozen or 10 or something, like, some ridiculous number of like, basically things like Midjourney or Dolly three. And it just seems impossible that we're gonna have that many, you know, ultimately as, as, as sort of, you know, going, going concerned.[00:10:33] NLW: So, I don't know. I, I, I think that the, there will be inevitable consolidation 'cause you know. It's, it's also what kind of like venture rounds are supposed to do. You're not, not everyone who gets a seed round is supposed to get to series A and not everyone who gets a series A is supposed to get to series B.[00:10:46] NLW: That's sort of the natural process. I think it will be tempting for a lot of people to try to infer from that something about AI not being as sort of big or as as sort of relevant as, as it was hyped up to be. But I, I kind of think that's the wrong conclusion to come to.[00:11:02] Alessio: I I would say the experimentation.[00:11:04] Alessio: Surface is a little smaller for image generation. So if you go back maybe six, nine months, most people will tell you, why would you build a coding assistant when like Copilot and GitHub are just going to win everything because they have the data and they have all the stuff. If you fast forward today, A lot of people use Cursor everybody was excited about the Devin release on Twitter.[00:11:26] Alessio: There are a lot of different ways of attacking the market that are not completion of code in the IDE. And even Cursors, like they evolved beyond single line to like chat, to do multi line edits and, and all that stuff. Image generation, I would say, yeah, as a, just as from what I've seen, like maybe the product innovation has slowed down at the UX level and people are improving the models.[00:11:50] Alessio: So the race is like, how do I make better images? It's not like, how do I make the user interact with the generation process better? And that gets tough, you know? It's hard to like really differentiate yourselves. So yeah, that's kind of how I look at it. And when we think about multimodality, maybe the reason why people got so excited about Sora is like, oh, this is like a completely It's not a better image model.[00:12:13] Alessio: This is like a completely different thing, you know? 
And I think the creative mind It's always looking for something that impacts the viewer in a different way, you know, like they really want something different versus the developer mind. It's like, Oh, I, I just, I have this like very annoying thing I want better.[00:12:32] Alessio: I have this like very specific use cases that I want to go after. So it's just different. And that's why you see a lot more companies in image generation. But I agree with you that. If you fast forward there, there's not going to be 10 of them, you know, it's probably going to be one or[00:12:46] swyx: two. Yeah, I mean, to me, that's why I call it a war.[00:12:49] swyx: Like, individually, all these companies can make a story that kind of makes sense, but collectively, they cannot all be true. Therefore, they all, there is some kind of fight over limited resources here. Yeah, so[00:12:59] NLW: it's interesting. We wandered very naturally into sort of another one of these wars, which is the multimodality kind of idea, which is, you know, basically a question of whether it's going to be these sort of big everything models that end up winning or whether, you know, you're going to have really specific things, you know, like something, you know, Dolly 3 inside of sort of OpenAI's larger models versus, you know, a mid journey or something like that.[00:13:24] NLW: And at first, you know, I was kind of thinking like, For most of the last, call it six months or whatever, it feels pretty definitively both and in some ways, you know, and that you're, you're seeing just like great innovation on sort of the everything models, but you're also seeing lots and lots happen at sort of the level of kind of individual use cases.[00:13:45] Sora[00:13:45] NLW: But then Sora comes along and just like obliterates what I think anyone thought you know, where we were when it comes to video generation. So how are you guys thinking about this particular battle or war at the moment?[00:13:59] swyx: Yeah, this was definitely a both and story, and Sora tipped things one way for me, in terms of scale being all you need.[00:14:08] swyx: And the benefit, I think, of having multiple models being developed under one roof. I think a lot of people aren't aware that Sora was developed in a similar fashion to Dolly 3. And Dolly3 had a very interesting paper out where they talked about how they sort of bootstrapped their synthetic data based on GPT 4 vision and GPT 4.[00:14:31] swyx: And, and it was just all, like, really interesting, like, if you work on one modality, it enables you to work on other modalities, and all that is more, is, is more interesting. I think it's beneficial if it's all in the same house, whereas the individual startups who don't, who sort of carve out a single modality and work on that, definitely won't have the state of the art stuff on helping them out on synthetic data.[00:14:52] swyx: So I do think like, The balance is tilted a little bit towards the God model companies, which is challenging for the, for the, for the the sort of dedicated modality companies. But everyone's carving out different niches. You know, like we just interviewed Suno ai, the sort of music model company, and, you know, I don't see opening AI pursuing music anytime soon.[00:15:12] Suno[00:15:12] swyx: Yeah,[00:15:13] NLW: Suno's been phenomenal to play with. 
Suno has done that rare thing where, which I think a number of different AI product categories have done, where people who don't consider themselves particularly interested in doing the thing that the AI enables find themselves doing a lot more of that thing, right?[00:15:29] NLW: Like, it'd be one thing if Just musicians were excited about Suno and using it but what you're seeing is tons of people who just like music all of a sudden like playing around with it and finding themselves kind of down that rabbit hole, which I think is kind of like the highest compliment that you can give one of these startups at the[00:15:45] swyx: early days of it.[00:15:46] swyx: Yeah, I, you know, I, I asked them directly, you know, in the interview about whether they consider themselves mid journey for music. And he had a more sort of nuanced response there, but I think that probably the business model is going to be very similar because he's focused on the B2C element of that. So yeah, I mean, you know, just to, just to tie back to the question about, you know, You know, large multi modality companies versus small dedicated modality companies.[00:16:10] swyx: Yeah, highly recommend people to read the Sora blog posts and then read through to the Dali blog posts because they, they strongly correlated themselves with the same synthetic data bootstrapping methods as Dali. And I think once you make those connections, you're like, oh, like it, it, it is beneficial to have multiple state of the art models in house that all help each other.[00:16:28] swyx: And these, this, that's the one thing that a dedicated modality company cannot do.[00:16:34] The GPT-4 Class Landscape[00:16:34] NLW: So I, I wanna jump, I wanna kind of build off that and, and move into the sort of like updated GPT-4 class landscape. 'cause that's obviously been another big change over the last couple months. But for the sake of completeness, is there anything that's worth touching on with with sort of the quality?[00:16:46] NLW: Quality data or sort of a rag ops wars just in terms of, you know, anything that's changed, I guess, for you fundamentally in the last couple of months about where those things stand.[00:16:55] swyx: So I think we're going to talk about rag for the Gemini and Clouds discussion later. And so maybe briefly discuss the data piece.[00:17:03] Data War: Reddit x Google[00:17:03] swyx: I think maybe the only new thing was this Reddit deal with Google for like a 60 million dollar deal just ahead of their IPO, very conveniently turning Reddit into a AI data company. Also, very, very interestingly, a non exclusive deal, meaning that Reddit can resell that data to someone else. And it probably does become table stakes.[00:17:23] swyx: A lot of people don't know, but a lot of the web text dataset that originally started for GPT 1, 2, and 3 was actually scraped from GitHub. from Reddit at least the sort of vote scores. And I think, I think that's a, that's a very valuable piece of information. So like, yeah, I think people are figuring out how to pay for data.[00:17:40] swyx: People are suing each other over data. This, this, this war is, you know, definitely very, very much heating up. And I don't think, I don't see it getting any less intense. I, you know, next to GPUs, data is going to be the most expensive thing in, in a model stack company. And. 
You know, a lot of people are resorting to synthetic versions of it, which may or may not be kosher based on how far along or how commercially blessed the, the forms of creating that synthetic data are.[00:18:11] swyx: I don't know if Alessio, you have any other interactions with like Data source companies, but that's my two cents.[00:18:17] Alessio: Yeah yeah, I actually saw Quentin Anthony from Luther. ai at GTC this week. He's also been working on this. I saw Technium. He's also been working on the data side. I think especially in open source, people are like, okay, if everybody is putting the gates up, so to speak, to the data we need to make it easier for people that don't have 50 million a year to get access to good data sets.[00:18:38] Alessio: And Jensen, at his keynote, he did talk about synthetic data a little bit. So I think that's something that we'll definitely hear more and more of in the enterprise, which never bodes well, because then all the, all the people with the data are like, Oh, the enterprises want to pay now? Let me, let me put a pay here stripe link so that they can give me 50 million.[00:18:57] Alessio: But it worked for Reddit. I think the stock is up. 40 percent today after opening. So yeah, I don't know if it's all about the Google deal, but it's obviously Reddit has been one of those companies where, hey, you got all this like great community, but like, how are you going to make money? And like, they try to sell the avatars.[00:19:15] Alessio: I don't know if that it's a great business for them. The, the data part sounds as an investor, you know, the data part sounds a lot more interesting than, than consumer[00:19:25] swyx: cosmetics. Yeah, so I think, you know there's more questions around data you know, I think a lot of people are talking about the interview that Mira Murady did with the Wall Street Journal, where she, like, just basically had no, had no good answer for where they got the data for Sora.[00:19:39] swyx: I, I think this is where, you know, there's, it's in nobody's interest to be transparent about data, and it's, it's kind of sad for the state of ML and the state of AI research but it is what it is. We, we have to figure this out as a society, just like we did for music and music sharing. You know, in, in sort of the Napster to Spotify transition, and that might take us a decade.[00:19:59] swyx: Yeah, I[00:20:00] NLW: do. I, I agree. I think, I think that you're right to identify it, not just as that sort of technical problem, but as one where society has to have a debate with itself. Because I think that there's, if you rationally within it, there's Great kind of points on all side, not to be the sort of, you know, person who sits in the middle constantly, but it's why I think a lot of these legal decisions are going to be really important because, you know, the job of judges is to listen to all this stuff and try to come to things and then have other judges disagree.[00:20:24] NLW: And, you know, and have the rest of us all debate at the same time. By the way, as a total aside, I feel like the synthetic data right now is like eggs in the 80s and 90s. Like, whether they're good for you or bad for you, like, you know, we, we get one study that's like synthetic data, you know, there's model collapse.[00:20:42] NLW: And then we have like a hint that llama, you know, to the most high performance version of it, which was one they didn't release was trained on synthetic data. So maybe it's good. 
It's like, I just feel like every, every other week I'm seeing something sort of different about whether it's a good or bad for, for these models.[00:20:56] swyx: Yeah. The branding of this is pretty poor. I would kind of tell people to think about it like cholesterol. There's good cholesterol, bad cholesterol. And you can have, you know, good amounts of both. But at this point, it is absolutely without a doubt that most large models from here on out will all be trained as some kind of synthetic data and that is not a bad thing.[00:21:16] swyx: There are ways in which you can do it poorly. Whether it's commercial, you know, in terms of commercial sourcing or in terms of the model performance. But it's without a doubt that good synthetic data is going to help your model. And this is just a question of like where to obtain it and what kinds of synthetic data are valuable.[00:21:36] swyx: You know, if even like alpha geometry, you know, was, was a really good example from like earlier this year.[00:21:42] NLW: If you're using the cholesterol analogy, then my, then my egg thing can't be that far off. Let's talk about the sort of the state of the art and the, and the GPT 4 class landscape and how that's changed.[00:21:53] Gemini 1.5 vs Claude 3[00:21:53] NLW: Cause obviously, you know, sort of the, the two big things or a couple of the big things that have happened. Since we last talked, we're one, you know, Gemini first announcing that a model was coming and then finally it arriving, and then very soon after a sort of a different model arriving from Gemini and and Cloud three.[00:22:11] NLW: So I guess, you know, I'm not sure exactly where the right place to start with this conversation is, but, you know, maybe very broadly speaking which of these do you think have made a bigger impact? Thank you.[00:22:20] Alessio: Probably the one you can use, right? So, Cloud. Well, I'm sure Gemini is going to be great once they let me in, but so far I haven't been able to.[00:22:29] Alessio: I use, so I have this small podcaster thing that I built for our podcast, which does chapters creation, like named entity recognition, summarization, and all of that. Cloud Tree is, Better than GPT 4. Cloud2 was unusable. So I use GPT 4 for everything. And then when Opus came out, I tried them again side by side and I posted it on, on Twitter as well.[00:22:53] Alessio: Cloud is better. It's very good, you know, it's much better, it seems to me, it's much better than GPT 4 at doing writing that is more, you know, I don't know, it just got good vibes, you know, like the GPT 4 text, you can tell it's like GPT 4, you know, it's like, it always uses certain types of words and phrases and, you know, maybe it's just me because I've now done it for, you know, So, I've read like 75, 80 generations of these things next to each other.[00:23:21] Alessio: Clutter is really good. I know everybody is freaking out on twitter about it, my only experience of this is much better has been on the podcast use case. But I know that, you know, Quran from from News Research is a very big opus pro, pro opus person. So, I think that's also It's great to have people that actually care about other models.[00:23:40] Alessio: You know, I think so far to a lot of people, maybe Entropic has been the sibling in the corner, you know, it's like Cloud releases a new model and then OpenAI releases Sora and like, you know, there are like all these different things, but yeah, the new models are good. 
It's interesting.[00:23:55] NLW: My my perception is definitely that just, just observationally, Cloud 3 is certainly the first thing that I've seen where lots of people.[00:24:06] NLW: They're, no one's debating evals or anything like that. They're talking about the specific use cases that they have, that they used to use chat GPT for every day, you know, day in, day out, that they've now just switched over. And that has, I think, shifted a lot of the sort of like vibe and sentiment in the space too.[00:24:26] NLW: And I don't necessarily think that it's sort of a A like full you know, sort of full knock. Let's put it this way. I think it's less bad for open AI than it is good for anthropic. I think that because GPT 5 isn't there, people are not quite willing to sort of like, you know get overly critical of, of open AI, except in so far as they're wondering where GPT 5 is.[00:24:46] NLW: But I do think that it makes, Anthropic look way more credible as a, as a, as a player, as a, you know, as a credible sort of player, you know, as opposed to to, to where they were.[00:24:57] Alessio: Yeah. And I would say the benchmarks veil is probably getting lifted this year. I think last year. People were like, okay, this is better than this on this benchmark, blah, blah, blah, because maybe they did not have a lot of use cases that they did frequently.[00:25:11] Alessio: So it's hard to like compare yourself. So you, you defer to the benchmarks. I think now as we go into 2024, a lot of people have started to use these models from, you know, from very sophisticated things that they run in production to some utility that they have on their own. Now they can just run them side by side.[00:25:29] Alessio: And it's like, Hey, I don't care that like. The MMLU score of Opus is like slightly lower than GPT 4. It just works for me, you know, and I think that's the same way that traditional software has been used by people, right? Like you just strive for yourself and like, which one does it work, works best for you?[00:25:48] Alessio: Like nobody looks at benchmarks outside of like sales white papers, you know? And I think it's great that we're going more in that direction. We have a episode with Adapt coming out this weekend. I'll and some of their model releases, they specifically say, We do not care about benchmarks, so we didn't put them in, you know, because we, we don't want to look good on them.[00:26:06] Alessio: We just want the product to work. And I think more and more people will, will[00:26:09] swyx: go that way. Yeah. I I would say like, it does take the wind out of the sails for GPT 5, which I know where, you know, Curious about later on. I think anytime you put out a new state of the art model, you have to break through in some way.[00:26:21] swyx: And what Claude and Gemini have done is effectively take away any advantage to saying that you have a million token context window. Now everyone's just going to be like, Oh, okay. Now you just match the other two guys. 
And so that puts An insane amount of pressure on what gpt5 is going to be because it's just going to have like the only option it has now because all the other models are multimodal all the other models are long context all the other models have perfect recall gpt5 has to match everything and do more to to not be a flop[00:26:58] AI Breakdown Part 2[00:26:58] NLW: hello friends back again with part two if you haven't heard part one of this conversation i suggest you go check it out but to be honest they are kind of actually separable In this conversation, we get into a topic that I think Alessio and Swyx are very well positioned to discuss, which is what developers care about right now, what people are trying to build around.[00:27:16] NLW: I honestly think that one of the best ways to see the future in an industry like AI is to try to dig deep on what developers and entrepreneurs are attracted to build, even if it hasn't made it to the news pages yet. So consider this your preview of six months from now, and let's dive in. Let's bring it to the GPT 5 conversation.[00:27:33] Next Frontiers: Llama 3, GPT-5, Gemini 2, Claude 4[00:27:33] NLW: I mean, so, so I think that that's a great sort of assessment of just how the stakes have been raised, you know is your, I mean, so I guess maybe, maybe I'll, I'll frame this less as a question, just sort of something that, that I, that I've been watching right now, the only thing that makes sense to me with how.[00:27:50] NLW: Fundamentally unbothered and unstressed OpenAI seems about everything is that they're sitting on something that does meet all that criteria, right? Because, I mean, even in the Lex Friedman interview that, that Altman recently did, you know, he's talking about other things coming out first. He's talking about, he's just like, he, listen, he, he's good and he could play nonchalant, you know, if he wanted to.[00:28:13] NLW: So I don't want to read too much into it, but. You know, they've had so long to work on this, like unless that we are like really meaningfully running up against some constraint, it just feels like, you know, there's going to be some massive increase, but I don't know. What do you guys think?[00:28:28] swyx: Hard to speculate.[00:28:29] swyx: You know, at this point, they're, they're pretty good at PR and they're not going to tell you anything that they don't want to. And he can tell you one thing and change their minds the next day. So it's, it's, it's really, you know, I've always said that model version numbers are just marketing exercises, like they have something and it's always improving and at some point you just cut it and decide to call it GPT 5.[00:28:50] swyx: And it's more just about defining an arbitrary level at which they're ready and it's up to them on what ready means. We definitely did see some leaks on GPT 4. 5, as I think a lot of people reported and I'm not sure if you covered it. So it seems like there might be an intermediate release. But I did feel, coming out of the Lex Friedman interview, that GPT 5 was nowhere near.[00:29:11] swyx: And you know, it was kind of a sharp contrast to Sam talking at Davos in February, saying that, you know, it was his top priority. So I find it hard to square. And honestly, like, there's also no point Reading too much tea leaves into what any one person says about something that hasn't happened yet or has a decision that hasn't been taken yet.[00:29:31] swyx: Yeah, that's, that's my 2 cents about it. 
Like, calm down, let's just build .[00:29:35] Alessio: Yeah. The, the February rumor was that they were gonna work on AI agents, so I don't know, maybe they're like, yeah,[00:29:41] swyx: they had two agent two, I think two agent projects, right? One desktop agent and one sort of more general yeah, sort of GPTs like agent and then Andre left, so he was supposed to be the guy on that.[00:29:52] swyx: What did Andre see? What did he see? I don't know. What did he see?[00:29:56] Alessio: I don't know. But again, it's just like the rumors are always floating around, you know but I think like, this is, you know, we're not going to get to the end of the year without Jupyter you know, that's definitely happening. I think the biggest question is like, are Anthropic and Google.[00:30:13] Alessio: Increasing the pace, you know, like it's the, it's the cloud four coming out like in 12 months, like nine months. What's the, what's the deal? Same with Gemini. They went from like one to 1. 5 in like five days or something. So when's Gemini 2 coming out, you know, is that going to be soon? I don't know.[00:30:31] Alessio: There, there are a lot of, speculations, but the good thing is that now you can see a world in which OpenAI doesn't rule everything. You know, so that, that's the best, that's the best news that everybody got, I would say.[00:30:43] swyx: Yeah, and Mistral Large also dropped in the last month. And, you know, not as, not quite GPT 4 class, but very good from a new startup.[00:30:52] swyx: So yeah, we, we have now slowly changed in landscape, you know. In my January recap, I was complaining that nothing's changed in the landscape for a long time. But now we do exist in a world, sort of a multipolar world where Cloud and Gemini are legitimate challengers to GPT 4 and hopefully more will emerge as well hopefully from meta.[00:31:11] Open Source Models - Mistral, Grok[00:31:11] NLW: So speak, let's actually talk about sort of the open source side of this for a minute. So Mistral Large, notable because it's, it's not available open source in the same way that other things are, although I think my perception is that the community has largely given them Like the community largely recognizes that they want them to keep building open source stuff and they have to find some way to fund themselves that they're going to do that.[00:31:27] NLW: And so they kind of understand that there's like, they got to figure out how to eat, but we've got, so, you know, there there's Mistral, there's, I guess, Grok now, which is, you know, Grok one is from, from October is, is open[00:31:38] swyx: sourced at, yeah. Yeah, sorry, I thought you thought you meant Grok the chip company.[00:31:41] swyx: No, no, no, yeah, you mean Twitter Grok.[00:31:43] NLW: Although Grok the chip company, I think is even more interesting in some ways, but and then there's the, you know, obviously Llama3 is the one that sort of everyone's wondering about too. And, you know, my, my sense of that, the little bit that, you know, Zuckerberg was talking about Llama 3 earlier this year, suggested that, at least from an ambition standpoint, he was not thinking about how do I make sure that, you know, meta content, you know, keeps, keeps the open source thrown, you know, vis a vis Mistral.[00:32:09] NLW: He was thinking about how you go after, you know, how, how he, you know, releases a thing that's, you know, every bit as good as whatever OpenAI is on at that point.[00:32:16] Alessio: Yeah. 
From what I heard in the hallways at, at GDC, Llama 3, the, the biggest model will be, you know, 260 to 300 billion parameters, so that that's quite large.[00:32:26] Alessio: That's not an open source model. You know, you cannot give people a 300 billion parameter model and ask them to run it. You know, it's very compute intensive. So I think it is, it[00:32:35] swyx: can be open source. It's just, it's going to be difficult to run, but that's a separate question.[00:32:39] Alessio: It's more like, as you think about what they're doing it for, you know, it's not like empowering the person running[00:32:45] Alessio: Llama on, on their laptop, it's like, oh, you can actually now use this to go after OpenAI, to go after Anthropic, to go after some of these companies at like the middle complexity level, so to speak. Yeah. So obviously, you know, we had Soumith Chintala on the podcast, they're doing a lot here, they're making PyTorch better.[00:33:03] Alessio: You know, they want to, that's kind of like maybe a little bit of a short on Nvidia, in a way, trying to get some of the CUDA dominance out of it. Yeah, no, it's great. The, I love the Zuck destroying a lot of monopolies arc. You know, it's, it's been very entertaining. Let's bridge[00:33:18] NLW: into the sort of big tech side of this, because this is obviously like, so I think actually when I did my episode, this was one of the I added this as one of as an additional war that, that's something that I'm paying attention to.[00:33:29] NLW: So we've got Microsoft's moves with Inflection, which I think potentially are being read as a shift vis-a-vis the relationship with OpenAI, which also the sort of Mistral Large relationship seems to reinforce as well. We have Apple potentially entering the race, finally, you know, giving up Project Titan and and, and kind of trying to spend more effort on this.[00:33:50] NLW: Although, counterpoint, we also have them talking about it, or there being reports of a deal with Google, which, you know, is interesting to sort of see what their strategy there is. And then, you know, Meta's been largely quiet. We kind of just talked about the main piece, but, you know, there's, and then there's spoilers like Elon.[00:34:07] NLW: I mean, you know, what, what of those things has sort of been most interesting to you guys as you think about what's going to shake out for the rest of this[00:34:13] Apple MM1[00:34:13] swyx: year? I'll take a crack. So the reason we don't have a fifth war for the Big Tech Wars is that's one of those things where I just feel like we don't cover it differently from other media channels, I guess.[00:34:26] swyx: Sure, yeah. In our anti-interestingness, we actually say, like, we try not to cover the Big Tech Game of Thrones, or it's proxied through Twitter. You know, all the other four wars anyway, so there's just a lot of overlap. Yeah, I think absolutely, personally, the most interesting one is Apple entering the race.[00:34:41] swyx: They actually released, they announced their first large language model that they trained themselves. It's like a 30 billion multimodal model. People weren't that impressed, but it was like the first time that Apple has kind of showcased that, yeah, we're training large models in house as well. Of course, like, they might be doing this deal with Google.[00:34:57] swyx: I don't know. It sounds very sort of rumor-y to me. And it's probably, if it's on device, it's going to be a smaller model. So something like a Gemma. It's going to be smarter autocomplete. 
I don't know what to say. I'm still here dealing with, like, Siri, which hasn't, probably hasn't been updated since God knows when it was introduced.[00:35:16] swyx: It's horrible. I, you know, it, it, it makes me so angry. So I, I, one, as an Apple customer and user, I, I'm just hoping for better AI on Apple itself. But two, they are the gold standard when it comes to local devices, personal compute and, and trust, like you, you trust them with your data. And. I think that's what a lot of people are looking for in AI, that they have, they love the benefits of AI, they don't love the downsides, which is that you have to send all your data to some cloud somewhere.[00:35:45] swyx: And some of this data that we're going to feed AI is just the most personal data there is. So Apple being like one of the most trusted personal data companies, I think it's very important that they enter the AI race, and I hope to see more out of them.[00:35:58] Alessio: To me, the, the biggest question with the Google deal is like, who's paying who?[00:36:03] Alessio: Because for the browsers, Google pays Apple like 18, 20 billion every year to be the default browser. Is Google going to pay you to have Gemini or is Apple paying Google to have Gemini? I think that's, that's like what I'm most interested to figure out because with the browsers, it's like, it's the entry point to the thing.[00:36:21] Alessio: So it's really valuable to be the default. That's why Google pays. But I wonder if like the perception in AI is going to be like, Hey. You just have to have a good local model on my phone to be worth me purchasing your device. And that was, that's kind of drive Apple to be the one buying the model. But then, like Shawn said, they're doing the MM1 themselves.[00:36:40] Alessio: So are they saying we do models, but they're not as good as the Google ones? I don't know. The whole thing is, it's really confusing, but. It makes for great meme material on on Twitter.[00:36:51] swyx: Yeah, I mean, I think, like, they are possibly more than OpenAI and Microsoft and Amazon. They are the most full stack company there is in computing, and so, like, they own the chips, man.[00:37:05] swyx: Like, they manufacture everything so if, if, if there was a company that could do that. You know, seriously challenge the other AI players. It would be Apple. And it's, I don't think it's as hard as self driving. So like maybe they've, they've just been investing in the wrong thing this whole time. We'll see.[00:37:21] swyx: Wall Street certainly thinks[00:37:22] NLW: so. Wall Street loved that move, man. There's a big, a big sigh of relief. Well, let's, let's move away from, from sort of the big stuff. I mean, the, I think to both of your points, it's going to.[00:37:33] Meta's $800b AI rebrand[00:37:33] NLW: Can I, can[00:37:34] swyx: I, can I, can I jump on factoid about this, this Wall Street thing? I went and looked at when Meta went from being a VR company to an AI company.[00:37:44] swyx: And I think the stock I'm trying to look up the details now. The stock has gone up 187% since Lamo one. Yeah. Which is $830 billion in market value created in the past year. . Yeah. Yeah.[00:37:57] NLW: It's, it's, it's like, remember if you guys haven't Yeah. 
If you haven't seen the chart, it's actually like remarkable.[00:38:02] NLW: If you draw a little[00:38:03] swyx: arrow on it, it's like, no, we're an AI company now and forget the VR thing.[00:38:10] NLW: It's it, it is an interesting, no, it's, I, I think, alessio, you called it sort of like Zuck's Disruptor Arc or whatever. He, he really does. He is in the midst of a, of a total, you know, I don't know if it's a redemption arc or it's just, it's something different where, you know, he, he's sort of the spoiler.[00:38:25] NLW: Like people loved him just freestyle talking about why he thought they had a better headset than Apple. But even if they didn't agree, they just loved it. He was going direct to camera and talking about it for, you know, five minutes or whatever. So that, that's a fascinating shift that I don't think anyone had on their bingo card, you know, whatever, two years ago.[00:38:41] NLW: Yeah. Yeah,[00:38:42] swyx: we still[00:38:43] Alessio: didn't see and fight Elon though, so[00:38:45] swyx: that's what I'm really looking forward to. I mean, hey, don't, don't, don't write it off, you know, maybe just these things take a while to happen. But we need to see and fight in the Coliseum. No, I think you know, in terms of like self management, life leadership, I think he has, there's a lot of lessons to learn from him.[00:38:59] swyx: You know he might, you know, you might kind of quibble with, like, the social impact of Facebook, but just himself as a in terms of personal growth and, and, you know, Per perseverance through like a lot of change and you know, everyone throwing stuff his way. I think there's a lot to say about like, to learn from, from Zuck, which is crazy 'cause he's my age.[00:39:18] swyx: Yeah. Right.[00:39:20] AI Engineer landscape - from baby AGIs to vertical Agents[00:39:20] NLW: Awesome. Well, so, so one of the big things that I think you guys have, you know, distinct and, and unique insight into being where you are and what you work on is. You know, what developers are getting really excited about right now. And by that, I mean, on the one hand, certainly, you know, like startups who are actually kind of formalized and formed to startups, but also, you know, just in terms of like what people are spending their nights and weekends on what they're, you know, coming to hackathons to do.[00:39:45] NLW: And, you know, I think it's a, it's a, it's, it's such a fascinating indicator for, for where things are headed. Like if you zoom back a year, right now was right when everyone was getting so, so excited about. AI agent stuff, right? Auto, GPT and baby a GI. And these things were like, if you dropped anything on YouTube about those, like instantly tens of thousands of views.[00:40:07] NLW: I know because I had like a 50,000 view video, like the second day that I was doing the show on YouTube, you know, because I was talking about auto GPT. And so anyways, you know, obviously that's sort of not totally come to fruition yet, but what are some of the trends in what you guys are seeing in terms of people's, people's interest and, and, and what people are building?[00:40:24] Alessio: I can start maybe with the agents part and then I know Shawn is doing a diffusion meetup tonight. There's a lot of, a lot of different things. The, the agent wave has been the most interesting kind of like dream to reality arc. 
So out of GPT, I think they went, From zero to like 125, 000 GitHub stars in six weeks, and then one year later, they have 150, 000 stars.[00:40:49] Alessio: So there's kind of been a big plateau. I mean, you might say there are just not that many people that can start it. You know, everybody already started it. But the promise of, hey, I'll just give you a goal, and you do it. I think it's like, amazing to get people's imagination going. You know, they're like, oh, wow, this This is awesome.[00:41:08] Alessio: Everybody, everybody can try this to do anything. But then as technologists, you're like, well, that's, that's just like not possible, you know, we would have like solved everything. And I think it takes a little bit to go from the promise and the hope that people show you to then try it yourself and going back to say, okay, this is not really working for me.[00:41:28] Alessio: And David Wong from Adept, you know, they in our episode, he specifically said. We don't want to do a bottom up product. You know, we don't want something that everybody can just use and try because it's really hard to get it to be reliable. So we're seeing a lot of companies doing vertical agents that are narrow for a specific domain, and they're very good at something.[00:41:49] Alessio: Mike Conover, who was at Databricks before, is also a friend of Latentspace. He's doing this new company called BrightWave doing AI agents for financial research, and that's it, you know, and they're doing very well. There are other companies doing it in security, doing it in compliance, doing it in legal.[00:42:08] Alessio: All of these things that like, people, nobody just wakes up and say, Oh, I cannot wait to go on AutoGPD and ask it to do a compliance review of my thing. You know, just not what inspires people. So I think the gap on the developer side has been the more bottom sub hacker mentality is trying to build this like very Generic agents that can do a lot of open ended tasks.[00:42:30] Alessio: And then the more business side of things is like, Hey, If I want to raise my next round, I can not just like sit around the mess, mess around with like super generic stuff. I need to find a use case that really works. And I think that that is worth for, for a lot of folks in parallel, you have a lot of companies doing evals.[00:42:47] Alessio: There are dozens of them that just want to help you measure how good your models are doing. Again, if you build evals, you need to also have a restrained surface area to actually figure out whether or not it's good, right? Because you cannot eval anything on everything under the sun. So that's another category where I've seen from the startup pitches that I've seen, there's a lot of interest in, in the enterprise.[00:43:11] Alessio: It's just like really. Fragmented because the production use cases are just coming like now, you know, there are not a lot of long established ones to, to test against. And so does it, that's kind of on the virtual agents and then the robotic side it's probably been the thing that surprised me the most at NVIDIA GTC, the amount of robots that were there that were just like robots everywhere.[00:43:33] Alessio: Like, both in the keynote and then on the show floor, you would have Boston Dynamics dogs running around. There was, like, this, like fox robot that had, like, a virtual face that, like, talked to you and, like, moved in real time. There were industrial robots. 
NVIDIA did a big push on their own Omniverse thing, which is, like, this Digital twin of whatever environments you're in that you can use to train the robots agents.[00:43:57] Alessio: So that kind of takes people back to the reinforcement learning days, but yeah, agents, people want them, you know, people want them. I give a talk about the, the rise of the full stack employees and kind of this future, the same way full stack engineers kind of work across the stack. In the future, every employee is going to interact with every part of the organization through agents and AI enabled tooling.[00:44:17] Alessio: This is happening. It just needs to be a lot more narrow than maybe the first approach that we took, which is just put a string in AutoGPT and pray. But yeah, there's a lot of super interesting stuff going on.[00:44:27] swyx: Yeah. Well, he Let's recover a lot of stuff there. I'll separate the robotics piece because I feel like that's so different from the software world.[00:44:34] swyx: But yeah, we do talk to a lot of engineers and you know, that this is our sort of bread and butter. And I do agree that vertical agents have worked out a lot better than the horizontal ones. I think all You know, the point I'll make here is just the reason AutoGPT and maybe AGI, you know, it's in the name, like they were promising AGI.[00:44:53] swyx: But I think people are discovering that you cannot engineer your way to AGI. It has to be done at the model level and all these engineering, prompt engineering hacks on top of it weren't really going to get us there in a meaningful way without much further, you know, improvements in the models. I would say, I'll go so far as to say, even Devin, which is, I would, I think the most advanced agent that we've ever seen, still requires a lot of engineering and still probably falls apart a lot in terms of, like, practical usage.[00:45:22] swyx: Or it's just, Way too slow and expensive for, you know, what it's, what it's promised compared to the video. So yeah, that's, that's what, that's what happened with agents from, from last year. But I, I do, I do see, like, vertical agents being very popular and, and sometimes you, like, I think the word agent might even be overused sometimes.[00:45:38] swyx: Like, people don't really care whether or not you call it an AI agent, right? Like, does it replace boring menial tasks that I do That I might hire a human to do, or that the human who is hired to do it, like, actually doesn't really want to do. And I think there's absolutely ways in sort of a vertical context that you can actually go after very routine tasks that can be scaled out to a lot of, you know, AI assistants.[00:46:01] swyx: So, so yeah, I mean, and I would, I would sort of basically plus one what let's just sit there. I think it's, it's very, very promising and I think more people should work on it, not less. Like there's not enough people. Like, we, like, this should be the, the, the main thrust of the AI engineer is to look out, look for use cases and, and go to a production with them instead of just always working on some AGI promising thing that never arrives.[00:46:21] swyx: I,[00:46:22] NLW: I, I can only add that so I've been fiercely making tutorials behind the scenes around basically everything you can imagine with AI. We've probably done, we've done about 300 tutorials over the last couple of months. 
And the verticalized anything, right, like this is a solution for your particular job or role, even if it's way less interesting or kind of sexy, is so radically more useful to people, in terms of intersecting with how, like, those are the ways that people are actually[00:46:50] NLW: adopting AI in a lot of cases, it's just a thing that I do over and over again. By the way, I think that's the same way that even the generalized models are getting adopted. You know, it's like, I use Midjourney for lots of stuff, but the main thing I use it for is YouTube thumbnails every day. Like day in, day out, I will always do a YouTube thumbnail, you know, or two with Midjourney, right?[00:47:09] NLW: And you can start to extrapolate that across a lot of things, and all of a sudden AI looks revolutionary because of a million small changes rather than one sort of big dramatic change. And I think that the verticalization of agents is sort of a great example of how that's[00:47:26] swyx: going to play out too.[00:47:28] Adept episode - Screen Multimodality[00:47:28] swyx: So I'll have one caveat here, which is, I think that because multimodal models are now commonplace, like Claude, Gemini, OpenAI, all very, very easily multimodal, Apple's easily multimodal, all this stuff, there is a switch for agents, for sort of general desktop browsing, toward a[00:48:04] swyx: version of the agent where they're not specifically taking in text or anything. They're just watching your screen, just like someone else would, and piloting it by vision. And, you know, in the episode with David that we'll have dropped by the time this airs, I think that is the promise of Adept, and that is the promise of what a lot of these sort of desktop agents are, and that is the more general purpose system that could be as big as the browser, the operating system. Like, people really want to build that foundational piece of software in AI.[00:48:38] swyx: And I would see, like, the potential there for desktop agents being that you can have sort of self-driving computers. You know, don't write the horizontal piece off. I just think it'll take a while to get there.[00:48:48] NLW: What else are you guys seeing that's interesting to you? I'm looking at your notes and I see a ton of categories.[00:48:54] Top Model Research from January Recap[00:48:54] swyx: Yeah, so I'll take the next two as one category, which is basically alternative architectures, right? The two main things that everyone following AI kind of knows now are, one, the diffusion architecture, and two, let's just say the decoder-only transformer architecture that is popularized by GPT.[00:49:12] swyx: You can look on YouTube for thousands and thousands of tutorials on each of those things. What we are talking about here is what's next, what people are researching, and what could be on the horizon that takes the place of those other two things. So first of all, we'll talk about transformer architectures and then diffusion.[00:49:25] swyx: So for transformers, the two leading candidates are effectively RWKV and the state space models, the most recent of which is Mamba, but there are others like StripedHyena and the S4 and H3 stuff coming out of Hazy Research at Stanford.
And all of those are non-quadratic language models that promise to scale a lot better than the traditional transformer.[00:49:47] swyx: This might be too theoretical for most people right now, but it's gonna come out in weird ways. Right now the talk of the town is that Claude and Gemini have a million tokens of context and, like, whoa, you can put in, you know, two hours of video now. Okay, but what if we could throw in two hundred thousand hours of video?[00:50:09] swyx: Like, how does that change your usage of AI? What if you could throw in the entire genetic sequence of a human and synthesize new drugs? Like, how does that change things? We don't know, because we haven't had access to this capability being so cheap before. And that's the ultimate promise of these architectures.[00:50:28] swyx: They're not there yet, but we're seeing very, very good progress. RWKV and Mamba are probably the two leading examples, both of which are open source, so you can try them today, and there's a lot of progress there. And the main thing I'll highlight for RWKV is that at the 7B level, they seem to have beat Llama 2 in all benchmarks that matter at the same size, for the same amount of training, as an open source model.[00:50:51] swyx: So that's exciting. You know, they're 7B now, they're not at 70B, so we don't know yet whether that holds at larger scale. And then the other thing is diffusion. Diffusion and transformers are kind of on a collision course. The original Stable Diffusion already used transformers in parts of its architecture.[00:51:06] swyx: It seems that transformers are eating more and more of those layers, particularly the sort of VAE layer. So the Diffusion Transformer is what Sora is built on. The guy who wrote the Diffusion Transformer paper, Bill Peebles, is the lead tech guy on Sora. So you'll just see a lot more Diffusion Transformer stuff going on.[00:51:25] swyx: But there's more sort of experimentation with diffusion. I'm holding a meetup actually here in San Francisco that's gonna be like the state of diffusion, which I'm pretty excited about. Stability's doing a lot of good work. And if you look at the architecture of how they're creating Stable Diffusion 3, Hourglass Diffusion, and the consistency models, or SDXL Turbo.[00:51:45] swyx: All of these are very, very interesting innovations on the original idea of what Stable Diffusion was. So if you think that it is expensive or slow to create Stable Diffusion or AI-generated art, you are not up to date with the latest models. If you think it is hard to create text in images, you are not up to date with the latest models.[00:52:02] swyx: And people still are kind of far behind. The last piece, which is the wildcard I always kind of hold out, is text diffusion. So instead of using autoregressive transformers, can you use diffusion for text? Can you use diffusion models to diffuse and create entire chunks of text all at once instead of token by token?[00:52:22] swyx: And that is something that Midjourney confirmed today, because it was only rumored the past few months, but they confirmed today that they were looking into it.
So all those things are very exciting new model architectures that are maybe something you'll see in production two to three years from now.[00:52:37] So a couple of the trends[00:52:38] NLW: that I want to just get your takes on, because they seem like they're coming up, are, one, sort of these wearable, you know, kind of passive AI experiences, where they're absorbing a lot of what's going on around you and then kind of bringing things back.[00:52:53] NLW: And then the other one that I wanted to see if you guys had thoughts on is sort of this next generation of chip companies. Obviously there's a huge amount of emphasis on hardware and silicon and different ways of doing things, but, y
Did you know that the concept of deep learning goes way back to the 1950s? However, it is only in recent years that this technology has created a tremendous amount of buzz (and for good reason!). A subset of machine learning, deep learning is inspired by the structure of the human brain, making it fascinating to learn about. In this episode, Lois Houston and Nikita Abraham interview Senior Principal OCI Instructor Hemant Gahankari about deep learning concepts, including how Convolution Neural Networks work, and help you get your deep learning basics right. Oracle MyLearn: https://mylearn.oracle.com/ Oracle University Learning Community: https://education.oracle.com/ou-community LinkedIn: https://www.linkedin.com/showcase/oracle-university/ X (formerly Twitter): https://twitter.com/Oracle_Edu Special thanks to Arijit Ghosh, David Wright, Himanshu Raj, and the OU Studio Team for helping us create this episode. -------------------------------------------------------- Episode Transcript: 00:00 Welcome to the Oracle University Podcast, the first stop on your cloud journey. During this series of informative podcasts, we'll bring you foundational training on the most popular Oracle technologies. Let's get started! 00:26 Lois: Hello and welcome to the Oracle University Podcast. I'm Lois Houston, Director of Innovation Programs with Oracle University, and with me is Nikita Abraham, Principal Technical Editor. Nikita: Hi everyone! Last week, we covered the new MySQL HeatWave Implementation Associate certification. So do go check out that episode if it interests you. Lois: That was a really interesting discussion for sure. Today, we're going to focus on the basics of deep learning with our Senior Principal OCI Instructor, Hemant Gahankari. 00:58 Nikita: Hi Hemant! Thanks for being with us today. So, to get started, what is deep learning? Hemant: Deep learning is a subset of machine learning that focuses on training Artificial Neural Networks to solve a task at hand. Say, for example, image classification. A very important quality of the ANN is that it can process raw data like pixels of an image and extract patterns from it. These patterns are treated as features to predict the outcomes. Let us say we have a set of handwritten images of digits 0 to 9. As we know, everyone writes the digits in a slightly different way. So how do we train a machine to identify a handwritten digit? For this, we use ANN. ANN accepts image pixels as inputs, extracts patterns like edges and curves and so on, and correlates these patterns to predict an outcome. That is what digit does the image has in this case. 02:04 Lois: Ok, so what you're saying is given a bunch of pixels, ANN is able to process pixel data, learn an internal representation of the data, and predict outcomes. That's so cool! So, why do we need deep learning? Hemant: We need to specify features while we train machine learning algorithm. With deep learning, features are automatically extracted from the data. Internal representation of features and their combinations is built to predict outcomes by deep learning algorithms. This may not be feasible manually. Deep learning algorithms can make use of parallel computations. For this, usually data is split into small batches and process parallelly. So these algorithms can process large amount of data in a short time to learn the features and their combinations. This leads to scalability and performance. 
In short, deep learning complements machine learning algorithms for complex data for which features cannot be described easily. 03:13 Nikita: What can you tell us about the origins of deep learning? Hemant: Some of the deep learning concepts like artificial neuron, perceptron, and multilayer perceptron existed as early as 1950s. One of the most important concept of using backpropagation for training ANN came in 1980s. In 1990s, convolutional neural network were also introduced for image analysis task. Starting 2000, GPUs were introduced. And 2010 onwards, GPUs became cheaper and widely available. This fueled the widespread adoption of deep learning uses like computer vision, natural language processing, speech recognition, text translation, and so on. In 2012, major networks like AlexNet and Deep-Q Network were built. 2016 onward, generative use cases of the deep learning also started to come up. Today, we have widely adopted deep learning for a variety of use cases, including large language models and many other types of generative models. 04:29 Lois: Hemant, what are various applications of deep learning algorithms? Hemant: Deep learning algorithms are targeted at a variety of data and applications. For data, we have images, videos, text, and audio. For images, applications can be image classification, object detection, and so on. For textual data, applications are to translate the text or detect a sentiment of a text. For audio, the applications can be music generation, speech to text, and so on. 05:08 Lois: It's important that we select the right deep learning algorithm based on the data and application, right? So how do we do that? Hemant: For image task like image classification, object detection, image segmentation, or facial recognition, CNN is a suitable architecture. For text, we have a choice of the latest transformers or LSTM or even RNN. For generative tasks like text summarization, question answering, transformers is a good choice. For generating images, text to image generation, transformers, GANs, or diffusion models are available choice. 05:51 Nikita: Let's dive a little deeper into Artificial Neural Networks. Can you tell us more about them, Hemant? Hemant: Artificial Neural Networks are inspired by the human brain. They are made up of interconnected nodes called as neurons. Nikita: And how are inputs processed by a neuron? Hemant: In ANN, we assign weights to the connection between neurons. Weighted inputs are added up. And if the sum crosses a specified threshold, the neuron is fired. And the outputs of a layer of neuron become an input to another layer. 06:27 Lois: Hemant, tell us about the building blocks of ANN so we understand this better. Hemant: So first, building block is layers. We have input layer, output layer, and multiple hidden layers. The input layer and output layer are mandatory. And the hidden layers are optional. The second unit is neurons. Neurons are computational units, which accept an input and produce an output. Weights determine the strength of connection between neurons. So the connection could be between input and a neuron, or it could be between a neuron and another neuron. Activation functions work on the weighted sum of inputs to a neuron and produce an output. Additional input to the neuron that allows a certain degree of flexibility is called as a bias. 07:27 Nikita: I think we've got the components of ANN straight but maybe you should give us an example. 
You mentioned this example earlier…of needing to train ANN to recognize handwritten digits from images. How would we go about that? Hemant: For that, we have to collect a large number of digit images, and we need to train ANN using these images. So, in this case, the images consist of 28 by 28 pixels which act as input layer. For the output, we have neurons-- 10 neurons which represent digits 0 to 9. And we have multiple hidden layers. So, in this case, we have two hidden layers which are consisting of 16 neurons each. The hidden layers are responsible for capturing the internal representation of the raw image data. And the output layer is responsible for producing the desired outcomes. So, in this case, the desired outcome is the prediction of whether the digit is 0 or 1 or up to digit 9. So how do we train this particular ANN? So the first thing we use the backpropagation algorithm. During training, we show an image to the ANN. Let us say it is an image of digit 2. So we expect output neuron for digit 2 to fire. But in real, let us say output neuron of a digit 6 fired. 09:12 Lois: So, then, what do we do? Hemant: We know that there is an error. So to correct an error, we adjust the weights of the connection between neurons based on a calculation, which we call as backpropagation algorithm. By showing thousands of images and adjusting the weights iteratively, ANN is able to predict correct outcome for most of the input images. This process of adjusting weights through backpropagation is called as model training. 09:48 Do you have an idea for a new course or learning opportunity? We'd love to hear it! Visit the Oracle University Learning Community and share your thoughts with us on the Idea Incubator. Your suggestion could find a place in future development projects! Visit mylearn.oracle.com to get started. 10:09 Nikita: Welcome back! Let's move on to CNN. Hemant, what is a Convolutional Neural Network? Hemant: CNN is a type of deep learning model specifically designed for processing and analyzing grid-like data, such as images and videos. In the ANN, the input image is converted to a single dimensional array and given as an input to the network. But that does not work well with the image data because image data is inherently two dimensional. CNN works better with two dimensional data. The role of the CNN is to reduce the image into a form, which is easier to process and without losing features, which are critical for getting a good prediction. 10:53 Lois: A CNN has different layers, right? Could you tell us a bit about them? Hemant: The first one is input layer. Input layer is followed by feature extraction layers, which is a combination and repetition of multiple feature extraction layers, including convolutional layer with ReLu activation and a pooling layer. And this is followed by a classification layer. These are the fully connected output layers, where the classification occurs as output classes. The feature extraction layers play a vital role in image classification. 11:33 Nikita: Can you explain these layers with an example? Hemant: Let us say we have a robot to inspect a house and tell us what type of a house it is. It uses many tools for this purpose. The first tool is a blueprint detector. It scans different parts of the house, like walls, floors, or windows, and looks for specific patterns or features. The second tool is a pattern highlighter. This tool marks areas detected by the blueprint detector. The next tool is a summarizer. 
It tries to capture the most significant features of every room. The next tool is house expert, which looks at all the highlighted patterns and features, and tries to understand the house. The next tool is a guess maker. It assigns probabilities to the different possible house types. And finally, the quality checker randomly checks different parts of the analysis to make sure that the robot doesn't rely too much on any single piece of information. 12:40 Nikita: Ok, so how are you mapping these to the feature extraction layers? Hemant: Similar to blueprint detector, we have a convolutional layer. This layer applies convolutional operations to the input image using small filters known as kernels. Each filter slides across the input image to detect specific features, such as edges, corners, or textures. Similar to pattern highlighter, we have a activation function. The activation function allows the network to learn more complex and non-linear relationships in the data. Pooling layer is similar to room summarizer. Pooling helps reduce the spatial dimensions of the feature maps generated by the convolutional layers. Similar to house expert, we have a fully connected layer, which is responsible for making final predictions or classifications based on the learned features. Softmax layer converts the output of the last fully connected layers into probability scores. The class with the highest probability is the predicted class. This is similar to the guess maker. And finally, we have the dropout layer. This layer is a regularization technique used to prevent overfitting in the network. This has the same role as that of a quality checker. 14:05 Lois: Do CNNs have any limitations that we need to be aware of? Hemant: Training CNNs on large data sets can be computationally expensive and time consuming. CNNs are susceptible to overfitting, especially when the training data is limited or imbalanced. CNNs are considered black box models making it difficult to interpret. And CNNs can be sensitive to small changes in the input leading to unstable predictions. 14:33 Nikita: And what are the top applications of CNN? Hemant: One of the most widely used applications of CNNs is image classification. For example, classifying whether an image contains a specific object, say cat or a dog. CNNs are used for object detection tasks. The goal here is to draw bounding boxes around objects in an image. CNNs can perform pixel level segmentation, where each pixel in the image is labeled to represent different objects or regions. CNNs are employed for face recognition tasks as well, identifying and verifying individuals based on facial features. CNNs are widely used in medical image analysis, helping with tasks like tumor detection, diagnosis, and classification of various medical conditions. CNNs play an important role in the development of self-driving cars, helping them to recognize and understand the road traffic signs, pedestrians, and other vehicles. And CNNs are applied in analyzing satellite images and remote sensing data for tasks, such as land cover classification and environmental monitoring. 15:50 Nikita: Hemant, let's talk about sequence models. What are they and what are they used for? Hemant: Sequence models are used to solve problems, where the input data is in the form of sequences. The sequences are ordered lists of data points or events. The goal in sequence models is to find patterns and dependencies within the data and make predictions, classifications, or even generate new sequences. 
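To connect Hemant's house-inspection analogy to concrete layers, here is a minimal PyTorch-style sketch of a small CNN image classifier with the pieces he listed: convolution plus ReLU, pooling, a fully connected classifier, dropout, and softmax. The input size (28 by 28 grayscale) and all channel and layer sizes are illustrative assumptions, not values from the episode.

```python
import torch.nn as nn

# A small CNN for 10-class classification of 28x28 grayscale images.
# All sizes below are illustrative assumptions.
cnn = nn.Sequential(
    # Convolution + ReLU: detect local features such as edges and textures
    # (the "blueprint detector" and "pattern highlighter").
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    # Pooling: shrink the feature maps while keeping the strongest responses
    # (the "summarizer"). 28x28 -> 14x14.
    nn.MaxPool2d(kernel_size=2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),  # 14x14 -> 7x7
    nn.Flatten(),
    # Fully connected layer: combine the learned features into a prediction
    # (the "house expert").
    nn.Linear(32 * 7 * 7, 128),
    nn.ReLU(),
    # Dropout: regularization against overfitting (the "quality checker").
    nn.Dropout(p=0.5),
    nn.Linear(128, 10),
    # Softmax: turn scores into class probabilities (the "guess maker").
    nn.Softmax(dim=1),
)
```

In practice the softmax is usually folded into the loss function (for example, cross-entropy) during training; it is kept as an explicit layer here only to mirror the layer-by-layer description.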
16:17 Lois: Can you give us some examples of sequence models? Hemant: Some common examples of the sequence models are in natural language processing, deep learning models are used for tasks, such as machine translation, sentiment analysis, or text generation. In speech recognition, deep learning models are used to convert a recorded audio into text. In deep learning models, can generate new music or create original compositions. Even sequences of hand gestures are interpreted by deep learning models for applications like sign language recognition. In fields like finance or weather prediction, time series data is used to predict future values. 17:03 Nikita: Which deep learning models can be used to work with sequence data? Hemant: Recurrent Neural Networks, abbreviated as RNNs, are a class of neural network architectures specifically designed to handle sequential data. Unlike traditional feedforward neural network, RNNs have a feedback loop that allows information to persist across different timesteps. The key features of RNN is their ability to maintain an internal state often referred to as a hidden state or memory, which is updated as the network processes each element in the input sequence. The hidden state is then used as input to the network for the next time step, allowing the model to capture dependencies and patterns in the data that are spread across time. 17:58 Nikita: Are there various types of RNNs? Hemant: There are different types of RNN architecture based on application. One of them is one to one. This is like feed forward neural network and is not suited for sequential data. A one to many model produces multiple output values for one input value. Music generation or sequence generation are some applications using this architecture. A many to one model produces one output value after receiving multiple input values. Example is sentiment analysis based on the review. Many to many model produces multiple output values for multiple input values. Examples are machine translation and named entity recognition. RNN does not perform that well when it comes to capturing long term dependencies. This is due to the vanishing gradients problem, which is overcome by using LSTM model. 19:07 Lois: Another acronym. What is LSTM, Hemant? Hemant: Long Short-Term memory, abbreviated as LSTM, works by using a specialized memory cell and a gating mechanisms to capture long term dependencies in the sequential data. The key idea behind LSTM is to selectively remember or forget information over time, enabling the model to maintain relevant information over long sequences, which helps overcome the vanishing gradients problem. 19:40 Nikita: Can you take us, step-by-step, through the working of LSTM? Hemant: At each timestep, the LSTM takes an input vector representing the current data point in the sequence. The LSTM also receives the previous hidden state and cell state. These represent what the LSTM has remembered and forgotten up to the current point in the sequence. The core of the LSTM lies in its gating mechanisms, which include three gates: the input gate, the forget gate, and the output gate. These gates are like the filters that control the flow of information within the LSTM cell. The input gate decides what new information from the current input should be added to the memory cell. The forget gate determines what information in the current memory cell should be discarded or forgotten. 
The output gate regulates how much of the current memory cell should be exposed as the output of the current time step. Using the information from the input gate and forget gate, the LSTM updates its cell state. The LSTM then uses the output gate to produce the current hidden state, which becomes the output of the LSTM for the next time step. 21:12 Lois: Thank you, Hemant, for joining us in this episode of the Oracle University Podcast. I learned so much today. If you want to learn more about deep learning, visit mylearn.oracle.com and search for the Oracle Cloud Infrastructure AI Foundations course. And remember, the AI Foundations course and certification are free. So why not get started now? Nikita: Right, Lois. In our next episode, we will discuss generative AI and language learning models. Until then, this is Nikita Abraham… Lois: And Lois Houston signing off! 21:45 That's all for this episode of the Oracle University Podcast. If you enjoyed listening, please click Subscribe to get all the latest episodes. We'd also love it if you would take a moment to rate and review us on your podcast app. See you again on the next episode of the Oracle University Podcast.
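For readers who want to see the gate mechanism described above in code, here is a minimal NumPy sketch of a single LSTM time step. The parameter layout (one weight matrix per gate) and the names are illustrative assumptions; real implementations usually fuse the four gates into one matrix multiplication.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x_t    : current input vector, shape (input_size,)
    h_prev : previous hidden state, shape (hidden_size,)
    c_prev : previous cell state (the memory), shape (hidden_size,)
    W, U, b: dicts keyed by 'i', 'f', 'o', 'g' holding the input weights,
             recurrent weights, and biases for each gate (illustrative layout).
    """
    # Input gate: how much new information to write into the memory cell.
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])
    # Forget gate: how much of the existing memory cell to discard.
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])
    # Output gate: how much of the memory cell to expose as the hidden state.
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])
    # Candidate values that could be added to the memory cell.
    g_t = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])

    # Update the cell state using the forget and input gates.
    c_t = f_t * c_prev + i_t * g_t
    # New hidden state: the LSTM's output for this time step.
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

Called in a loop over a sequence, h_t and c_t are fed back in as h_prev and c_prev at the next time step, which is how the cell carries relevant information across long sequences and avoids the vanishing gradient problem of a plain RNN.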
Our first ever demo day aimed for 15-20 people and ended up ballooning to >200 and covered in the news. We are now running the 2024 edition in SF on Feb 23: Latent Space Final Frontiers, a startup and research competition in “The Autonomous Workforce”, ”Beyond Transformers & GPUs”, and “Embodied AI”. RSVP here! You can find all LS online/IRL events on our new calendar. Super Early Bird tickets have just gone on sale for AI Engineer World's Fair, June 25-27!Today we have the honor of hosting two of Together AI's co-founders: Ce Zhang (CTO) and Vipul Ved Prakash (CEO). This is a rare opportunity to recap the history of the company since our last check-in with Tri Dao (Chief Scientist), some of their big releases, and do a deep dive into the state of the AI inference market. Together has emerged as one of the most consequential new startups in the new AI summer, last announcing a ~$100m Series A raise in November (at a ~$360-565m valuation). But there are at least three Togethers - Together the Research Lab, Together the Fine Tuning & Inference platform, and Together the custom models service. As we clarify on the pod, the overarching philosophy of Together is the ability to improve on all these fronts simultaneously by being “full stack”, from the lowest level kernel and systems programming to the highest level mathematical abstractions driving new model architectures and inference algorithms.Bringing Research and Industry TogetherIn just one year, Together has been behind some of the most exciting research in AI:* RedPajama, a fully open source dataset for model pre-training which mirrored the Llama1 recipe. Then followed by RedPajama2, a 30T tokens dataset of filtered and de-duplicated tokens. * RedPajama-INCITE-3B and 7B, which were SOTA in a few benchmarks at the time of release. * FlashAttention-2, developed by Together's Chief Scientist Tri Dao. We covered FA-2 in a previous episode with him.* Mamba-3B, the most promising transformer-alternative model that they released in collaboration with Cartesia. * StripedHyena, a SOTA graft of Hyena state space models and transformer models together* Medusa, an alternative to speculative decoding that lets you use multiple decoding heads instead of a draft model. * MonarchMixer, which was one of the most popular orals at NeurIPS 2023. It's an approach to transformers that replaces many of its core parts with Monarch matrices for better computational efficiency. And I'm sure we missed something! As Vipul reveals, almost 50% of Together staff is researchers, and two of their co-founders (Chris Ré and Percy Liang) are professors at Stanford, so we can expect a lot more here.Bringing “Disaggregated” GPUs TogetherOn their cloud, they offer inference as a service, fine-tuning, pre-training, etc, but unlike other providers they think of themselves as a disaggregated cloud. Today, they have ~8,000 A100 and H100 GPUs on their platform (an exclusive revealed on the pod!) totaling over 20 exaflops of compute, but instead of just buying more and putting them in a cluster and then exposing a `us-east-1` option for customers, they are taking heterogenous compute sources and adding a unified layer on top of it for developers to consume. Building on Ce's research, Together's GPU Clusters are taking on comparable AWS and GCP offerings in both cost and speed:Take the Hessian AI center in Germany or the DoE's INCITE; they have GPUs that they want to share with researchers, but they lack the cloud layer over it. 
Similarly, there's starting to be more and more differentiation amongst types of GPUs: H100s, A100s, MI300s, etc. Each of them has different availability and performance based on task, and the end user shouldn't have to be a hardware expert to run inference on a model, so Together abstracts a lot of that away.A big theme of the Together inference stack, a "bag of 50 tricks" that we discuss on the pod, is also "hardware-aware" algorithms like FlashAttention and Mamba, which further emphasize the benefits of co-developing everything together:Special Focus: Transformer AlternativesAs we mentioned above, they are also funding a lot of research in Transformer alternatives. To reiterate a few points on why they matter:* Longer context is not the motivation for sub-quadratic architectures: Transformers don't inherently have hard limitations on context size, but they just get extremely expensive. When developing sub-quadratic alternatives, you easily enable very long context, but that's not how you should compare them. Even at the same context size, inference and training are much cheaper on sub-quadratic architectures like Hyena.* Emergence of hybrid architectures: a lot of early conversations have been around the "post-Transformers" era, but it might be more like "half-Transformers". Hybrid architectures could have split layers, with some transformer-based and some state-space ones. One of the challenges is that a lot of hardware kernels are optimized for transformer operations, so you'd lose a lot by moving away completely.* Higher speed = higher GPU throughput: if we could reach the same benchmark performance on subquadratic architectures, it'd solve a lot of the GPU crunch. Today we peak at ~170 tok/s on inference in some open models; if we could reach 5,000 tok/s on the same card, you'd be able to serve 30x more customers on the same hardware. As a cloud provider, you're obviously incentivized to get there.We had a lot of fun chatting with the Together guys and we covered a lot of ground, so enjoy the conversation!Note: This is the first episode of a "cloud providers mini-series". 
We have Erik from Modal and Ben from Replicate coming up next!Video PodcastJoin us to watching the video version of this pod on our snazzy YouTube!Show Notes* Together AI* RedPajama Dataset v1 Announcement* RedPajama Models v1 Announcement* Together Embeddings* StripedHyena-7B* Mamba-3B-SlimPJ* Vipul's X thread on Anyscale* Vipul's Razor* SemiAnalysis' "Inference Race to the Bottom" post* Chris Ré* Mike Conover's episode* Slim Pajama by Cerebras* Dolma by AI2* Jina AI* Tengyu's Voyage AITimestamps* [00:00:00] Introductions* [00:00:43] Origin and current state of Together.ai* [00:02:15] Transition from Apple to Together and the vision for open AI* [00:04:54] How Chris Ré introduced Ce and Vipul* [00:08:43] How RedPajama came to be* [00:13:34] Model training and Transformer alternatives* [00:15:37] DSIR and the importance of data in LLMs* [00:21:19] Inference vs Fine-tuning vs Pre-training usage on Together* [00:23:20] Together's GPU stash* [00:27:02] Why standardization of inference metrics is important* [00:29:26] Building moats in AI inference* [00:31:49] Federated vs disaggregated cloud computing* [00:34:57] Opportunities for improvement in the inference stack* [00:36:13] Anyscale benchmarking drama* [00:41:27] Not just an inference platform* [00:43:50] Together Embeddings and the future of embedding models* [00:45:53] State space models and hybrid architectures* [00:53:52] The need for 5,000 tokens/s speed in AI inference* [01:00:23] What's the most interesting unsolved question in AI?TranscriptAlessio [00:00:00]: Hey, everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:14]: Hey, and today we're together with Together. Welcome to the studio, guys.Ce / Vipul [00:00:20]: Thank you.Swyx [00:00:21]: I don't know how you typically give self intros, but does anyone want to go first? How do we get our audience acquainted, especially to who's speaking, because it's unusual for us to do a four-person pod. Yeah.Ce [00:00:33]: Hi, everyone. I'm Ce. I'm one of the co-founders of Together and the CTO, working with the team on technical things.Vipul [00:00:40]: I'm Vipul Ved Prakash, co-founder and CEO of Together.Swyx [00:00:43]: I always consider you guys as one of the sort of all-in-one companies. I always want to say labs, but I feel like you're not a lab. What is the sort of origin of Together, and then what is it today? I feel like it used to be Together.xyz, and then now you're Together.ai.Vipul [00:01:00]: I think fundamentally, Together is about open and independent AI systems. We think this is one of the most consequential technologies of our time, and when we started the company in June 2022, our focus was to build a platform for open source, independent, user-owned AI systems. One way to think about it is big labs, frontier model labs, have built their own platforms for developer platforms for their models. We think of Together as a platform for everything else, whether these are open models, whether these are models being built by companies that are owned by them. Our sort of XYZ roots, we have a fairly deep decentralization and open ethos that kind of reflects in all our platform and strategy and business. 
And we also, the way we structure our cloud is by combining data centers around the world instead of, you know, we are today not located in hyperscalers, we have built a footprint of AI supercomputers in this sort of very disaggregated, decentralized manner.Alessio [00:02:15]: I know before Together, you were at Apple, so you go from like the most walled garden, private, we-don't-say-anything company, to we want everything to be open and everybody to know somebody. What maybe did you learn from like the Apple way of being super closed and polished, and maybe what are you taking now to Together to make it open, but also a very nice developer experience?Vipul [00:02:37]: Yeah, I would say, you know, my background has been in open source for a long time. One of the first things I created was a collaborative spam filter, you know, this was back in the day. It's called Vipul's Razor. And it became quite popular. And the first company I founded, called Cloudmark, was built around, you know, taking open source and building both an open side of it and a commercial product around it. I think Apple is sort of very focused on providing this amazing experience to its customers with, you know, most of the technology sort of hidden behind the product. And certainly the focus on fluidity and applying complex technology to make everyday things simple is something that Apple does really well. And, you know, that's been a sort of big part of how we think about our developer platforms. I think it informs it. The other thing is that during my years at Apple, we worked a lot on deep learning. And one of the things that was sort of very viscerally accessible to me was how well these systems worked. We built an open domain Q&A system. This was based on Facebook's LSTM paper in 2016. And it was remarkable because we had a parallel system based on sort of information retrieval techniques, which is extremely complicated, that didn't work that well. And this thing we wrote in a week had just incredible performance. So I think some of those experiences, at least for me personally, sort of were creating this roadmap of how important and powerful this technology is. And, you know, when the scaling laws paper was published, it was very clear to me, like it was in some ways something very profound. We've never had algorithms that improve in capabilities with scale out. So this is almost a new era of computing. So that's been, I think, the influence of Apple; my years at Apple really for me, like, crystallized the value of what we are doing together.Alessio [00:04:54]: And how did you decide to join forces? Because you did a postdoc with Chris Ré at Stanford. You know, we already had Tri Dao from Together and we talked about Hazy. What was like the meeting of the minds of, hey, I come from like the more technical postdoc assistant professor background, and we've got yet a more product thing. What got you excited to like build this now?Ce [00:05:15]: So we have been working on this, Chris and I, essentially for the last like 10 years, right? So a machine learning system 10 years ago was like probabilistic graphical models, right? And then convolutional neural networks, and then all the foundation models that we see today. But if you look at this, I think that fundamentally the thing we are actually optimizing is actually not that different. It's always about data movement across essentially all the stacks, right? 
So when you do distributed like computing, it's about communication across different machines. When you do, for example, flash attention, it's about data movement at a different, essentially, memory hierarchy, right? So we have been doing this in the last 10 years and seeing the field start to grow, grow, grow. So we kind of feel this current wave of technology is actually the perfect time to bring all the research essentially into something real. And we are super lucky that we got introduced to Vipul, right? And then we hoped to join forces and bring this to the real world.Swyx [00:06:10]: It's an unusual team of like sort of research and industry. Like you've been like a third or fourth time founder now. Third time founder, yeah. And so like what is your first order of business when you like set up Together? Like how do you sort of put something like this together? Oh my God, I'm going to use this word so much.Vipul [00:06:27]: I feel AI companies are really kind of driven by research. And Chris and I had been talking about how to reduce the cost of building models. We felt that there aren't really big data moats around foundation models. They are built from a subset of the web. What is difficult is the cost of capital to build these. And one of the ways in which you can reduce this cost is by making more efficient systems. With that, it was really about finding the right set of co-founders and team. In fact, when Chris introduced me to Ce, and I think within the first five minutes of talking to Ce, I was like, we are starting this company. And our early focus was thinking about this more sort of disparate set of resources, you know, GPUs around the internet. Can we use those to build? And we really have to compress communication for, you know, when we do gradient averaging, there's just a lot of traffic. And if you can reduce that somehow, you sort of open up the possibility of using cheaper compute, you know, across the network. And Ce's research for a decade has been in that subject. You know, and from there, finding, you know, other folks in the network, I think there is generally a lot of excitement and philosophical alignment around what we are doing, which, you know, we publish papers, we publish open source libraries and code, we build open models. And I think the people in academia in, you know, machine learning and NLP, that's really what they want to do. So I think that's been really a kind of kernel for, you know, the composition of the company. And we're lucky to have, you know, at this point, attracted some of the best researchers in the field. So I think that's the most important thing. And, you know, the rest of it is sort of driven by us. A couple of these philosophies around independent systems and decentralization and good developer interfaces, you want to make it accessible. That's, you know, just as important. And the rest follows from there, I think.Alessio [00:08:43]: I want to try and fill in some of the blanks in the history of Together. I think people come on your website today and they say, you raised a hundred million dollars Series A. They're like, wow, these guys are like a super legit company. But it feels like Red Pajama just came out a year ago. I remember we had Mike Conover in the studio, who had built Dolly at Databricks. And you announced it literally the morning we were recording. So we're like in the studio on our phones, looking at it. 
And it's like, wow, this is like the first time now there's like a good curated dataset to do open pre-training. So maybe let's start from there. Like, what was the motivation behind it? Why did you decide to do that? Datasets are one of the things that most people don't want to work on. They just want to do models, not datasets.Ce [00:09:27]: Yeah. So, yeah, the first thing is it's not the first, right? So I think it's actually built on a whole bunch of amazing effort the community already had. For example, EleutherAI has the Pile, right? There's a whole bunch of amazing datasets they have, like C4, right, from Google, right? So we really got inspired by the impact those datasets have on the community, right? So I think when we did Red Pajama, it was a time that people were really fascinated by Llama, the model, like Llama 1, right? Which feels like decades ago, right? But people were really excited about the quality, right? So that's really like a big shift in how people think about open models. People start to see hope, right? But the one problem with Llama is the data recipe is described in a pretty detailed way in the paper, but the data is actually not there. So our original thinking was, how about we take the recipe and we try to do our best-effort reproduction and try to put it out, such that we can learn from our mistakes in the reproduction together, right? So that's essentially the original thinking behind Red Pajama. And we have been pretty happy and excited about what the community has built on it. For example, there's a dataset called SlimPajama, right? Which does deduplication over our data, right?Swyx [00:10:38]: From Cerebras, did they talk to you before?Ce [00:10:39]: Oh, yeah, yeah, yeah, yeah. So, yeah, we are very good friends, so we can discuss the technical perspective. We are pretty excited because I think the reason we did Red Pajama in the first place is that people can actually build not only models, but also datasets, essentially, over that piece of artifact, right? So that's actually what inspired us to do the first version of the Red Pajama dataset.Swyx [00:11:01]: Yeah, and then you released V2 maybe two months ago.Ce [00:11:04]: Yeah.Swyx [00:11:05]: 30 trillion tokens.Ce [00:11:06]: Yeah, 30 trillion tokens. So I think what's exciting about Red Pajama V2 is not only the number of tokens, but we started to learn from Red Pajama V1. So one thing that we learned was that data quality is really the core, right? So you want to take this couple-trillion-token dataset and try to bring it down maybe to one trillion or two trillion, right? The way that you actually filter it, deduplicate it, is not something that is pre-decided before you see the application, right? So you kind of want to have a modular framework to think about data quality, right? So given an application, let's automatically, or maybe semi-automatically, try to come up with a way to filter it down. So that's why in Red Pajama V2, we overlay the dataset with like 40 different pre-computed quality signals, right? If you want to reproduce your best-effort, like, C4 filter, it's kind of like 20 lines of code, right? And this opens up this opportunity where you can actually put different filters together, learn the combination of filters. We are very excited to see what the community actually comes up with using Red Pajama V2.Swyx [00:12:11]: It was retrospectively so obvious that this is a good idea that I wonder how come more datasets don't do this. 
You release the dataset with all these toggles that you can turn on and off, right? And you can sort of tune up and down the quality in ways that you believe are important to you. Yeah, I just, it makes so much sense now in retrospect. Because everyone just publishes like their pipeline and then the end result. But what about all the intermediate stages? Yeah.Ce [00:12:35]: Yeah, so I think, so there are multiple things there. I don't think we are the only one doing that. For example, like Dolma from AI2, right? They have this very flexible format to actually put in those quality signals, right? I think we are actually calling them the same, right? So you can actually load Red Pajama using their tool. That whole thing should work, right? So I think one fundamental thing that changed in the last year, essentially, is that in the beginning, when people thought about data, it was always like a byproduct of the model, right? You release the model, you also release the data, right? The dataset is there essentially to show people, ah, if you train on this data, you'll get a good model. But I think what started to change is when people started building more and more of those models, people started to realize that different subsets of the dataset are kind of valuable for different applications, right? The data becomes something to play with, right? So I think we are kind of lucky that we happened to release Red Pajama right at that point, and we got this opportunity to actually learn from that.Alessio [00:13:34]: And you guys have a custom model training platform on Together, too. You have a bunch of stuff in there for data selection, like DSIR and things like that. How did you decide to work on that versus, because you first started with like some of the fine-tunes on Llama. Do you see a lot of interest there? And I know you've been doing a lot of research on state space models and other transformer alternatives. Like, do you also see that as something you'll keep working on this year and push more people towards?Vipul [00:14:02]: Yeah, I mean, we, you know, we think of how to make training more efficient and building models more efficient. Part of that is being able to select the right dataset. This is why you have signals, DSIR. You can start with a small dataset and find similar documents, build models with that. So we think it's an important part of the kind of model-building tooling that, you know, is sort of widely useful for people building different kinds of models. Similarly, you know, we are running into the limits of how fast you can make transformers. And we want inference at 5,000 tokens per second. I don't think we will get there with transformers, and we need to learn longer sequences. Data, again, becomes very, very expensive with transformers. So we work on state space models and all the research that we are doing there. And hopefully other labs will pick up on this and make it a kind of important target for optimization. But we think that, you know, open source is a great place for this. We can provide these recipes for data and for training to our customers who are building, you know, custom models themselves. And, you know, we are quite excited about the sort of progress we are seeing there.Alessio [00:15:18]: Do you have some of these models available for inference on Together? 
Can people play around with, say, StripedHyena, you know?Swyx [00:15:25]: Yeah.Vipul [00:15:25]: Yeah, they're available for inference on our serverless platform.Swyx [00:15:29]: I always try to be the person who asks about acronyms in case, you know, people want to understand. Should we explain importance resampling, you know, that kind of stuff?Ce [00:15:37]: Oh, yeah. So DSIR, essentially, it's a fundamental idea. So it's one of the papers from Percy, right? So essentially, if you know what you are doing, you can actually use that as a very strong signal about what data to put into the training process, right? So that's essentially the fundamental idea, right? And then more concretely, right, there are actually different versions of DSIR, right? So one version is, if you have a validation set, right, you can actually somehow measure the similarity between the validation set and also your pre-training corpus, and essentially subset it. And often there's actually a less targeted version of DSIR where you say, yeah, maybe Wikipedia is actually a very good corpus. Let's try to find more Wikipedia, right? And you can think about it in two ways, either as a way to come up with different weights for different data slices. Yeah, so as like a filter type of step for a dataset, or think about it as data augmentation. So that's how, yeah, that's how we think about DSIR.Swyx [00:16:33]: That makes sense. I will have to read the paper to understand a little bit more. Because when you say things like, we have to know in advance what we were trying to do with the model, then we do importance resampling. That is against the principle of general intelligence, right? Like the point is to train AGI.Ce [00:16:48]: Yeah, so it depends on what you mean by being general or generic, right? So I think, I mean, you can always take a meta-learning perspective that we know the distribution of tasks that we care about, right? So you can always go kind of up the ladder of how general the whole thing is, right? But also, for many of the customers that we are actually talking to, right, they have kind of very targeted applications, right? The benefit you can get out of that is you could build a better open model, often smaller, often easier to do inference on, if you know what you want, right? So I think the whole trade-off would be, the x-axis would be how generic the whole thing will be. The y-axis would be not only the top accuracy, but also a whole bunch of the deployment costs, right? The size of the model, right? The robustness of the model. So I think different people will navigate the space in different ways. And we want to be the platform where, essentially, whatever point you want, we have a solution for you.Swyx [00:17:43]: One more thing on data before we go deeper on state-space models. Are we running out of data? Can we go in order of magnitude? Can we go five orders of magnitude? How do both of you think about how much data we have and how much we need?Ce [00:17:55]: Yeah, so I think that's a very, very good question. So I don't think we are running out of data on Earth.Swyx [00:18:02]: Right, so think about it globally. Training data, training class data.Ce [00:18:05]: Yeah, yeah, so I think, I mean, some of them are not accessible, right? But I do think there are many organizations in the world that have enough data to actually train very, very good models, right? So, I mean, they are not publicly available, right? 
But there are people who actually have access to those, right? So I think in general, right, if you think about the data in the open space, right, so I guess what you specifically mean is whether we are running out of that. I do think there needs to be some way, right, that people who are training open models get connected with essentially data that's not internet data. So I think that channel needs to be opened up for open models to get more data, right? But I'm kind of on the optimistic side that society will figure out a way that we can train open models beyond this internet data.Swyx [00:18:57]: Beyond internet, meaning books?Ce [00:19:00]: I mean, there are a lot of those, right?Swyx [00:19:02]: Books, right?Ce [00:19:02]: Transcripts, right? Videos, audio, right? So there are a whole bunch of data sources that we are not integrating into the open data side, right? And maybe they shouldn't be open, right? So I think the community needs to figure out, yeah, the best balance, yeah? Such that we can have open models, but on the other hand, also have a reasonable collection of data that we can actually use.Swyx [00:19:29]: I think a lot of people think that, there's a theory that Whisper was released so that you could transcribe YouTube and then use that as a source of tokens. Then I talked to other researchers who are like, you know, YouTube has very low quality tokens. You know, do you want your model to talk like a live streamer from YouTube? Because that's what they're going to do. So it's not clear what the quality of this data could be.Ce [00:19:53]: Yeah, I guess that depends on your application, right? So I think as a platform, right, our goal is, whatever application you have, yeah, we have a platform that you can actually use to achieve your goal, right? So there are definitely applications where it kind of makes sense to speak like YouTube, right? But there are probably also other applications that are kind of more on the formal side, right? So I think there are going to be a diverse collection of models, both open and closed, right? And we kind of want to be the engine that powers that.Swyx [00:20:21]: There's a lot of people who own data sources who are doing the locally optimal thing and humanity as a whole is losing out. So like the New York Times is suing OpenAI, you know, Stack Overflow shut down their API, Reddit shut down their API, X, you know, made their own model, right, on Twitter data. We're just going to have all these tiny little gardens of data that would be useful in a general model, but everyone's just trying to make their own model. And it seems like globally suboptimal.Vipul [00:20:47]: I think you need to have some kind of a marketplace for figuring out how to get this, you know, data into models, and I think we'll increasingly see more of that. You know, I think there's a positive aspect to it too. There is an incentive for creators to participate in a system which is sort of more fair relative to, you know, the capture of value by an AI company that's taking their data. But I agree. I think this is a big open problem that needs to be solved. And I hope there will be, you know, serious efforts around it.Alessio [00:21:19]: Let's talk about the most precious resource on planet earth, GPUs. You have a lot of compute obviously, but you also have a lot of product pieces. You have inference, you have fine tuning, you have pre-training. What's the split in terms of usage? 
Do you see most people are just running inference on off the shelf models? Do you see maybe some last mile fine tuning?Vipul [00:21:40]: I would say right now, the top five models on our inference stack are probably all fine-tuned versions of open models. And we've seen- Who fine-tuned them?Swyx [00:21:51]: You fine-tuned them?Vipul [00:21:52]: They were fine-tuned by our customers.Swyx [00:21:54]: By your customers.Vipul [00:21:55]: You know, either on our platform or off our platform. And we are generally seeing that, you know, that is the sort of trend where you can get better quality on your task by sort of now easily adapting these models to your data. We also have, I would say, over 20 big model builds happening on the platform, which are customer. We see a lot of training and it's also somewhat surprisingly a more continuous kind of workload. We sort of imagine that this would be more episodic. You train a model and then you do inference. But what we find is, you know, we train a model and then they train the next version and then the next version, which sort of grows in scale. I would say training is still the bigger portion. Some ways inference is super linear to model quality. And as the models are getting better, there's more and more inference.Swyx [00:22:48]: Oh, because they're more useful. Yeah, they're more useful, yeah. So, okay, so training is bigger. This is actually consistent with what we've heard from Mosaic, that, you know, people think that training is sort of like a one-time deal. You do one big run and then you're done. It's never true. And so I'm interested in, like, putting some numbers and I don't know what you have disclosed or what you want to disclose, but, like, how many GPUs do you have? What is the equivalent amount of compute that you have? Because I understand that your GPU setup is different than what people typically think of, like, a giant data center somewhere, right?Vipul [00:23:20]: I don't think we have shared this number publicly. It's, you know, so this will be the first time, I guess. Like, we have close to 7,000 to 8,000 GPUs today. It's growing monthly.Swyx [00:23:31]: What class of GPU are they?Vipul [00:23:32]: They're mostly A100s and H100s.Swyx [00:23:35]: Okay.Vipul [00:23:36]: And probably more, I think, split towards H100s now. You know, we'll be sort of building this best-of-class hardware. So as there are other versions of these coming out later this year, we plan to have those in the fleet as well.Alessio [00:23:53]: I know when we talked last year, you were also using some of the supercomputers by the Department of Energy. There was kind of like a lot of random GPU compute in the world. Have you seen that kind of getting timed out? I think maybe a year ago, people were like, oh, yeah, you can use this GPU computer that is going to be end-of-life. Has the bar changed to give access to those resources?Ce [00:24:13]: From our perspective, it's actually getting better. Yeah, so from the community perspective, because many of the institutions in the world, they're actually investing in hardware, right? So for example, we are working with one of the institutes in Germany called Hessian AI, right, which gives us a lot of help on the compute side. So they start to have this very big GPU cluster, and they're actually sharing that with the community, right? And it's not super big, right, but also not a small one, right? So you start to see this, like, different lives that start to pop up, right? 
And because of the power of the community, they start to actually share that. So we actually find that, as a researcher today, it's probably easier to actually get a GPU than last year.Swyx [00:24:56]: Interesting.Alessio [00:24:56]: And then for you to buy them, what's the state of the market right now? Is it still extremely hard to get any? Do you have Jensen's phone number? Do you have like the GM's phone number? Do you guys get like the SDR because you're like under 10,000?Vipul [00:25:12]: NVIDIA is obviously motivated to help us, both as an investor and we are their customers. I would say the market is very tight still, and it's likely going to be this way for a while. My sense is that the demand for AI computing has just kind of ramped up very, very quickly, and it will take a while for supply to catch up.Swyx [00:25:37]: So how tight is it, let's say compared to like a year ago, two years ago? What do you mean when you say tight? The things you want, you can't get?Vipul [00:25:42]: You can't get them immediately. They're sort of, you know, minimally like two to three months out. Any inventory that shows up tends to clear very, very rapidly. And, you know, we obviously sort of look at this in a very detailed and analytical way. There are four to five million GPUs that will be sold this year by NVIDIA and others. And if you think about a 512 to 1,000 GPU cluster for a company, that's 4,000 to 8,000 companies, right? So it's in some ways a very small number. In other ways, the cost of GPUs will be, you know, $80 to $100 billion, and then you layer servers and data center space and electricity on top of that, and that's, you know, close to $250 billion worth of kind of compute, which when you compare it to the cloud computing of today, you know, AWS last year was $88 billion in revenue. So this is really kind of a build-out happening of AI hyperscalers. It is much more disaggregated, and it's very, very global. So, you know, we think that GPUs are going to be sort of a precious resource for a long time, and using them optimally is very valuable.Swyx [00:27:02]: Yeah.Alessio [00:27:02]: Our friend, Dylan Patel from Semianalysis, he wrote a post about the inference market recently and obviously mentioned you guys. In his post, he said, our model indicates that Together is better off using two A100 80-gig systems rather than an H100-based system. The temperature and performance testing also point to Together utilizing speculative decoding. Any thoughts? Is Dylan right? I don't know, what's-Swyx [00:27:26]: What is his model, man? What does he know that they don't know? Yeah, exactly.Alessio [00:27:30]: I wanna know, I guess like from the outside, and sometimes we even do it, we try and speculate on what people are actually doing. So for the first time, now we have a former guest writing about a current guest. So we wanna know what you guys thought and maybe what are some of the misconceptions that people from the outside have on what it takes to run like a GPU cloud today?Vipul [00:27:50]: Yeah, big fan of Dylan's, by the way. I religiously read Semianalysis. I think there were some errors in that analysis. In particular, we were trying to decode it and one of the things we noticed is that it assumed that input tokens weren't being priced. So I think that may have been an error in the model. There's also this assumption that people are running this at a loss; I don't think that's the case. It's very expensive. You can't do that for very long.
And there are trade-offs in terms of batch sizes you use and the kind of tokens per second performance that are kind of system trade-offs. We've done a lot of work. This is one of the key areas of research for us. So our inference stack is a combination of 50 different sort of tricks and techniques and we think there's a lot of room for optimization here. So whichever hardware provides better performance, whether it's H100s or A100s or L40s, we can sort of measure price performance on particular hardware and we tend to use that for that model, or in some cases, certain customers have data streams which can then be optimized for a particular configuration regime. So we do fairly detailed work on how to make this more efficient, and so it's hard, from the outside, to look at memory bandwidth and estimate what's actually happening.Alessio [00:29:26]: How many of these 50 tricks are you keeping to yourself and how many are you gonna open up? Because we have Tri now, and obviously FlashAttention 2 is open source. He mentioned he'd love to come work together because of how much you care about open source. Yeah, how do you weigh that as a CEO and CTO?Vipul [00:29:43]: A lot of it is open, right? FlashAttention, Flash Decoding, et cetera, and when we build something that's generally, universally useful and is going to produce better open source AI, we tend to publish it as open source. I think on the inference stack, there are open source inference stacks which are pretty good, and definitely today, it gives us a competitive advantage to have the best one. So we are not sort of rushing out to release everything about it. It's not overall that additive to open source out there and it is particularly useful as a business for us to provide best price performance. Yeah, we make these decisions. We have discussions. Anything that we keep closed, we generally talk about it quite a bit and decide like this is the piece that is closed for today and it may not be the case six months from now. It may not matter as much.Ce [00:30:40]: Yeah, so I think being open is kind of very important, right? So I think the whole company is actually built on this idea that there's going to be an ecosystem built on our open models, right? And that's also how we are really lucky to attract this top group of talent to actually join us, because of the dream and the mission that we have on our side to really facilitate the open ecosystem, right? So I think in general, it's like I think all the ideas should be open. So that's why we publish papers, right? We actually talk about ideas, right? So I don't think it makes any sense to keep ideas closed, right? So there are some software artifacts that are kind of really deeply embedded into our kind of own kind of like stack. It's kind of only useful when you're trying to build a disaggregated cloud, right? Maybe at some point we're going to open it up, as people have said, right? But at this moment, right? So we are kind of busy actually building it, right? So that's probably kind of getting to the picture about when that piece is going to be open, right? But I think on the research side, the ideas and for our people to publish things, I think that's really, really important, right? So I think that's how we get talent. That's how I think we as a company are going to move the field forward.Swyx [00:31:49]: I noticed that you never use the words federated learning or federated inference.
Is there a distinction that you draw?Ce [00:31:55]: So, I mean, it's definitely not intentional, but I think federated learning has been used in so many different ways by so many different people. It starts to lose a very precise meaning about what that really means, right? If you go back to the original Google paper of federated learning, I think that's very different from what people are talking about today when they say federated. Yeah, we kind of want to be really precise about it.Swyx [00:32:18]: And so your term is disaggregated.Ce [00:32:19]: Yeah, so as an infrastructure, right? So that's disaggregated.Swyx [00:32:22]: Aren't most clouds disaggregated? Like what's different about it?Ce [00:32:27]: So one way is that most of the clouds are disaggregated, but some of that is actually being exposed to the user, right? If you go to AWS, you do know which region you are in, right? So I think one thing that we are trying to do is you have this disaggregated cloud, not only about location or geographically where they are, but about this reliability and also this diversity of this infrastructure. So, and if we want to build a reliable, high-quality layer over that, the user actually doesn't know, right, what's actually happening under the covers, right? So I think that's one of the differences in the way that we are thinking about infrastructure.Swyx [00:33:06]: Yeah, a bit closer to Cloudflare than AWS. Yeah. Yeah. We have one question here, which we'll just throw out, it's kind of fun. So going back to this sort of inference stack piece, maybe if you had to pull out like a call for researchers or just like point out interesting areas of work that you're interested in, what pieces of the stack have the most opportunity for improvement?Ce [00:33:27]: Yeah, so I think the way we are thinking about the inference stack is, so there are multiple things that can happen, right? So you can do better algorithms, like speculative decoding, you can change the model architecture, you can go really crazy on the system side, right? And you can also co-design the hardware, right? So it's not really clear that innovation on a single dimension will get you there. So the key thesis on our side is, if you only push on one direction, you are going to reach diminishing returns really, really quickly. Yeah, there's only so much you can do on the system side, only so much you can do on the algorithm side. I think the only big thing that's going to happen is when you have all those dimensions actually compound, right? So to have algorithm, model, and system all come together, so I think that's how we reach the next 10x improvement on inference, right? So I don't think there's a single dimension that is particularly important, but looking at this space in a joint way, right? Try to co-optimize jointly multiple dimensions, I think that's going to be really important for the community to look at.Vipul [00:34:28]: Yeah, we often see, I see numbers from the team, and you have these multiple methods, and not all of them compound. So you mix some of these together and it's still similar results, and then some combination of them will have this incredible effect that is really, really super interesting. So it's very systems, you know, a kind of broad systems approach to it that's the most effective.Swyx [00:34:51]: I think I finally get the name of the company, like- Bring it together, yeah.
Everything needs to be optimized together.Alessio [00:34:57]: All right, just quickly, how does all this work change, just like some of the architectures change? I know with mixture of experts, for example, speculative decoding is a little less efficient because of memory bandwidth. How much do you invest when it's maybe a model-specific improvement versus a more horizontal thing? Also, you're researching different architectures, so how much do you want to spend time optimizing what's state of the art today versus what's coming next?Vipul [00:35:24]: We do spend time on what's state of the art today as well as what's next. You know, the value we get from doing specific optimization, even for, you know, what works well for a particular model on A100s with a particular bus versus H100s, it's a worthwhile investment for us. So we will go down fairly deep into a specific architecture and specific hardware. It does also inform what works better where, and you don't have to take the same approach for, you know, every model and every sort of hardware setup. We can take these different approaches and we do have these multiple systems now. We know that this, you know, system B is better for Mixtral and system C is going to be better for StripedHyena or Mamba.Alessio [00:36:13]: Before we move on from inference, we need to talk about the AnyScale drama. So we're actually having Sumit on the podcast tomorrow, who also talked about it and kind of came to your guys' support about how, yeah, it's not just like, oh, Together saying this benchmark's not good because they look bad in it. How, I guess like, it's a hard question to ask, but like, why did you decide to just come out and say it? And how maybe does that also reflect the values that you guys have about open source and openness and kind of like being transparent about what's real and maybe hopes for standardizing some of these benchmarks to make it more clear?Ce [00:36:56]: So it's a great service AnyScale is doing for the community, right? I mean, it's very hard to do benchmarks. The moment you do a benchmark comparing N players, right, N minus one will be unhappy. If you have two tables, then maybe N of them will be unhappy, right? So it's a very great thing that they're doing. And in some of the work that we are doing, we actually use LLMPerf, right? So it's a great thing that they're actually doing. So I think one thing about benchmarks is, and probably the professor part of me is talking, is that a good benchmark should think about how it's going to incentivize the field to actually move forward, right? So if the benchmark really becomes a kind of standard, people are going to over-optimize to the benchmark. And when people are doing that, what are we actually trying to incentivize, right? Will that move the world to a better place? Or will that essentially have every single player focus on marketing or spending time or money on something that actually does not matter on the technical side, right? It's very hard to actually strike a balance, right? So I think the reason we kind of try to give feedback on the benchmark is that we kind of want to open up the discussion about how the industry should come together and define maybe a common way that we compare with each other, right? So like how database people do TPC, right? Maybe we should have something actually similar, right? So we are trying to start some of the conversation.
So it's not really that we jump out to say it's not good, because there's no way we can have a perfect benchmark. That doesn't really exist, right? So just try to kickstart a conversation that maybe we should come together and do something that the community agrees on and that aligns with the benefit a user is going to get, right? So just get the conversation started.Vipul [00:38:42]: I've spoken to the AnyScale team after that, and I think they had really great intentions. And partly, I think it felt very objective and everyone sort of had a reaction to it because it just didn't match their benchmarks that we've all run internally against different services. I think a common industry benchmark should be run by an independent party rather than one of the vendors.Swyx [00:39:04]: Is there one that you point to?Vipul [00:39:06]: I don't think one exists today. I think there should be. We're having some conversations about someone setting one up. And there's lots of interesting aspects of this. Time to first token is a function of where the test was run from. There is different load on these services at different times of the day and weekday or weekend. So you have to measure that well. And I think if all of that were done very well by an independent source, that would be a very useful service to customers and to the services themselves.Swyx [00:39:39]: Yeah, I'll point people to artificialanalysis.ai, which is a new one that recently emerged. I don't know if they've done it right. It looks like a side project of a couple people. But I think it's in all the providers' interest to work with them. And ensure that there's an independent third party that's measuring these things, right? At least on the baseline. For me, what's worrying is more about what Ce was saying, which is, do these benchmarks skew things in ways that customers might not be mindful of? Like, what are these things overemphasizing that we might be missing? And I don't really know. It seems like a lot of these services bundle in a version of quantization as well. So that means there's performance trade-offs, right? You're not comparing apples to apples, the same model itself, even though it's like a Llama variant or whatever. So what do people trade off? They trade off latency, they trade off price. Obviously, those are the first two. But what else, right? What factors matter in an inference business?Ce [00:40:33]: Yeah, so I think there's also the throughput, right? So there's the time to first token, right? So, and then there are things that users do not often see, for example, the reliability, right? The capacity, right? So those also have an impact on user experience at a global scale. Maybe not a single query, right? But in aggregation, you can also see a whole bunch of things, like whether you are emphasizing P50 or P95, right? So there's a whole bunch of things that you can actually play with. And of course, there's also quality. So there are different ways to actually make the whole thing faster, speculation, quantization, or a combination of those, right? So yeah, so there are so many things to actually play with. So they probably need a benchmark where the protocol is transparent, to make sure, like, it's very clear what we are doing, and a whole bunch of checks on the quality to make sure we are putting the right group of systems in the same table. So I think then essentially the user can actually navigate the space. So I think that's going to be good for everyone.Swyx [00:41:27]: Yeah, makes sense.
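To make the metrics Ce lists concrete (time to first token, throughput, P50/P95 latency), here is a minimal sketch of how one might measure them against a streaming completion endpoint. The URL, model name, and key are placeholders rather than any particular provider's API, and counting streamed chunks is only a rough stand-in for counting tokens.

```python
import json
import time
import statistics
import requests

API_URL = "https://api.example.com/v1/completions"   # placeholder endpoint (assumption)
HEADERS = {"Authorization": "Bearer YOUR_KEY"}        # placeholder key (assumption)

def measure_once(prompt: str, model: str = "example-model") -> dict:
    """Send one streaming completion request and record latency metrics."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 128, "stream": True}
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    with requests.post(API_URL, headers=HEADERS, json=payload, stream=True, timeout=60) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            if first_token_at is None:
                first_token_at = time.perf_counter()   # approximate time to first token
            n_chunks += 1
    end = time.perf_counter()
    return {
        "ttft_s": first_token_at - start if first_token_at else None,
        "total_s": end - start,
        "chunks_per_s": n_chunks / (end - start),      # rough proxy for tokens/sec
    }

def summarize(samples: list) -> None:
    """Report P50/P95 time-to-first-token and mean throughput across repeated runs."""
    ttfts = sorted(s["ttft_s"] for s in samples if s["ttft_s"] is not None)
    p50 = ttfts[len(ttfts) // 2]
    p95 = ttfts[max(0, int(len(ttfts) * 0.95) - 1)]
    mean_tps = statistics.mean(s["chunks_per_s"] for s in samples)
    print(f"TTFT p50={p50:.3f}s p95={p95:.3f}s, mean throughput={mean_tps:.1f} chunks/s")

if __name__ == "__main__":
    runs = [measure_once("Explain kernel fusion in one sentence.") for _ in range(20)]
    summarize(runs)
```

As the discussion notes, the numbers depend heavily on where the test runs from and on the load at that time of day, so a fair comparison would repeat this from multiple regions and over several days.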
It's a very important field and I think hopefully there's a good third party that emerges from this. So I just want to touch on one more piece, which is I think I'm appreciating from this discussion that fine tuning is a bigger part of your business than I thought. The other big player in fine tuning is Mosaic. Well, Mosaic is more training, but like there's a bunch of other players in the fine tuning space. If I was a prospective fine tuning customer, what do I come to you with? Do I come to you with my custom data and that's it? Do I also have to write the fine tuning code? What level of engagement do you do with your customers?Vipul [00:42:01]: I think across the spectrum, our customers are training models, pre-training models from scratch, and many of them will bring their data sets and, you know, use our infrastructure and training stack to train their models. There are others who have trained smaller models and want to scale up, scale up across infrastructure, scale up across data. So we'll sort of help them do that. There are customers where we sort of initially start a little bit more consultative. They have a particular task and idea in mind and we will help them get from there to the data set and the right model to achieve that task. So it's a spectrum and, you know, our goal is to, we're trying to productize as much of this as possible. So that the whole process can be fast and scalable. I would say there is a lot more understanding around fine tuning now, like even in the last six months, there are, you know, open source tools, recipes, literature, podcasts, Discord channels where people are figuring it out, and it really is, in many ways, one of the successes of open source: you have small collectives of, you know, engineers who are now creating the top models on open source leaderboards and have tried out all sorts of different, you know, data recipes, creating synthetic data. Merging models. Merging models. So it's, that's really fun to see. And I think that sort of agency that exists now is exciting. And we see a lot of that sort of being applied into products and, you know, more commercial models that people are deploying in their applications.Alessio [00:43:50]: And then just to, I guess, wrap up on Together, it's almost becoming like a platform as a service, because now you release Together Embeddings. How did you get 92.5 accuracy on 32K retrieval? And do you think we're kind of done with embeddings, like we did everything that we could, you know, we're getting to the most optimized it's gonna get, and then we should just focus on models and inference, or do you think there's still room there to improve?Ce [00:44:17]: Oh, I think we haven't even gotten started on embeddings. Yeah. So I think there are so many things. So like embeddings are really fundamental for many things, for example, RAG, right? That's a big application. So that's how people bring knowledge in. That's also the fundamental piece when you want to build a better model, right? So that gives you this understanding about what actually gets into the model. You can actually use that to actually build a better data set, get a better model, then get better embeddings, and you start this loop, right? Without good embeddings, the loop is not closed, right? So I think both on the quality side, how to embed more like dedicated semantics into those vectors, how to deal with negation, for example, right? So, and how can you make the whole thing really, really fast?
So I think for the next couple years, yeah, we will see a whole bunch of new embedding models, maybe of different sizes and much, much faster than today. Yeah, so I think it's a very active research area. I think people should invest more, yeah.Swyx [00:45:14]: I was surprised to see, I think Jina or, yeah, there's Jina AI, and then there's another guy, Tengyu's Voyage. They are coming out as startups purely focused on embeddings.Ce [00:45:25]: Yeah. Yeah, so I think it's a very, very important piece of the system, right? So people haven't focused a lot on them before, and they should definitely start to do that.Swyx [00:45:36]: Yeah. Why are the Chinese universities so good at embeddings? You know what I mean, right? Like the BGE and- Yeah, yeah, yeah.Ce [00:45:44]: So I don't know. We just released our first embedding model, so we're still trying to learn how to build an embedding model. Yeah, so ask me again in six months.Swyx [00:45:53]: I'll probably have more insight about how to build a better one. I just noticed that, you know, ada-002 used to be at the top of the MTEB leaderboard, and then it's just like sliding down and down and down, and all the new models are coming out of China for some reason. And I'm like, I don't know what's going on there. So we cannot leave this discussion without talking about state space models. But first of all, how much of the company is dedicated to research? Like it's obviously like not production quality yet, but-Vipul [00:46:17]: I would say it's like 40, 45%, I was counting this morning. That's huge.Swyx [00:46:22]: Yeah, so that's the biggest- It's a big investment. Yeah. Okay, well, I mean, it looks like it's paying off, so. And then high level, I will confess or admit or mention for the listeners who are also similarly skeptical, I did not use to care about long context because I was like, you know, 30K is enough, 100K is enough, right? I'm not, you know, modeling DNA sequences or anything like that. Why do I need long context? And I mean, first of all, I'll throw that open to you. But second of all, I think what Mamba did for me was change that perception, that it's only about long context, that the only reason you want sub-quadratic architectures is for long context. Actually, that's not true. It's also just more efficient to train, period. Right? I'll just leave that open to you. Like what's the motivation that people should keep in their heads? There are multiple things, right?Ce [00:47:09]: So one thing is that, I mean, the moment a model can do long context well, it often means that it's kind of cheaper. I mean, in principle, transformers can do long context. It's just very expensive. So I think what those state space models are trying to do is push the size of the state, right, to be as small as possible. That's how you get long context, right? And try to kind of decouple this quadratic dependency, right, to make sure you can have a much better execution pattern. One direct consequence of that is you can do long context really cheaply, but on the other hand, it also introduces a whole bunch of benefits even when you are not doing long context. Right? So I think that's actually probably equally important. Because the state gets smaller, you can do really large batch sizes, right? You can actually be much faster. Right? So yeah. And another thing is like, one of the hypotheses that we have is, like in StripedHyena, it starts to have a hybrid architecture, right?
Part of it has like a state space model and part of it is still the transformer. So different components probably deal with different things kind of better. So maybe by putting them together, by thinking about how information propagates over the whole horizon of the context, you can probably get an even better quality model than a transformer. Right? So I think that's why we are kind of investing a lot in those models. Not only for the context, which is very important, but also for a whole bunch of benefits it could get.Swyx [00:48:42]: Yeah. How should people treat the distinction between Mamba and StripedHyena? Like what's the point of releasing these two as separate models? Is one like sort of the Together proprietary one and then the other is like the more open research one?Ce [00:48:53]: Yeah. So I think it's pretty much a different stage of exploration. So we kind of had different hypotheses when we tried to build those. Yeah. Like for instance, there are different views about state space models. One is Hyena, another is like Mamba, right? They're actually different architectures. So when we built StripedHyena, right, the curiosity that we had is how good can we... So what is the highest quality non-transformer model we can ever build? The goal of StripedHyena is to try to see whether we can match Mistral. And by fine-tuning well, whether we can outperform that in some way, right? So it has a very, very strong baseline that we are trying to beat. So that's why the hybrid thing, like, gets into the picture, right? And for Mamba, it's kind of more... The curiosity was how far can we push a pure architecture? Then we start in this very systematic way from small to large, right? All the way to 3 billion, right? So the baseline was essentially the best 3 billion model. So I guess they're at different stages of exploration; at some point, I think they are going to converge. We actually learn different things when building different models. I think they are just like intermediate stages in the exploration at different points.Alessio [00:50:02]: You mentioned the hybrid architecture. Is that the model grafting that you mentioned in the StripedHyena post, where you mentioned you can have transformers and non-transformers together? Like this is a concept that I hadn't heard before reading about this. So in most people's mental models, it's like transformers OR something else, it's not transformers AND something else. How do you train a model that is hybrid? Is there any difference in like how you construct your datasets? Is there any difference in then how you run inference on it? How should people think about starting research in this field?Ce [00:50:36]: Yeah, so we were also very surprised, yeah, when we came up with this hybrid architecture. So the way to think about it is like you have different layers in the neural network, right? So like the state space model for some layers will already give you the benefit. For the other layers, they could be transformers, right? They could give you this more global view of the sequence, but maybe for other layers, you don't have to have that, right? I still can have all the other things that kick in, right? So we don't know what is the optimal mixture between different architectures. I mean, in principle, we can have Mamba, Hyena, and transformer layers, all those things coming together, right? And then you can see what makes sense. We have no idea what is optimal doing that.
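As a concrete illustration of the layer-mixing idea Ce describes, here is a minimal PyTorch sketch of a hybrid stack that interleaves a cheap gated-convolution block (a stand-in for a Hyena- or Mamba-style state space layer) with ordinary self-attention layers. This is only an illustration of the hypothesis, not the actual StripedHyena architecture; the block design and the mixing ratio are invented for the example.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Stand-in for a state-space / Hyena-style block: a causal depthwise
    convolution with gating. Cost is linear in sequence length."""
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, padding=kernel_size - 1)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, dim)
        seq_len = x.shape[1]
        # Trim the right padding so position i only sees positions <= i (causal).
        h = self.conv(x.transpose(1, 2))[..., :seq_len].transpose(1, 2)
        return x + self.proj(h * torch.sigmoid(self.gate(x)))

class AttentionBlock(nn.Module):
    """Plain multi-head self-attention block, quadratic in sequence length.
    Causal masking is omitted for brevity."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class HybridModel(nn.Module):
    """Interleave the block types: most layers are the cheap gated-conv block,
    and every attn_every-th layer is full attention for a global view."""
    def __init__(self, dim: int = 64, depth: int = 8, attn_every: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([
            AttentionBlock(dim) if (i + 1) % attn_every == 0 else GatedConvBlock(dim)
            for i in range(depth)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x

if __name__ == "__main__":
    model = HybridModel()
    tokens = torch.randn(2, 128, 64)    # (batch, seq, dim)
    print(model(tokens).shape)          # torch.Size([2, 128, 64])
```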
So what we are excited about is that now the community has a whole bunch of building blocks that they can actually play with like Lego, right? So just put them together and see what happens, right? So we are kind of very excited about that. Yeah, we are in the process of trying to learn more about this architecture. And when we know what we are talking about, we will definitely share with the community about how to do that in a systematic way.Swyx [00:51:41]: Cool. What are we still unsure about? Like, why don't we just, you know, put all the money in the world into training these things now? Like what is left to figure out before we scale this thing?Ce [00:51:53]: So like if you look at how the transformer has been developed, right, in the last like five to 10 years, right, people didn't start from like, you have this Attention Is All You Need paper and then let's put all the money in, right? It always starts from this very systematic understanding about the scaling, about data quality, about essentially the limits, right? I think for state space models to go from the labs to the real world, you kind of need to go through the same process. But of course, the second time doing that is kind of easier, right? But I think there's no way we can get rid of this systematic step of studying scaling laws, studying what data to put in, right? So what's the impact of different data slices on the final model quality?Swyx [00:52:33]: Do you expect that the data inputs will be different?Ce [00:52:37]: I don't know, but I wouldn't take it for granted that they should be the same, right? So that's one of the hypotheses. So we have no opinion on that, because I think that's the result of the study, not the assumption. Yeah, we do not need to assume that.Swyx [00:52:51]: Okay, scaling laws and data, anything else like architectural that we are not sure about? Because now you have this selection mechanism that you're pretty happy with.Ce [00:52:59]: Yeah, so, I mean, first of all, how to mix them, right? So, and second is what is the architecture? So if you look at the transformer, right, one very interesting piece there is people also optimize the hardware, yeah, to make sure that things run very fast, right? There are very efficient kernels, there's very efficient hardware. And then that adds another boost, right, for the transformer architecture, right? So that's something that should happen for state space models. Which architecture is kind of easier to run on the hardware, right? So things go kind of faster, you can put in more data, and it adds another dimension to the scaling law. So I think we just need to plow through the whole space and just be really systematic, from small models to 1 billion, 3 billion, 7 billion, just go all the way up, right? So I wouldn't jump around in the space. I would just like be patient and just like be systematic. Yeah, I think we'll get there, yeah.Swyx [00:53:52]: Yeah, well, I'm looking forward for more research from you guys to figure that out. So one dimension which we didn't talk about, we talked about long context, we talked about efficiency, but speed is also very important. A good inference provider provides, let's say, 70 tokens per second, and then maybe that's faster than less good inference providers that are more like 30 tokens per second. But that's the rough range, right, state-of-the-art today? That's around human speaking speed; human reading speed is about 200 words per minute.
Why do we need 5,000 tokens per second is my question back to Vipul. And maybe is this something that is an emphasis for research as well, or is this more just an inference-only thing?Vipul [00:54:29]: There are applications that are consuming the tokens that are produced from a model, so they're not necessarily being read or heard by humans. That's a place where we see that level of requirement today that really nobody can quite satisfy. And then, I think about, as intelligence grows, how do you sort of increase the bandwidth of it, you know, how do you reduce the latency of it? If we can do 5,000 tokens a second on the same card, the throughput of that card goes up significantly and it can support more applications. So I think it's important from that perspective. And then, it opens up new UX possibilities. Once you can get sort of an immediate answer
As things stand today: "We have to scale. We first test on small data sets, then on medium-sized sets and finally on the very large sets," explains Hochreiter. The xLSTM technology performs "very well" on the small and medium-sized data sets. You have to beat all models on the market, they say in Linz. Over 1,000 GPUs around the world are currently working on the development from Upper Austria. The time horizon is ambitious: the tests on the smaller and medium-sized data sets should be completed by the end of March. "The large data sets will then take centre stage in June," explains Albert Ortig, Managing Director of NXAI. Thanks for listening. We welcome suggestions for topics, criticism and a few stars on Apple, Spotify and Co. We thank our partner **HANNOVER MESSE** https://www.hannovermesse.de/de/ Albert Ortig [Contact](https://www.linkedin.com/in/aajjoo/) Sepp Hochreiter [Contact](https://www.linkedin.com/in/sepp-hochreiter-41514846/)
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost. 2017: Noam M. Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, J. Dean Machine translation, Artificial neural network, Computation, Language model, Sparse matrix, Long short-term memory, Benchmark (computing), Algorithm, Algorithmic efficiency, Text corpus, GPU cluster, Graphics processing unit https://arxiv.org/pdf/1701.06538.pdf
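The core mechanism in the abstract, a trainable gate that routes each example to a sparse top-k subset of experts and mixes their outputs with the gate weights, can be sketched in a few lines of PyTorch. This is a simplified illustration only: it omits the paper's noisy gating and load-balancing auxiliary loss, and routing is written as a plain loop rather than the batched dispatch used at scale.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal sparsely-gated mixture-of-experts layer with top-k routing."""
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)   # trainable gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (batch, dim)
        scores = self.gate(x)                                  # (batch, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)    # keep only k experts per example
        weights = F.softmax(topk_scores, dim=-1)               # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            w = weights[:, slot].unsqueeze(-1)
            # Dispatch each example to its selected expert (looped for clarity, not speed).
            for expert_id in idx.unique():
                mask = idx == expert_id
                out[mask] = out[mask] + w[mask] * self.experts[int(expert_id)](x[mask])
        return out

if __name__ == "__main__":
    layer = SparseMoE(dim=32)
    tokens = torch.randn(16, 32)
    print(layer(tokens).shape)   # torch.Size([16, 32])
```

Because only k of the experts run for any given example, parameter count can grow with the number of experts while the per-example compute stays roughly constant, which is the conditional-computation effect the abstract describes.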
On the eve of the interview, Robert met Prof. Dr Johannes Brandstetter. He comes from Microsoft Research in Amsterdam and is now doing research in Linz. He is sure: Hochreiter is on the right track. Hochreiter explains the architecture to us, talks about the future of transformers and explains why xLSTM could also change the industrial sector. Thanks for listening. We welcome suggestions for topics, criticism and a few stars on Apple, Spotify and Co. We thank our partner [Siemens](https://new.siemens.com/global/en/products/automation/topic-areas/artificial-intelligence-in-industry.html) [Sepp Hochreiter](https://www.linkedin.com/in/sepp-hochreiter-41514846/) **OUR EVENT IN JANUARY** https://www.hannovermesse.de/de/rahmenprogramm/special-events/ki-in-der-industrie/ We thank our team: Barbara, Anne and Simon!
Land Surface Temperature (LST) is a vital measure derived from thermal remote sensing satellite data. In the summer of 2023, it gained attention in the media and stirred some confusion when compared to regular air temperatures. In the first episode of the special series "Thermal Lens," we feature Michael Perry from the University of Leicester. He's been part of important space missions like LSTM and Sentinel-3's SLSTR, all focused on studying LST. Mike, with a background in physics and a PhD in developing new ways to understand LST and emissivity, is our expert. During the episode, Mike explains what LST truly is, how we calculate it, and where we obtain the data. He also simplifies the various methods and algorithms used, making it easier to grasp their differences. If you're new to LST, this episode provides practical advice on selecting the right data for different applications.
How to PREDICT Stock Prices with DEEP LEARNING in Python. To access Mexem: https://it.lp.mexem.com/lp-ecoita-promozione Github: https://github.com/DrElegantia/prevedere_Azioni Colab: https://colab.research.google.com/drive/1KedH0yzas4GjOsYhlqrcCFcMiFW04oRh?usp=sharing Colab I took inspiration from: https://colab.research.google.com/github/rawar/tensorflow-notebooks/blob/master/stock_price_prediction_keras.ipynb Today we look at how to use an LSTM (Long Short-Term Memory) neural network to predict stock prices using Python and Keras. In this video, I will walk you through the process of building an LSTM neural network to analyse historical stock price data and make future predictions. We will learn how to prepare the data, build the LSTM model, train it, and evaluate it. Subscribe here: https://www.youtube.com/economiaitalia/join https://www.patreon.com/join/EconomiaItalia? 00:00 What is Mexem? 01:15 Data preprocessing: normalization, time window 03:38 Building the LSTM model with Keras 05:40 Training and testing the network on the historical dataset 11:43 Evaluating the predictions 13:07 Closing thoughts Suggest topics here: https://tellonym.me/dr.elegantia Podcast (on all platforms): https://www.spreaker.com/show/dr-elegantia-podcast HOW TO SUPPORT US: Our new book on economics, Guida Terrestre per Autoeconomisti: https://www.poliniani.com/product-page/guida-terrestre Amazon purchase link: https://amzn.to/36XTXs8 By buying our data-themed T-shirts, hand screen-printed with passion and pride by the inmates of the Lorusso e Cutugno prison in Turin: https://bit.ly/3zNsdkd and HTTPS://urly.it/3nga1 Guida al VOTO 2022 (2022 voting guide): https://amzn.to/3KflXHd PayPal donations: https://paypal.me/appuntiUAB Want to support me without spending a single euro? Use this link for your next Amazon purchase: https://amzn.to/2JGRyGT Here you'll find the books I recommend to start understanding more about economics: https://www.youtube.com/watch?v=uEaIk8wQ3z8 Where to find us: https://www.umbertobertonelli.it/info/ https://linktr.ee/economiaitalia My setup: Logitech StreamCam https://amzn.to/3HR6xq0 Lights https://amzn.to/3n6qtgP Shure MV7 https://amzn.to/3HRh7k1 Mic arm https://amzn.to/3HSRvzY #economiaitalia #drelegantia #economia Become a supporter of this podcast: https://www.spreaker.com/podcast/dr-elegantia-podcast--5692498/support.
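Mirroring the steps described in the video (normalization, sliding time window, LSTM model with Keras, training, evaluation), here is a condensed sketch in Python. The data source (yfinance), ticker, window length, and hyperparameters are illustrative assumptions, and this is an educational example rather than an investment tool.

```python
import numpy as np
import yfinance as yf                         # assumed data source for the example
from sklearn.preprocessing import MinMaxScaler
from tensorflow import keras

# 1. Download historical closing prices (ticker is just an example).
prices = yf.download("AAPL", start="2015-01-01")["Close"].to_numpy().reshape(-1, 1)

# 2. Normalize to [0, 1] and build sliding windows: 60 past days -> next day.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(prices)
window = 60
X = np.array([scaled[i - window:i, 0] for i in range(window, len(scaled))])
y = scaled[window:, 0]
X = X.reshape(-1, window, 1)

# 3. Chronological train/test split (no shuffling for time series).
split = int(0.8 * len(X))
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

# 4. A small stacked LSTM model.
model = keras.Sequential([
    keras.Input(shape=(window, 1)),
    keras.layers.LSTM(50, return_sequences=True),
    keras.layers.LSTM(50),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# 5. Evaluate: invert the scaling so the error is in price units.
pred = scaler.inverse_transform(model.predict(X_test))
actual = scaler.inverse_transform(y_test.reshape(-1, 1))
rmse = np.sqrt(np.mean((pred - actual) ** 2))
print(f"Test RMSE: {rmse:.2f}")
```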
FlashAttention was first published by Tri Dao in May 2022 and it had a deep impact in the large language models space. Most open models you've heard of (RedPajama, MPT, LLaMA, Falcon, etc) all leverage it for faster inference. Tri came on the podcast to chat about FlashAttention, the newly released FlashAttention-2, the research process at Hazy Research, and more. This is the first episode of our “Papers Explained” series, which will cover some of the foundational research in this space. Our Discord also hosts a weekly Paper Club, which you can sign up for here. How does FlashAttention work? The paper is titled “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”. There are a couple of keywords to call out:* “Memory Efficient”: standard attention memory usage is quadratic with sequence length (i.e. O(N^2)). FlashAttention is sub-quadratic at O(N). * “Exact”: the opposite of “exact” in this case is “sparse”, as in “sparse networks” (see our episode with Jonathan Frankle for more). This means that you're not giving up any precision.* The “IO” in “IO-Awareness” stands for “Input/Output” and hints at a write/read related bottleneck. Before we dive in, look at this simple GPU architecture diagram: The GPU has access to three memory stores at runtime:* SRAM: this is on-chip memory co-located with the actual execution core. It's limited in size (~20MB on an A100 card) but extremely fast (19TB/s total bandwidth)* HBM: this is off-chip but on-card memory, meaning it's in the GPU but not co-located with the core itself. An A100 has 40GB of HBM, but only a 1.5TB/s bandwidth. * DRAM: this is your traditional CPU RAM. You can have TBs of this, but you can only get ~12.8GB/s bandwidth, which is way too slow. Now that you know what HBM is, look at how the standard Attention algorithm is implemented: As you can see, all 3 steps include a “write X to HBM” step and a “read from HBM” step. The core idea behind FlashAttention boils down to this: instead of storing each intermediate result, why don't we use kernel fusion and run every operation in a single kernel in order to avoid memory read/write overhead? (We also talked about kernel fusion in our episode with George Hotz and how PyTorch / tinygrad take different approaches here.) The result is much faster, but much harder to read: As you can see, FlashAttention is a very meaningful speed improvement on traditional Attention, and it's easy to understand why it's becoming the standard for most models. This should be enough of a primer before you dive into our episode!
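To see why the read/write pattern is the bottleneck, here is standard attention written out naively in PyTorch: the score matrix and the softmax output are each full N-by-N intermediates that a straightforward kernel materializes in HBM, and that is exactly the traffic FlashAttention's fused, tiled kernel avoids. This sketch illustrates the problem, not the FlashAttention algorithm itself; the shapes are arbitrary.

```python
import math
import torch

def naive_attention(q, k, v):
    """Standard attention, step by step. q, k, v: (batch, heads, N, d).
    Each intermediate below is an (N x N) tensor per head that a naive
    kernel writes to, and reads back from, GPU memory."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # (batch, heads, N, N) -> write/read
    probs = torch.softmax(scores, dim=-1)             # (batch, heads, N, N) -> write/read
    return probs @ v                                  # (batch, heads, N, d)

batch, heads, N, d = 1, 16, 2048, 64
q = k = v = torch.randn(batch, heads, N, d)
out = naive_attention(q, k, v)

# The N x N intermediates dominate memory: 16 heads * 2048^2 * 4 bytes ~ 256 MB each,
# versus ~8 MB for the (N x d) inputs and output, hence the O(N^2) memory of standard attention.
print(out.shape)
```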
We talked about FlashAttention-2, how Hazy Research Group works, and some of the research being done in Transformer alternatives.Show Notes:* FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (arXiv)* FlashAttention-2* Together AI* From Deep Learning to Long Learning* The Hardware Lottery by Sara Hooker* Hazy Research* Is Attention All You Need?* Nvidia CUTLASS 3* SRAM scaling slows* Transformer alternatives:* S4* Hyena* Recurrent Neural Networks (RNNs)Timestamps:* Tri's background [00:00:00]* FlashAttention's deep dive [00:02:18]* How the Hazy Research group collaborates across theory, systems, and applications [00:17:21]* Evaluating models beyond raw performance [00:25:00]* FlashAttention-2 [00:27:00]* CUDA and The Hardware Lottery [00:30:00]* Researching in a fast-changing market [00:35:00]* Promising transformer alternatives like state space models and RNNs [00:37:30]* The spectrum of openness in AI models [00:43:00]* Practical impact of models like LLAMA2 despite restrictions [00:47:12]* Incentives for releasing open training datasets [00:49:43]* Lightning Round [00:53:22]Transcript:Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. Today we have no Swyx, because he's in Singapore, so it's a one-on-one discussion with Tri Dao. Welcome! [00:00:24]Tri: Hi everyone. I'm Tri Dao, excited to be here. [00:00:27]Alessio: Tri just completed his PhD at Stanford a month ago. You might not remember his name, but he's one of the main authors of the FlashAttention paper, which is one of the seminal works of the Transformer era. He's got a lot of interests, from efficient transformer training and inference to long-range sequence models, a lot of interesting stuff. And now you're going to be an assistant professor in CS at Princeton next year. [00:00:51]Tri: Yeah, that's right. [00:00:52]Alessio: Yeah. And in the meantime, just to get, you know, a low pressure thing, you're Chief Scientist at Together as well, which is the company behind RedPajama. [00:01:01]Tri: Yeah. So I just joined this week actually, and it's been really exciting. [00:01:04]Alessio: So what's something that is not on the internet that people should know about you? [00:01:09]Tri: Let's see. When I started college, I was going to be an economist, so I was fully on board. I was going to major in economics, but the first week I was at Stanford undergrad, I took a few math classes and I immediately decided that I was going to be a math major. And that kind of changed the course of my career. So now I'm doing math, computer science, AI research. [00:01:32]Alessio: I had a similar thing. I started with physics and then I took like a programming course and I was like, I got to do computer science. I don't want to do physics. So FlashAttention is definitely, everybody's using this. Everybody loves it. You just released FlashAttention 2 last week. [00:01:48]Tri: Yeah. Early this week on Monday. Yeah. [00:01:53]Alessio: You know, AI time. Things move fast. So maybe let's run through some of the FlashAttention highlights, some of the innovation there, and then we can dive into FlashAttention 2. So the core improvement in FlashAttention is that traditional attention is quadratic in sequence length, N to the two, and FlashAttention is linear, which obviously helps with scaling some of these models. [00:02:18]Tri: There are two factors there. So of course the goal has been to make attention go faster or more memory efficient.
And ever since attention became popular in 2017 with the Transformer paper, lots and lots of folks have been working on this. And a lot of approaches have been focusing on approximating attention. The goal is you want to scale to longer sequences. There are tons of applications where you want to do that. But scaling to longer sequences is difficult because attention scales quadratically in sequence length on both runtime and memory, as you mentioned. So instead of trying to approximate attention, we were trying to figure out, can we do the same computation and maybe be more memory efficient? So in the end, we ended up with memory that is linear in sequence length. In terms of computation, it's still quadratic, but we managed to make it much more hardware friendly. And as a result, we do get wall clock speed up on the order of 2 to 4x, which really helps because that just means that you'll be able to train with 2 to 4x longer sequence length for the same cost without doing any approximations. As a result, lots of folks have been using this. The thing is available in a lot of libraries that do language model training or fine tuning. [00:03:32]Alessio: And the approximation thing is important because this is an exact thing versus a sparse. So maybe explain a little bit the difference there. [00:03:40]Tri: For sure. So in attention, essentially you compute pairwise similarity between every single element in a sequence against each other. So there's been other approaches where instead of doing all that pairwise computation, you only compute similarity for some pairs of elements in the sequence. So you don't do a quadratic number of comparisons. And this can be seen as some form of sparsity. Essentially you're ignoring some of the elements. When you write down the matrix, you essentially say, OK, I'm going to pretend there's zero. So that has some benefits in terms of runtime and memory. But the trade-off is that it tends to do worse in terms of quality because you're essentially approximating or ignoring some elements. And I personally have worked on this as well for a few years. But when we talk to practitioners who actually train models, especially at large scale, they say they tend not to use these approximate attention methods. Because it turns out, and this was surprising to me at the time, that these approximation methods, even though they perform fewer computations, they tend to not be faster in wall-clock time. So this was pretty surprising because back then, I think my background was more on the theoretical side. So I was thinking of, oh, how many flops or floating point operations are you performing? And hopefully that correlates well with wall-clock time. But I realized that I was missing a bunch of ideas from the system side where flops or floating point operations don't necessarily correlate with runtime. There are other factors like memory reading and writing, parallelism, and so on. So I learned a ton from just talking to systems people because they kind of figured this stuff out a while ago. So that was really eye-opening. And then we ended up focusing a lot more on memory reading and writing because that turned out to be the majority of the time when you're doing attention is reading and writing memory. [00:05:34]Alessio: Yeah, the I.O. awareness is probably one of the biggest innovations here. And the idea behind it is, like you mentioned, the FLOPS growth of the cards has been going up, but the memory bandwidth, not as much.
So I think maybe that was one of the assumptions that the original attention paper had. So talk a bit about how that came to be as an idea. It's one of those things that, like, in hindsight, it's like, obviously, why are we, like, writing to HBM every time, you know, and like once you change it, it's clear. But what was that discovery process? [00:06:08]Tri: Yeah, in hindsight, a lot of the ideas have already been there in the literature. And I would say it was somehow at the intersection of both machine learning and systems. And you kind of needed ideas from both sides. So on one hand, on the system side, so lots of systems folks have known that, oh, you know, kernel fusion is great. Kernel fusion just means that instead of, you know, loading an element, performing an operation, writing it down, loading it back up and performing the second operation, you just load it once, perform both operations and then write it down once. So that saves you kind of memory read and write in the middle there. So kernel fusion has been a classic. There's been other techniques from the system side, like tiling, where you perform the computations in blocks, again, so that you can load them into a really fast memory. Think of it as a cache. And this is, again, classical computer science ideas, right? You want to use the cache. So the system folks have been thinking about these ideas for a long time, and they apply to attention as well. But there were certain things in attention that made it difficult to do a complete kernel fusion. One of which is there is this softmax operation in the middle, which requires you to essentially sum across the row of the attention matrix. So it makes it difficult to kind of break it up, because there's this dependency. It makes it difficult to break things into blocks. So on the system side, people have been thinking about these ideas, but it's been difficult to kind of do kernel fusion for the entire operation. On the machine learning side, people have been thinking more algorithmically. They say, okay, either we can approximate attention, or there's this trick called the online softmax trick, which says that because of softmax, the way it's written mathematically, you can actually break it up into smaller pieces, do some rescaling, and still get the right answer. So this online softmax trick has been around for a while. I think there was a paper from NVIDIA folks back in 2018 about this. And then there was a paper from Google. So Markus Rabe and Staats wrote a paper in late 2021 on using this online softmax trick to break attention up into smaller pieces. So a lot of the ideas were already there. But it turns out, you kind of need to combine ideas from both sides. So you need to understand that, hey, we want to do kernel fusion to reduce memory reads and writes. But we also need this online softmax trick to be able to break the softmax into smaller pieces so that a lot of the systems tricks kind of carry through. We saw that, and it was kind of a natural idea that we ended up using ideas from both sides, and it ended up working pretty well. Yeah. [00:08:57]Alessio: Are there any downsides to kernel fusion? If I think about databases and the reasons why we have atomic operations, you know, it's like, you have observability and fallback in between them. How does that work with attention? Is there anything that we lose by fusing the operations?
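A small numerical sketch of the online softmax trick Tri describes above: the softmax of a long row is computed block by block, carrying only a running maximum and a running normalizer and rescaling earlier partial results as the maximum changes, so the whole row never has to be materialized at once. The block size and inputs here are arbitrary, and the result is checked against the ordinary softmax.

```python
import numpy as np

def online_softmax(row, block_size=4):
    """Compute softmax(row) one block at a time, keeping only a running
    max and a running normalizer, as in the online softmax trick."""
    running_max = -np.inf
    running_sum = 0.0
    outputs = []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        new_max = max(running_max, block.max())
        # Rescale the normalizer (and previously emitted blocks) to the new max.
        scale = np.exp(running_max - new_max)
        running_sum = running_sum * scale + np.exp(block - new_max).sum()
        outputs = [o * scale for o in outputs]
        outputs.append(np.exp(block - new_max))
        running_max = new_max
    return np.concatenate(outputs) / running_sum

x = np.random.randn(10)
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
print("online softmax matches the standard softmax")
```

In FlashAttention the same rescaling is applied to the partially accumulated output of the attention-times-values product, which is what lets the whole operation be tiled and fused into one kernel.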
[00:09:13]Tri: Yeah, I think mostly on the practical side is that you lose a little bit of flexibility in the sense that, hey, now you have, for example, faster attention, it's just a subroutine that you would call to do attention. But as a researcher, let's say you don't want that exact thing, right? You don't want just attention, let's say you want some modification to attention. You want to do, hey, I'm going to multiply the query and key, but then I'm going to do this extra thing before I carry on. So kernel fusion just means that, okay, we have a subroutine that does the entire thing. But if you want to experiment with things, you won't be able to use that fused kernel. And the answer is, can we have a compiler that then automatically does a lot of this kernel fusion? Lots of compiler folks are thinking about this, either with a new language or you can embed it in PyTorch. PyTorch folks have been working on this as well. So if you write just your code in PyTorch and they can capture the graph, can they generate code that will fuse everything together? That's still ongoing, and it works for some cases. But for attention, because of this kind of softmax rewriting stuff, it's been a little bit more difficult. So maybe in a year or two, we'll have compilers that are able to do a lot of these optimizations for you. And you don't have to, for example, spend a couple months writing CUDA to get this stuff to work. Awesome. [00:10:41]Alessio: And just to make it clear for listeners, when we say we're not writing it to memory, we are storing it, but just in a faster memory. So instead of the HBM, we're putting it in the SRAM. Yeah. [00:10:53]Tri: Yeah. [00:10:54]Alessio: Maybe explain just a little bit the difference there. [00:10:56]Tri: Yeah, for sure. This is kind of a caricature of how you think about accelerators or GPUs in particular, is that they have a large pool of memory, usually called HBM, or high bandwidth memory. So this is what you think of as GPU memory. So if you're using A100 and you list the GPU memory, it's like 40 gigs or 80 gigs. So that's the HBM. And then when you perform any operation, you need to move data from the HBM to the compute unit. So the actual hardware unit that does the computation. And next to these compute units, there are on-chip memory or SRAM, which are much, much smaller than HBM, but much faster. So the analogy there is if you're familiar with, say, CPU and RAM and so on. So you have a large pool of RAM, and then you have the CPU performing the computation. But next to the CPU, you have L1 cache and L2 cache, which are much smaller than DRAM, but much faster. So you can think of SRAM as the small, fast cache that stays close to the compute unit. Physically, it's closer. There is some kind of asymmetry here. So HBM is much larger, and SRAM is much smaller, but much faster. One way of thinking about it is, how can we design algorithms that take advantage of this asymmetric memory hierarchy? And of course, lots of folks have been thinking about this. These ideas are pretty old. I think back in the 1980s, the primary concerns were sorting. How can we sort numbers as efficiently as possible? And the motivating example was banks were trying to sort their transactions, and that needs to happen overnight so that the next day they can be ready. And so the same idea applies, which is that they have slow memory, which was hard disk, and they have fast memory, which was DRAM. And people had to design sorting algorithms that take advantage of this asymmetry. 
And it turns out, these same ideas can apply today, which is different kinds of memory. [00:13:00]Alessio: In your paper, you have the pyramid of memory. Just to give people an idea, when he says smaller, it's like HBM is like 40 gig, and then SRAM is like 20 megabytes. So it's not a little smaller, it's much smaller. But the throughput on card is like 1.5 terabytes a second for HBM and like 19 terabytes a second for SRAM, which is a lot larger. How do you think that evolves? So TSMC said they hit the scaling limits for SRAM, they just cannot grow that much more. HBM keeps growing, HBM3 is going to be 2x faster than HBM2, I think the latest NVIDIA thing has HBM3. How do you think about the future of FlashAttention? Do you think HBM is going to get fast enough when maybe it's not as useful to use the SRAM? [00:13:49]Tri: That's right. I think it comes down to physics. When you design hardware, literally SRAM stays very close to compute units. And so you don't have that much area to essentially put the transistors. And you can't shrink these things too much. So just physics, in terms of area, you don't have that much area for the SRAM. HBM is off-chip, so there is some kind of bus that essentially transfers data from HBM to the compute unit. So you have more area to essentially put these memory units. And so yeah, I think in the future SRAM probably won't get that much larger, because you don't have that much area. HBM will get larger and faster. And so I think it becomes more important to design algorithms that take advantage of this memory asymmetry. It's the same thing in CPU, where the cache is really small, the DRAM is growing larger and larger. DRAM could get to, I don't know, two terabytes, six terabytes, or something, whereas the cache stays at, I don't know, 15 megabytes or something like that. I think maybe the algorithm design becomes more and more important. There's still ways to take advantage of this, I think. So in the future, I think flash attention right now is being used. I don't know if in the next couple of years, some new architecture will come in and whatnot, but attention seems to be still important. For the next couple of years, I still expect some of these ideas to be useful. Not necessarily the exact code that's out there, but I think these ideas have kind of stood the test of time. New ideas like IO awareness from back in the 1980s, ideas like kernel fusions, tiling. These are classical ideas that have stood the test of time. So I think in the future, these ideas will become more and more important as we scale models to be larger, as we have more kinds of devices, where performance and efficiency become much, much more important. [00:15:40]Alessio: Yeah, and we had Jonathan Frankle on the podcast, and if you go to issattentionallyouneed.com, he has an outstanding bet, and he does believe that attention will be the state of the art architecture still in a few years. Did you think flash attention would be this popular? I'm always curious on the research side, you publish a paper, and obviously you know it's great work, but sometimes it just kind of falls flat in the industry. Could you see everybody just starting to use this, or was that a surprise to you? [00:16:11]Tri: Certainly, I didn't anticipate the level of popularity. Of course, we were extremely happy to have people using this stuff and giving us feedback and so on, and help us improve things. 
I think when we were writing the paper, I remember sending an email to one of my advisors, and like, hey, I'm excited about this paper, but I think the most important thing will be the artifact, which is the code. So I knew that the code will be valuable. So we kind of focus a lot on the code and make sure that the code is usable and as fast as can be. Of course, the idea, the paper presents the ideas and explain it and have experiments that validate the idea, but I knew that the artifact or the code was also pretty important. And that turned out to be the right focus, which is, you know, we put out the paper, we release the code and continue working on the code. So it's a team effort with my co-authors as well. [00:17:07]Alessio: We mentioned Hazy Research a bunch of times on the podcast before. I would love for you to spend five minutes just talking about how does the group work? How do people get together? How do you bounce ideas off of each other? Yeah. [00:17:21]Tri: So Hazy Research is a research group at Stanford led by one of my advisors, Chris Re. I love the people there. It was one of the best experiences I had. They've made my PhD so much more enjoyable. And I think there are a couple of ways that the group has been working pretty well. So one is, I think there's a diverse pool of people who either, you know, some of them focus on algorithms and theory, some of them focus on building systems, some of them focus on applications. And as a result, there is this flow of idea. So as an example, some of us were working on like more algorithms and theory, and then we can talk to the folks building systems and say, hey, let's try it out and let's put it in the systems and see how it is. And there you will get feedback from systems folks. They will say, hey, we implemented this, or we tried this and this is where it doesn't work, something like that. And once we put it in the systems, the application folks can use the algorithm or new methods or new models. And we again get great feedback from them because the application folks, for example, some of my good friends, they focus on medical imaging or seizure detection. And that is the problem they care about. And if your method doesn't work on the task they care about, they will tell you. Whereas I think a lot of people in machine learning, they're a little bit more flexible. So they will be like, hey, it doesn't work on seizure detection. Let's try some other task, right? But having that direct feedback of like, hey, it doesn't work there, let's figure out why. I think that that feedback allows us to do better work. And I think that kind of process of exchanging ideas, validating it in a real system so that applications folks can try it out and give you feedback. That cycle has been very, very useful. And so that's one, having a diverse group of people. The other one is, and this is something I really appreciate from advice from Chris was try to understand the fundamental, right? And he's happy letting me go off and read some textbooks and playing with things because I think a lot of research ideas come from understanding the old literature and see how it fits with the new landscape. And so if you just new archive papers every day, that's great, but you also need to read textbooks. And that's one advice I got from Chris, which is understand the fundamentals. And I think that allows us to do more impactful work. [00:19:46]Alessio: How do you think about academia versus industry? 
I feel like AI / Machine Learning has been an area where up until three, four years ago, most of the cutting edge work was being done in academia. And now there's all these big industry research labs. You're obviously going to Princeton, so you're an academia believer. How should people think about where to go? Say I'm doing my master's, I have to decide between doing a PhD and going into OpenAI or Anthropic. How should I decide? [00:20:15]Tri: I think they kind of play a complementary role, in my opinion. Of course, I also was considering different paths as well. So I think right now, scaling matters a lot, especially when you talk about language models and AI and so on. Scaling matters a lot. And that means that you need compute resources and you need infrastructure and you need engineers' time. And so industry tends to have an advantage when it comes to scaling things. But a lot of the ideas actually came from academia. So let's take attention, which got popular with the Transformer in 2017. Attention actually has been around for a while. So I think the first mention was in 2014, a paper from Bahdanau and others with Yoshua Bengio, which came from academia. A lot of ideas did come from academia. And scaling things up, of course, I think OpenAI has been great at scaling things up. That was the bet that they made after, I think, GPT-2. So they saw that scaling these things up, to what was back then 1.5 billion parameters, seemed to give you amazing capabilities. So they really committed to that. They really committed to scaling things. And that turned out to be a pretty successful bet. I think for academia, we're still trying to figure out exactly what we're doing in this shifting landscape. And so lots of folks have been focusing on, for example, evaluation. So I know the Stanford Center for Research on Foundation Models led by Percy, they have this benchmark called HELM, which is this holistic benchmark. So trying to figure out, okay, characterizing the landscape of different kinds of models, what people should evaluate, what people should measure, and things like that. So evaluation is one role. The other one is understanding. So this has happened historically where there's been some development in the industry and academia can play a role in explaining, understanding. They have the luxury to slow down trying to understand stuff, right? So there are lots of papers on understanding what's really going on, probing these models, and so on. I think I'm not as familiar with the NLP literature, but my impression is there's a lot of that going on in the NLP conferences, which is understanding what these models are doing, what capabilities they have, and so on. And the third one I could see is that academia can take more risky bets in the sense that we can work on stuff that is quite different from industry. I think industry, my impression is you have some objective. You're trying to say, hey, for this quarter, we want to scale the model in this particular way. Next quarter, we want the model to have these capabilities. You're trying to set objectives that maybe, I don't know, have a 70% chance of working out, because it's important for the company's direction. I think for academia, the way things work is you have many, many researchers or PhD students, and they're kind of pursuing independent directions. And they have a little bit more flexibility on, hey, I'm going to try out this seemingly crazy idea and see, let's say there's a 30% chance of success or something.
And however you define success, for academia, a lot of the time, success just means like, hey, we found something interesting. That could eventually go into industry through collaboration and so on. So I do see academia and industry kind of playing complementary roles. And as for someone choosing a career, I think just more and more generally, industry would be probably better in terms of compensation, in terms of probably work-life balance. But my biased perspective is that maybe academia gives you a little bit more freedom to think and understand things. So it probably comes down to personal choice. I end up choosing to be a professor next year at Princeton. But of course, I want to maintain a relationship with industry folks. I think industry folks can provide very valuable feedback to what we're doing in academia so that we understand where the field is moving because some of the directions are very much influenced by what, for example, OpenAI or Google is doing. So we want to understand where the field is moving. What are some promising applications? And try to anticipate, okay, if the field is moving like this, these applications are going to be popular. What problems will be important in two, three years? And then we try to start thinking about those problems so that hopefully in two, three years, we have some of the answers to some of these problems in two, three years. Sometimes it works out, sometimes it doesn't. But as long as we do interesting things in academia, that's the goal. [00:25:03]Alessio: And you mentioned the eval side. So we did a Benchmarks 101 episode. And one of the things we were seeing is sometimes the benchmarks really influence the model development. Because obviously, if you don't score well on the benchmarks, you're not going to get published and you're not going to get funded. How do you think about that? How do you think that's going to change now that a lot of the applications of these models, again, is in more narrow industry use cases? Do you think the goal of the academia eval system is to be very broad and then industry can do their own evals? Or what's the relationship there? [00:25:40]Tri: Yeah, so I think evaluation is important and often a little bit underrated. So it's not as flashy as, oh, we have a new model that can do such and such. But I think evaluation, what you don't measure, you can't make progress on, essentially. So I think industry folks, of course, they have specific use cases that their models need to do well on. And that's what they care about. Not just academia, but other groups as well. People do understand what are some of the emerging use cases. So for example, now one of the most popular use cases is Chatbot. And then I think folks from Berkeley, some of them are from Berkeley, call them MLCs. They set up this kind of Chatbot arena to essentially benchmark different models. So people do understand what are some of the emerging use cases. People do contribute to evaluation and measurement. And as a whole, I think people try to contribute to the field and move the field forward, albeit that maybe slightly different directions. But we're making progress and definitely evaluation and measurement is one of the ways you make progress. So I think going forward, there's still going to be just more models, more evaluation. We'll just have better understanding of what these models are doing and what capabilities they have. 
[00:26:56]Alessio: I like that your work has been focused on not making benchmarks better, but it's like, let's just make everything faster. So it's very horizontal. So FlashAttention 2, you just released that on Monday. I read in the blog post that a lot of the work was also related to some of the NVIDIA library updates. Yeah, maybe run us through some of those changes and some of the innovations there. Yeah, for sure. [00:27:19]Tri: So FlashAttention 2 is something I've been working on for the past couple of months. So the story is the NVIDIA CUTLASS team, they released a new version of their library, which contains all these primitives to allow you to do matrix multiply or memory loading on GPU efficiently. So it's a great library and I built on that. So they released their version 3 back in January and I got really excited and I wanted to play with that library. So as an excuse, I was just like, okay, I'm going to refactor my code and use this library. So that was kind of the start of the project. By the end, I just ended up working with the code a whole lot more and I realized that, hey, there are these inefficiencies still in Flash Attention. We could change this way or that way and make it, in the end, twice as fast. But of course, building on the library that the NVIDIA folks released. So that was kind of a really fun exercise. I was starting out, it's just an excuse for myself to play with the new library. What ended up was several months of improvement, improving Flash Attention, discovering new ideas. And in the end, we managed to make it 2x faster and now it's pretty close to probably the efficiency of things like matrix multiply, which is probably the most optimized subroutine on the planet. So we're really happy about it. The NVIDIA Cutlass team has been very supportive and hopefully in the future, we're going to collaborate more. [00:28:46]Alessio: And since it's an NVIDIA library, can you only run this on CUDA runtimes? Or could you use this and then run it on an AMD GPU? [00:28:56]Tri: Yeah, so it's an NVIDIA library. So right now, the code we release runs on NVIDIA GPUs, which is what most people are using to train models. Of course, there are emerging other hardware as well. So the AMD folks did implement a version of Flash Attention, I think last year as well, and that's also available. I think there's some implementation on CPU as well. For example, there's this library, ggml, where they implemented the same idea running on Mac and CPU. So I think that kind of broadly, the idea would apply. The current implementation ended up using NVIDIA's library or primitives, but I expect these ideas to be broadly applicable to different hardware. I think the main idea is you have asymmetry in memory hierarchy, which tends to be everywhere in a lot of accelerators. [00:29:46]Alessio: Yeah, it kind of reminds me of Sara Hooker's post, like the hardware lottery. There could be all these things that are much better, like architectures that are better, but they're not better on NVIDIA. So we're never going to know if they're actually improved. How does that play into some of the research that you all do too? [00:30:04]Tri: Yeah, so absolutely. Yeah, I think Sara Hooker, she wrote this piece on hardware lottery, and I think she captured really well of what a lot of people have been thinking about this. 
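Editor's aside on the FlashAttention-2 release discussed above: one way to call a fused attention kernel from Python without writing CUDA is PyTorch's built-in scaled_dot_product_attention (available in PyTorch 2.x), which can dispatch to a FlashAttention-style backend on supported NVIDIA GPUs. A minimal sketch, assuming a CUDA device and half precision; this uses PyTorch's own kernels, not necessarily the flash-attn package itself:

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# shape convention: (batch, heads, sequence length, head dimension)
q = torch.randn(2, 8, 1024, 64, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch picks a backend (FlashAttention-style, memory-efficient, or plain math)
# based on dtype, shapes, and hardware; on unsupported setups it falls back.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])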
And I certainly think about hardware lottery quite a bit, given that I do some of the work that's kind of really low level at the level of, hey, we're optimizing for GPUs or NVIDIA GPUs and optimizing for attention itself. And at the same time, I also work on algorithms and methods and transformer alternatives. And we do see this effect in play, not just hardware lottery, but also kind of software framework lottery. You know, attention has been popular for six years now. And so many kind of engineer hours has been spent on making it as easy and efficient as possible to run transformer, right? And there's libraries to do all kinds of tensor parallel, pipeline parallel, if you use transformer. Let's say someone else developed alternatives, or let's just take recurrent neural nets, like LSTM, GRU. If we want to do that and run that efficiently on current hardware with current software framework, that's quite a bit harder. So in some sense, there is this feedback loop where somehow the model architectures that take advantage of hardware become popular. And the hardware will also kind of evolve to optimize a little bit for that kind of architecture and software framework will also evolve to optimize for that particular architecture. Right now, transformer is the dominant architecture. So yeah, I'm not sure if there is a good way out of this. Of course, there's a lot of development. Things like, I think compilers will play a role because compilers allow you to maybe still be much more efficient across different kinds of hardware because essentially you write the same code and compiler will be able to make it run efficiently different kinds of hardware. So for example, there's this language Mojo, they're compiler experts, right? And their bet is AI models will be running on different kinds of devices. So let's make sure that we have really good compilers with a good language that then the compiler can do a good job optimizing for all kinds of devices. So that's maybe one way that you can get out of this cycle. But yeah, I'm not sure of a good way. In my own research, I have to think about both the algorithm new model and how it maps to hardware. So there are crazy ideas that seem really good, but will be really, really difficult to run efficiently. And so as a result, for example, we can't really scale some of the architectures up simply because they're not hardware friendly. I have to think about both sides when I'm working on new models. [00:32:50]Alessio: Yeah. Have you spent any time looking at some of the new kind of like AI chips companies, so to speak, like the Cerebras of the world? Like one of their innovations is co-locating everything on the chip. So you remove some of this memory bandwidth issue. How do you think about that? [00:33:07]Tri: Yeah, I think that's an interesting bet. I think Tesla also has this Dojo supercomputer where they try to have essentially as fast on-chip memory as possible and removing some of these data transfer back and forth. I think that's a promising direction. The issues I could see, you know, I'm definitely not a hardware expert. One issue is the on-chip memory tends to be really expensive to manufacture, much more expensive per gigabyte compared to off-chip memory. So I talked to, you know, some of my friends at Cerebros and, you know, they have their own stack and compiler and so on, and they can make it work. The other kind of obstacle is, again, with compiler and software framework and so on. 
For example, if you can run PyTorch on this stuff, lots of people will be using it. But supporting all the operations in PyTorch will take a long time to implement. Of course, people are working on this. So I think, yeah, we kind of need these different bets on the hardware side as well. Hardware has, my understanding is, has a kind of a longer time scale. So you need to design hardware, you need to manufacture it, you know, maybe on the order of three to five years or something like that. So people are taking different bets, but the AI landscape is changing so fast that it's hard to predict, okay, what kind of models will be dominant in, let's say, three or five years. Or thinking back five years ago, would we have known that Transformer would have been the dominant architecture? Maybe, maybe not, right? And so different people will make different bets on the hardware side. [00:34:39]Alessio: Does the pace of the industry and the research also influence the PhD research itself? For example, in your case, you're working on improving attention. It probably took you quite a while to write the paper and everything, but in the meantime, you could have had a new model architecture come out and then it's like nobody cares about attention anymore. How do people balance that? [00:35:02]Tri: Yeah, so I think it's tough. It's definitely tough for PhD students, for researchers. Given that the field is moving really, really fast, I think it comes down to understanding fundamental. Because that's essentially, for example, what the PhD allows you to do. It's been a couple of years understanding the fundamentals. So for example, when I started my PhD, I was working on understanding matrix vector multiply, which has been a concept that's been around for hundreds of years. We were trying to characterize what kind of matrices would have theoretically fast multiplication algorithm. That seems to have nothing to do with AI or anything. But I think that was a time when I developed mathematical maturity and research taste and research skill. The research topic at that point didn't have to be super trendy or anything, as long as I'm developing skills as a researcher, I'm making progress. And eventually, I've gotten quite a bit better in terms of research skills. And that allows, for example, PhD students later in their career to quickly develop solutions to whatever problems they're facing. So I think that's just the natural arc of how you're being trained as a researcher. For a lot of PhD students, I think given the pace is so fast, maybe it's harder to justify spending a lot of time on the fundamental. And it's tough. What is this kind of explore, exploit kind of dilemma? And I don't think there's a universal answer. So I personally spend some time doing this kind of exploration, reading random textbooks or lecture notes. And I spend some time keeping up with the latest architecture or methods and so on. I don't know if there's a right balance. It varies from person to person. But if you only spend 100% on one, either you only do exploration or only do exploitation, I think it probably won't work in the long term. It's probably going to have to be a mix and you have to just experiment and kind of be introspective and say, hey, I tried this kind of mixture of, I don't know, one exploration paper and one exploitation paper. How did that work out for me? Should I, you know, having conversation with, for example, my advisor about like, hey, did that work out? You know, should I shift? 
I focus more on one or the other. I think quickly adjusting and focusing on the process. I think that's probably the right way. I don't have like a specific recommendation that, hey, you focus, I don't know, 60% on lecture notes and 40% on arXiv papers or anything like that. [00:37:35]Alessio: Let's talk about some Transformer alternatives. You know, say Jonathan Frankle loses his bet and Transformer is not the state of the art architecture. What are some of the candidates to take over? [00:37:49]Tri: Yeah, so this bet is quite fun. So my understanding is this bet between Jonathan Frankle and Sasha Rush, right? I've talked to Sasha a bunch and I think he recently gave an excellent tutorial on Transformer alternatives as well. So I would recommend that. So just to quickly recap, I think there's been quite a bit of development more recently about Transformer alternatives. So architectures that are not Transformer, right? And the question is, can they do well on, for example, language modeling, which is kind of the application that a lot of people care about these days. So there are methods based on state space methods that came out in 2021 from Albert Gu, Karan Goel, and Chris Ré that presumably could do much better in terms of capturing long range information while not scaling quadratically. They scale sub-quadratically in terms of sequence length. So potentially you could have a much more efficient architecture when sequence length gets really long. The other ones have been focusing more on recurrent neural nets, which is, again, an old idea, but adapting to the new landscape. So things like RWKV, I've also personally worked in this space as well. So there have been some promising results. There have been results here and there that show that, hey, these alternatives, either RNN or state space methods, can match the performance of Transformer on language modeling. So that's really exciting. And we're starting to understand on the academic research side, we want to understand, do we really need attention? I think that's a valuable kind of intellectual thing to understand. And maybe we do, maybe we don't. If we want to know, we need to spend serious effort on trying the alternatives. And there's been folks pushing on this direction. I think RWKV has scaled up to, I think, a model at 14 billion that seems pretty competitive with Transformer. So that's really exciting. That's kind of an intellectual thing. We want to figure out if attention is necessary. So that's one motivation. The other motivation is that Transformer alternatives could have an advantage in practice in some of the use cases. So one use case is really long sequences. The other is really high throughput of generation. So for really long sequences, when you train with Transformer, with flash attention and so on, the computation is still quadratic in the sequence length. So if your sequence length is on the order of, I don't know, 16K, 32K, 100K or something, which some of these models have sequence length 100K, then you do get significantly slower in terms of training, also in terms of inference. So maybe these alternative architectures could scale better in terms of sequence length. I haven't seen actual validation on this. Let's say an RNN model released with context length, I don't know, 100K or something. I haven't really seen that. But the hope could be that as we scale to long sequences, these alternative architectures could be more well-suited.
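For readers who want the contrast Tri is drawing to be concrete, here is an editor's toy sketch of a linear recurrence, the basic building block behind RNN and state-space alternatives: compute is linear in sequence length and the carried state has a fixed size, unlike attention's L-by-L score matrix. The dimensions and the plain dense recurrence are illustrative assumptions, not S4 or RWKV:

import numpy as np

def recurrent_pass(x, A, B):
    # h_t = A @ h_{t-1} + B @ x_t : one fixed-size state per step,
    # so time is O(L) in sequence length and the memory carried forward
    # does not grow with the history (which also makes generation cheap).
    h = np.zeros(A.shape[0])
    outs = []
    for x_t in x:
        h = A @ h + B @ x_t
        outs.append(h.copy())
    return np.stack(outs)

L, d_in, d_state = 1024, 16, 32
x = np.random.randn(L, d_in)
A = 0.1 * np.random.randn(d_state, d_state)  # kept small so this toy recurrence stays stable
B = 0.1 * np.random.randn(d_state, d_in)
y = recurrent_pass(x, A, B)
print(y.shape)  # (1024, 32); vanilla attention would build a 1024 x 1024 score matrix instead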
Not just text, but things like high resolution images, audio, video, and so on, which are emerging applications. So that's one, long sequences. Number two is high throughput generation, where I can imagine scenarios where the application isn't like an interactive chatbot, but let's say a company wants to batch as many requests as possible on their server, or they're doing offline processing, they're generating stuff based on their internal documents that they need to process in batch. And the issue with Transformer is that during generation, it essentially needs to keep around all the previous history. It's called the KV cache. And that could take a significant amount of memory, so you can't really batch too much because you run out of memory. I am personally bullish on RNNs. I think RNNs, they essentially summarize the past into a state vector that has fixed size, so the size doesn't grow with the history. So that means that you don't need as much memory to keep around all the previous tokens. And as a result, I think you can scale to much higher batch sizes. And as a result, you can make much more efficient use of the GPUs or the accelerator, and you could have much higher generation throughput. Now, this, I don't think, has been validated at scale. So as a researcher, I'm bullish on this stuff because I think in the next couple of years, these are use cases where these alternatives could have an advantage. We'll just kind of have to wait and see to see if these things will happen. I am personally bullish on this stuff. At the same time, I also spend a bunch of time making attention as fast as possible. So maybe hedging and playing both sides. Ultimately, we want to understand, as researchers, we want to understand what works, why do the models have these capabilities? And one way is, let's push attention to be as efficient as possible. On the other hand, let's push other alternatives to be as efficient at scale, as big as possible, so that we can kind of compare them and understand. Yeah, awesome. [00:43:01]Alessio: And I think as long as all of this work happens in the open, it's a net positive for everybody to explore all the paths. Yeah, let's talk about open-source AI. Obviously, with Together, when RedPajama came out, which was an open clone of the LLAMA1 pre-training dataset, it was a big thing in the industry. LLAMA2 came out on Tuesday, I forget. And this week, there's been a lot of things going on, which they call open-source, but it's not really open-source. Actually, we wrote a post about it that was on the front page of Hacker News before this podcast, so I was frantically responding. How do you think about what open-source AI really is? In my mind, in open-source software, we have different levels of open. So there's free software, that's like the GPL license. There's open-source, which is Apache, MIT. And then there's kind of restricted open-source, which is the SSPL and some of these other licenses. In AI, you have the open models. So RedPajama is an open model because you have the pre-training dataset, you have the training runs and everything. And then there's obviously randomness that doesn't make it one-to-one if you retrain it. Then you have the open-weights model that's kind of like StableLM, where the weights are open, but the dataset is not open. And then you have LLAMA2, where the dataset is not open and the weights are restricted. It's kind of like not really open-source, but open enough.
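Editor's aside on the KV cache point above: a rough back-of-the-envelope comparison of generation-time memory for a Transformer's KV cache versus a fixed-size recurrent state. The model shapes below are assumed for illustration (roughly 7B-class), not numbers from the episode:

def kv_cache_bytes(batch, seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    # Transformer generation keeps K and V for every past token, in every layer
    return 2 * batch * seq_len * n_layers * n_heads * head_dim * bytes_per_elem

def rnn_state_bytes(batch, n_layers, state_dim, bytes_per_elem=2):
    # A recurrent model carries only a fixed-size state, independent of history length
    return batch * n_layers * state_dim * bytes_per_elem

cfg = dict(batch=32, n_layers=32, n_heads=32, head_dim=128)  # assumed shapes
print("KV cache at 4k context: %.1f GB" % (kv_cache_bytes(seq_len=4096, **cfg) / 1e9))
print("Recurrent state:        %.3f GB" % (rnn_state_bytes(batch=32, n_layers=32, state_dim=4096) / 1e9))

With these assumed shapes the KV cache for a batch of 32 at 4k context is on the order of tens of gigabytes, while the recurrent state is a few megabytes, which is the batching advantage Tri describes.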
I think it's net positive because it's like $3 million of flops donated to the public. [00:44:32]Tri: How do you think about that? [00:44:34]Alessio: And also, as you work together, what is your philosophy with open-source AI? Right, right. [00:44:40]Tri: Yeah, I think that's a great question. And I think about it on maybe more practical terms. So of course, Meta has done an amazing job training LLAMA1, LLAMA2. And for LLAMA2, they make it much less restrictive compared to LLAMA1. Now you can use it for businesses, unless you are a monthly active user or something like that. I think just this change will have a very significant impact in the kind of landscape of open-source AI, where now lots of businesses, lots of companies will be using, I expect will be using things like LLAMA2. They will fine-tune on their own dataset. They will be serving variants or derivatives of LLAMA2. Whereas before, with LLAMA1, it was also a really good model, but your business companies weren't allowed to do that. So I think on a more practical term, it's kind of shifting the balance between a closed-source model like OpenAI and Anthropic and Google, where you're making API calls, right? And maybe you don't understand as much of what the model is doing, how the model is changing, and so on. Versus now, we have a model with open weight that is pretty competitive from what I've seen in terms of benchmarks, pretty competitive with GPT 3.5, right? And if you fine-tune it on your own data, maybe it's more well-suited for your own data. And I do see that's going to shift the balance of it. More and more folks are going to be using, let's say, derivatives of LLAMA2. More and more folks are going to fine-tune and serve their own model instead of calling an API. So that shifting of balance is important because in one way, we don't want just a concentration of decision-making power in the hands of a few companies. So I think that's a really positive development from Meta. Of course, training the model takes a couple of millions of dollars, but engineers have and I'm sure they spend tons of time trying many, many different things. So the actual cost is probably way more than that. And they make the weights available and they allow probably a lot of companies are going to be using this. So I think that's a really positive development. And we've also seen amazing progress on the open source community where they would take these models and they either fine-tune on different kinds of data sets or even make changes to the model. So as an example, I think for LLAMA1, the context lane was limited to 2K. Like a bunch of folks figured out some really simple methods to scale up to like 8K. [00:47:12]Alessio: Like the RoPE. [00:47:13]Tri: Yes. I think the open source community is very creative, right? And lots of people. LLAMA2 will, again, kind of accelerate this where more people will try it out. More people will make tweaks to it and make a contribution and then so on. So overall, I think I see that as still a very positive development for the field. And there's been lots of libraries that will allow you to host or fine-tune these models, like even with quantization and so on. Just a couple of hours after LLAMA2 was released, tons of companies announcing that, hey, it's on our API or hosting and so on and together did the same. So it's a very fast-paced development and just kind of a model with available weights that businesses are allowed to use. I think that alone is already a very positive development. 
At the same time, yeah, we can do much better in terms of releasing data sets. Data sets tend to be... Somehow people are not incentivized to release data sets. So philosophically, yeah, you want to be as open as possible. But on practical terms, I think it's a little bit harder for companies to release data sets. Legal issues. The data sets released tend to be not as eye-catchy as the model release. So maybe people are less incentivized to do that. We've seen quite a few companies releasing data sets. Together released the RedPajama data set. I think Cerebras then worked on that, deduplicated and cleaned it up, and released SlimPajama, and so on. So we're also seeing positive development on that front, kind of on the pre-training data set. So I do expect that to continue. And then on the fine-tuning data set or instruction tuning data set, I think we now have quite a few open data sets on instruction tuning and fine-tuning. But these companies do pay for human labelers to annotate these instruction tuning data sets. And that is expensive. And maybe they will see that as their competitive advantage. And so it's harder to incentivize these companies to release these data sets. So I think on practical terms, we're still going to make a lot of progress on open source AI, on the model development, on model hosting, on pre-training data sets and fine-tuning data sets. Right now, maybe we don't have the perfect open source model with all the data sets available. Maybe we don't have such a thing yet, but we've seen very fast development on the open source side. I think just maybe this time last year, there weren't as many models that were competitive with, let's say, ChatGPT. [00:49:43]Alessio: Yeah, I think the open data sets have so much more impact than open models. If you think about EleutherAI and the work that they've done, GPT-J was great, and the Pythia models are great, but the Pile and the Stack, everybody uses them. So hopefully we get more people to contribute time to work on data sets instead of doing the 100th open model that performs worse than all the other ones, but they want to say they released the model. [00:50:14]Tri: Yeah, maybe the question is, how do we figure out an incentive structure so that companies are willing to release open data sets? And for example, it could be like, I think some of the organizations are now doing this where they are asking volunteers to annotate and so on. And maybe the Wikipedia model of data sets, especially for instruction tuning, could be interesting where people actually volunteer their time and instead of editing Wikipedia, add annotation. And somehow they acknowledge and feel incentivized to do so. Hopefully we get to that kind of level where, in terms of data, it would be kind of like Wikipedia. And in terms of model development, it's kind of like Linux where people are contributing patches and improving the model in some way. I don't know exactly how that's going to happen, but based on history, I think there is a way to get there. [00:51:05]Alessio: Yeah, I think the Dolly-15K data set is a good example of a company saying, let's do this smaller thing, just make sure we make it open. We had Mike Conover from Databricks on the podcast, and he was like, people just bought into it and leadership was bought into it. You have companies out there with 200,000, 300,000 employees. It's like, just put some of them to label some data. It's going to be helpful. So I'm curious to see how that evolves. What made you decide to join Together?
[00:51:35]Tri: For Together, the focus has been a lot on open source models. And I think that aligns quite well with what I care about, of course. I also know a bunch of people there that I know and trust, and I'm excited to work with them. Philosophically, the way they've been really open with data set and model release, I like that a lot. Personally, for example, for the research that I've developed, we also try to make code available, free to use and modify and so on, contributing to the community. That has given us really valuable feedback from the community and helped us improve our work. So philosophically, I like the way Together has been focusing on open source models. And the nice thing is we're also going to be at the forefront of research, and the kind of research areas that I'm really excited about, things like efficient training and inference, align quite well with what the company is doing. We'll try our best to make things open and available to everyone. Yeah, but it's going to be fun being at the company, leading a team, doing research on the topic that I really care about, and hopefully we'll make things open to benefit the community. [00:52:45]Alessio: Awesome. Let's jump into the lightning round. Usually, I have two questions. So one is on acceleration, one on exploration, and then a takeaway. So the first one is, what's something that already happened in AI machine learning that you thought would take much longer than it has? [00:53:01]Tri: I think understanding jokes. I didn't expect that to happen, but it turns out that by scaling models up and training on lots of data, the model can now understand jokes. Maybe it's a small thing, but that was amazing to me. [00:53:16]Alessio: What about the exploration side? What are some of the most interesting unsolved questions in the space? [00:53:22]Tri: I would say reasoning in the broad term. We don't really know how these models do it. Essentially, they do something that looks like reasoning. We don't know how they're doing it. We have some ideas. And in the future, I think we will need to design architectures that explicitly have some kind of reasoning module in them if we want to have much more capable models. [00:53:43]Alessio: What's one message you want everyone to remember today? [00:53:47]Tri: I would say try to understand both the algorithm and the systems that these algorithms run on. I think the intersection of machine learning and systems has been really exciting, and there's been a lot of amazing results at this intersection. And then when you scale models to large scale, both the machine learning side and the system side really matter. [00:54:06]Alessio: Awesome. Well, thank you so much for coming on, Tri. [00:54:09]Tri: This was great. Yeah, this has been really fun. [00:54:11] Get full access to Latent Space at www.latent.space/subscribe
Robert DT on twitter: @DeeperThrill Doctorate on biomedical engineering with a focus on AI Entrepreneur building biomedical systems with AI specifically; medical imaging The conversation centers on the role of artificial intelligence (AI) in medical imaging, with an emphasis on computer vision and the utilization of existing imaging algorithms. Transformers, a type of deep learning model, are discussed for their unique self-attention mechanism and applications in natural language processing and computer vision. The talk pivots to data cleaning, specifically anonymization and safeguarding personal identifiers in the context of healthcare. Questions arise about data storage in healthcare facilities and the process of transferring it to the cloud. The conversation broadens to encompass AI's predictive capabilities and inherent risks, including the possibility of AI predicting third-order effects of its own interventions and concerns about excessive trust in AI predictions. The potential of AI in genetic engineering surfaces, particularly regarding CRISPR technology and nanobots. The conversation explores the benefits and risks of such advancements, including the revival of extinct plants and emergence of new diseases. Finally, the conversation shifts to societal implications of AI, including job displacement, the emergence of an attention economy, and the prospects of decentralized AI. The importance of understanding the limits of AI is underscored. Show notes We need to examine what's currently happening in the field of AI, particularly in relation to medical imaging. This involves an exploration of computer vision technologies and how pre-existing imaging algorithms are being applied. We should discuss the concept of a "transformer" in the context of artificial intelligence. A critical part of working with AI is data cleaning. This includes the process of anonymization, ensuring that we only use the person's image and not any identifiable data like their name. We must also consider the storage of this data, which is typically housed on hospital servers. Additionally, there's the question of how this data is transferred to a cloud system for further processing. Let's explore the issue of gatekeeping in the field of AI. This might involve discussing the role of clinical trials and the Institutional Review Board in ensuring ethical standards. The engineering aspect of gatekeeping also requires attention, particularly when dealing with 3D data sets for imaging. We should highlight two major changes currently happening in the field of AI. Swin Transformers represent a significant development, as they are built off the concept of transformers in AI. Let's delve into the world of language modeling and chatbots. We must also consider the potential downsides of these AI technologies. The transhumanism angle presents an interesting point of discussion, particularly in relation to the next generation of technology. For example, the development of the mRNA vaccine was a major leap forward in response to global health crises. There's also the concept of generative mRNA vaccines, which use AI to generate potential cures. However, these AI technologies also come with risks. They could inadvertently create a disease, or develop a cure that isn't effective. The ease with which technology can be used in this field means that virtually anyone can make implants, leading to a new set of challenges. We should also discuss the emerging role of AI in lab-based work, such as managing petri dishes. 
The application of Hegelian principles to AI provides an interesting philosophical perspective. Looking ahead, we might consider what a lab kit might look like in ten years. The idea of the first version of something, and its relationship to anti-authoritarianism, is another interesting topic to explore. We have to acknowledge that AI, despite its potential, will not prevent all risks. AI can be used as a predictive tool for triaging, helping to determine whether an intervention will benefit a person. The use of CRISPR technology is another relevant point of discussion, especially considering its potential downsides, its application in nanobot technology, its use in regrowing extinct plants, the potential for new diseases arising from its use, and the systematics of finding new plant species in places like the Amazon. Let's also consider the case of the dodo and the role of technology in its extinction. With a small sample size, AI can predict certain outcomes, a feature that can be beneficial in various fields. Most plant species are discovered rather than created, and AI can potentially help in predicting where these new species might be found. The question arises: is AI better at predicting the future? It can certainly help us see larger scale patterns that we aren't aware of. However, the act of predicting the future can create its own issues, akin to the Oracle of Delphi dilemma. For instance, can AI predict the third-order effects of its own intervention? By revealing patterns, AI becomes a more effective tool. The more layers of patterns it can show us, the better. AI and Medical Imaging: AI is increasingly being used in medical imaging, particularly through deep learning techniques. These have applications in MRI, CT, and PET scans, enhancing image reconstruction, quality, and efficiency. While impressive progress has been made, the technology still needs further development before it can be widely applied in clinical settings Transformers: Introduced in 2017, transformers are a type of deep learning model used primarily in natural language processing and computer vision. They're distinguished by their use of self-attention, enabling them to process the entire input data all at once, rather than sequentially as in Recurrent Neural Networks (RNNs). This allows for more parallelization and thus reduces training times. Transformers have become the model of choice for many NLP problems, replacing RNN models such as long short-term memory (LSTM)
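To make the self-attention mechanism described above concrete, here is a minimal single-head sketch in NumPy (editor's illustration with made-up dimensions; real transformers use multiple heads, masking, and learned projections inside much larger layers):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Single-head self-attention on a sequence X of shape (seq_len, d_model).
    # All positions are processed at once: `scores` is a (seq_len, seq_len)
    # matrix relating every token to every other token, which is what allows
    # the parallelism that sequential RNNs lack.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

seq_len, d_model, d_head = 8, 16, 16
X = np.random.randn(seq_len, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (8, 16)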
AI Today Podcast: Artificial Intelligence Insights, Experts, and Opinion
In this episode of the AI Today podcast hosts Kathleen Walch and Ron Schmelzer define the terms Recurrent Neural Networks (RNN) and Long-Short Term Memory (LSTM), explain how these terms relate to AI and why it's important to know about them. Want to dive deeper into an understanding of artificial intelligence, machine learning, or big data concepts? Continue reading AI Today Podcast: AI Glossary Series – Recurrent Neural Networks (RNN) and Long-Short Term Memory (LSTM) at AI & Data Today.
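As a quick companion to the glossary terms, here is a minimal PyTorch sketch of an LSTM consuming a batch of sequences; the sizes are arbitrary assumptions chosen only for illustration:

import torch
import torch.nn as nn

# A batch of 4 sequences, 20 time steps each, 8 features per step
x = torch.randn(4, 20, 8)

lstm = nn.LSTM(input_size=8, hidden_size=32, num_layers=1, batch_first=True)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (4, 20, 32) -- hidden state at every time step
print(h_n.shape)     # (1, 4, 32)  -- final hidden state (short-term memory)
print(c_n.shape)     # (1, 4, 32)  -- final cell state (long-term memory)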
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
Today we continue our coverage of ICLR 2023 joined by Dhruv Batra, an associate professor at Georgia Tech and research director of the Fundamental AI Research (FAIR) team at META. In our conversation, we discuss Dhruv's work on the paper Emergence of Maps in the Memories of Blind Navigation Agents, which won an Outstanding Paper Award at the event. We explore navigation with multilayer LSTM and the question of whether embodiment is necessary for intelligence. We delve into the Embodiment Hypothesis and the progress being made in language models and caution on the responsible use of these models. We also discuss the history of AI and the importance of using the right data sets in training. The conversation explores the different meanings of "maps" across AI and cognitive science fields, Dhruv's experience in navigating mapless systems, and the early discovery stages of memory representation and neural mechanisms. The complete show notes for this episode can be found at https://twimlai.com/go/629
Electronic components and hardware, semiconductors: discover all the offers from our partner Farnell France at fr.farnell.com. In this feature, Guillaume Poggiaspalla explains how artificial intelligence algorithms work. With images that are easy to picture, he walks us through the arrival of transformers. ❤️ Patreon
Predicting a late flight --- Send in a voice message: https://podcasters.spotify.com/pod/show/david-nishimoto/message
In this episode, we delve into the world of time series analysis, exploring its importance and various applications. From defining time series data to decomposing it and analyzing it through various models including Autoregressive (AR), Moving Average (MA) and Autoregressive Integrated Moving Average (ARIMA), this episode offers a comprehensive overview of this crucial tool. With real-world examples and suggestions for further reading, this episode is a must-listen for anyone interested in enhancing their understanding of time series analysis. Support the Show. Keep AI insights flowing – become a supporter of the show! Click the link for details
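As a small companion to the models discussed in this episode, here is a sketch of fitting an ARIMA model with the statsmodels library on synthetic data (the series, the (2, 1, 1) order, and the forecast horizon are all illustrative assumptions):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly-style series: trend + seasonality + noise
rng = np.random.default_rng(0)
t = np.arange(120)
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=2.0, size=t.size)

# ARIMA(p, d, q): p autoregressive lags, d differencing steps, q moving-average lags
fit = ARIMA(y, order=(2, 1, 1)).fit()
print(fit.aic)                 # fit statistic for comparing candidate orders
print(fit.forecast(steps=12))  # forecast the next 12 points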
In this episode of The AI Frontier, join us as we embark on a journey through the history of deep learning and artificial intelligence. From the earliest days of linear regression to the latest advancements in generative adversarial networks, we will explore the key moments and milestones that have shaped the development of this groundbreaking field. Learn about the pioneers and trailblazers who pushed the boundaries of what was possible, and discover how deep learning has revolutionized the way we think about and interact with technology. Get ready to delve deep into the history of AI! Support the Show. Keep AI insights flowing – become a supporter of the show! Click the link for details
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on the impact of RLHF research, published by paulfchristiano on January 25, 2023 on LessWrong. In this post I'm going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I'll discuss various arguments that RLHF research had an overall negative impact and explain why I don't find them persuasive. I'll also clarify that I don't think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between "RLHF progress" and "alignment progress." Background on my involvement in RLHF work Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background: The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model's actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. (This is in contrast with, for example, trying to formally specify the human utility function, or notions of corrigibility / low-impact / etc, in some way.) Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because: Evaluating consequences is hard. A treacherous turn can cause trouble too quickly to detect or correct even if you are able to do so, and it's challenging to evaluate treacherous turn probability at training time. It's very unclear if those issues are fatal before or after AI systems are powerful enough to completely transform human society (and in particular the state of AI alignment). Even if they are fatal, many of the approaches to resolving them still have the same basic structure of learning from expensive evaluations of actions. In order to overcome the fundamental difficulties with RLHF, I have long been interested in techniques like iterated amplification and adversarial training. However, prior to 2017 most researchers I talked to in ML (and many researchers in alignment) thought that the basic strategy of training AI with expensive human evaluations was impractical for more boring reasons and so weren't interested in these difficulties. On top of that, we obviously weren't able to actually implement anything more fancy than RLHF since all of these methods involve learning from expensive feedback. I worked on RLHF work to try to facilitate and motivate work on fixes. The history of my involvement: My first post on this topic was in 2015. When I started full-time at OpenAI in 2017 it seemed to me like it would be an impactful project; I considered doing a version with synthetic human feedback (showing that we could learn from a practical amount of algorithmically-defined feedback) but my manager Dario Amodei convinced me it would be more compelling to immediately go for human feedback. The initial project was surprisingly successful and published here. I then intended to implement a version with language models aiming to be complete in the first half of 2018 (aiming to build an initial amplification prototype with LMs around end of 2018; both of these timelines were about 2.5x too optimistic). This seemed like the most important domain to study RLHF and alignment more broadly. 
In mid-2017 Alec Radford helped me do a prototype with LSTM language models (prior to the release of transformers); the prototype didn't look promising enough to scale up. In mid-2017 Geoffrey Irving joined OpenAI and was excited about starting with RLHF and then going beyond it using debate; he also thought language models were the most important domain to study and had more conviction about that. In 2018 he started a larger team working on fine-tuning on language models, w...
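Editor's aside: the core ingredient this post keeps returning to, learning a score for model outputs from human evaluations, is commonly implemented as a reward model trained on pairwise comparisons. Below is a toy sketch under assumed shapes (outputs already embedded as vectors, a Bradley-Terry-style loss); this is not Christiano's actual setup, only an illustration of the general pattern:

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    # Toy reward model: maps an (already embedded) model output to a scalar score
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Fake "human preference" data: each pair is (preferred output, rejected output)
chosen = torch.randn(256, 64)
rejected = torch.randn(256, 64)

for _ in range(100):
    # Bradley-Terry style objective: push score(chosen) above score(rejected)
    loss = -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()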
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on the impact of RLHF research, published by Paul Christiano on January 25, 2023 on The AI Alignment Forum. In this post I'm going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I'll discuss various arguments that RLHF research had an overall negative impact and explain why I don't find them persuasive. I'll also clarify that I don't think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between "RLHF progress" and "alignment progress." Background on my involvement in RLHF work Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background: The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model's actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. (This is in contrast with, for example, trying to formally specify the human utility function, or notions of corrigibility / low-impact / etc, in some way.) Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because: Evaluating consequences is hard. A treacherous turn can cause trouble too quickly to detect or correct even if you are able to do so, and it's challenging to evaluate treacherous turn probability at training time. It's very unclear if those issues are fatal before or after AI systems are powerful enough to completely transform human society (and in particular the state of AI alignment). Even if they are fatal, many of the approaches to resolving them still have the same basic structure of learning from expensive evaluations of actions. In order to overcome the fundamental difficulties with RLHF, I have long been interested in techniques like iterated amplification and adversarial training. However, prior to 2017 most researchers I talked to in ML (and many researchers in alignment) thought that the basic strategy of training AI with expensive human evaluations was impractical for more boring reasons and so weren't interested in these difficulties. On top of that, we obviously weren't able to actually implement anything more fancy than RLHF since all of these methods involve learning from expensive feedback. I worked on RLHF work to try to facilitate and motivate work on fixes. The history of my involvement: My first post on this topic was in 2015. When I started full-time at OpenAI in 2017 it seemed to me like it would be an impactful project; I considered doing a version with synthetic human feedback (showing that we could learn from a practical amount of algorithmically-defined feedback) but my manager Dario Amodei convinced me it would be more compelling to immediately go for human feedback. The initial project was surprisingly successful and published here. I then intended to implement a version with language models aiming to be complete in the first half of 2018 (aiming to build an initial amplification prototype with LMs around end of 2018; both of these timelines were about 2.5x too optimistic). This seemed like the most important domain to study RLHF and alignment more broadly. 
In mid-2017 Alec Radford helped me do a prototype with LSTM language models (prior to the release of transformers); the prototype didn't look promising enough to scale up. In mid-2017 Geoffrey Irving joined OpenAI and was excited about starting with RLHF and then going beyond it using debate; he also thought language models were the most important domain to study and had more conviction about that. In 2018 he started a larger team working on fine-tuning on langu...
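The post above describes the core RLHF recipe: humans evaluate a model's outputs and the model is trained to produce highly-evaluated actions. As a rough illustration of that idea only (not the code from the work discussed), here is a minimal sketch of fitting a reward model from pairwise human preferences; the module names, feature shapes, and toy data are all assumptions.

```python
# Minimal, illustrative sketch: learning a reward model from pairwise human
# preferences, the core ingredient of the RLHF recipe described in the post.
# All names, shapes, and the synthetic "feedback" below are assumptions.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):                 # x: features of a (prompt, response) pair
        return self.net(x).squeeze(-1)    # scalar reward

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy "human feedback": for each pair of responses, the first one was preferred.
preferred = torch.randn(256, 16)
rejected = torch.randn(256, 16)

for _ in range(100):
    # Bradley-Terry / logistic loss: push reward(preferred) above reward(rejected).
    loss = -nn.functional.logsigmoid(model(preferred) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned reward model would then score candidate responses so a policy can
# be optimized (e.g. with RL) to produce highly-evaluated actions.
```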
Millions of people around the world go to bed hungry every night. It is an appalling fact, especially in a world where billions of tonnes of food are wasted every year. There are few places on Earth, if any, where hunger is more of a critical issue than in Somalia, East Africa. Malnutrition is rife in the country, which has been beset by drought and conflict, and government officials warn Somalia is on the brink of famine. In this episode of the Food Matters Live podcast, we meet the people who are working tirelessly to stop hunger, in Somalia and elsewhere in the world. Action Against Hunger works in more than 50 countries around the world, providing life-saving programmes and helping millions of people. But how does an organisation like this go about helping so many people? The Love Food Give Food Campaign: Action Against Hunger UK's Love Food Give Food campaign brings together the food and hospitality industry and their customers to take action against hunger. Through our shared love of food, we can fight the hunger crisis and help vulnerable communities to build a brighter future. Just £1 can provide a mother with a day's worth of life-saving therapeutic food to treat a malnourished child. In September and October, you can support our life-saving work by donating on Action Against Hunger's website or at one of the restaurants participating in the campaign. What work goes into getting the right help, to the right people, at the right time? We are joined from Mogadishu by Action Against Hunger's Country Director in Somalia, Ahmed Khalif, who details the exact work that is happening in the country. He tells us about the groundwork that needed to be laid to allow aid to arrive, as well as how they overcome the challenge of working with different communities in the country. Alison McNulty, Action Against Hunger UK's Director of Operations, reveals how the charity works out what help is needed, and the logistics of getting it where it is most needed. Action Against Hunger UK's Head of Nutrition, Alexandra Rutishauser-Perera, sheds some light on the innovations in nutrition that are helping to ensure the maximum amount of help is sent to the maximum number of people. Alexandra Rutishauser-Perera, Action Against Hunger UK's Head of Nutrition: Alexandra is a multi-skilled registered nurse holding an MSc in Nutrition for Global Health from the London School of Hygiene and Tropical Medicine and a Diploma in Tropical Medicine. She has 16 years' experience in the field of public health nutrition in diverse settings across Africa and Asia. Before Action Against Hunger, she worked for Doctors Without Borders, International Medical Corps and Save the Children. Recognised for her technical skills, she is a guest lecturer at LSHTM, LSTM, Metropolitan and Westminster Universities, King's College and the American University of Beirut. She is, amongst others, a member of the strategic advisory groups of the Global Nutrition Cluster and the Infant Feeding in Emergencies core group, and co-chairs the Global Technical Assistance Mechanism on Nutrition Information Systems. She also coordinates a group of researchers working on the topic of kwashiorkor. Alison McNulty, Action Against Hunger UK's Director of Operations: Alison McNulty has been Operations Director at Action Against Hunger UK since 2020. She brings a strong background in research, evaluation and strategy following a career as a health and social care researcher, and has managed and led teams to support the creation of strong evidence.
Alison has worked in the humanitarian sector since 2010, after leaving academia, and is passionate about showcasing the impact of the charity as it strives for a world free from hunger. Ahmed Khalif, Action Against Hunger Country Director in Somalia: Ahmed is a seasoned humanitarian and development professional who has held middle and senior leadership positions in various organisations. Ahmed holds a Master of Arts in Humanitarian Assistance from Tufts University, USA; a Post Graduate Certificate in Humanitarian Leadership from Deakin University; and a Bachelor of Arts in Business Studies from the University of Northampton, UK. He has a specific interest in strengthening systems such as food, health, and water, as well as in disaster resilience.
Steven Banerjee: How Machine Intelligence, NLP and AI are changing Health Care [Audio] Steven Banerjee is the CEO of NExTNet Inc. NExTNet is a Silicon Valley based technology startup pioneering a natural language based Explainable AI platform to accelerate drug discovery and development. Steven is also the founder of Mekonos, a Silicon Valley based biotechnology company backed by world-class institutional investors (pre-Series B), pioneering proprietary cell and gene-engineering platforms to advance personalized medicine. He also advises Lumen Energy, a company that uses a radically simplified approach to deploy commercial solar. Lumen Energy makes it easy for building owners to get clean energy. Please support this podcast by checking out our sponsors. Episode Links: Steven Banerjee LinkedIn: https://www.linkedin.com/in/steven-banerjee/ Steven Banerjee Website: https://www.nextnetinc.com/ Podcast Details: Podcast website: https://www.humainpodcast.com Apple Podcasts: https://podcasts.apple.com/us/podcast/humain-podcast-artificial-intelligence-data-science/id1452117009 Spotify: https://open.spotify.com/show/6tXysq5TzHXvttWtJhmRpS RSS: https://feeds.redcircle.com/99113f24-2bd1-4332-8cd0-32e0556c8bc9 YouTube Full Episodes: https://www.youtube.com/channel/UCxvclFvpPvFM9_RxcNg1rag YouTube Clips: https://www.youtube.com/channel/UCxvclFvpPvFM9_RxcNg1rag/videos Support and Social Media: – Check out the sponsors above, it's the best way to support this podcast – Support on Patreon: https://www.patreon.com/humain/creators – Twitter: https://twitter.com/dyakobovitch – Instagram: https://www.instagram.com/humainpodcast/ – LinkedIn: https://www.linkedin.com/in/davidyakobovitch/ – Facebook: https://www.facebook.com/HumainPodcast/ – HumAIn Website Articles: https://www.humainpodcast.com/blog/ Outline: Here are the timestamps for the episode: (05:20) - I am a mechanical engineer by training, and I started my graduate research in semiconductor technologies with applications in biotech more than a decade ago, in the early 2010s. I was a Doctoral Fellow at IBM labs here in San Jose, California, and I also ended up writing some successful federal grants with a gene sequencing pioneer at Stanford, Ron Davis, before going to UC Berkeley for grad school research and then becoming a visiting researcher. (09:28) - The average cost of bringing a drug to market is around $2.6 billion. It takes around 10 to 15 years from the earliest days of discovery to launching into the market, and unfortunately more than 96% of all drug R&D actually fails. This is a really bad social model. It creates an enormous burden on our society and our healthcare spending as well. One of the reasons I started NExTNet was that, while running Mekonos, I kept seeing a tremendous pain point among our customers: subject matter experts and scientists are actually working with very little of the available biomedical evidence out there, and a lot of the time that leads to false discoveries. (13:40) - There is this plethora of bioinformatics tools, software, and databases out there that are plagued with program bugs. They mostly lack documentation, or have very complicated documentation and, at best, very technical UIs.
And for an average scientist or an average person in this industry, you really need a fairly deep grasp of database schemas, SQL querying, statistical modeling, coding, and data science. (22:36) - A transformer is potentially one of the greatest breakthroughs that has happened in NLP recently. It's basically a neural net architecture that was incorporated into NLP models by Google Brain researchers in 2017 and 2018. Before transformers, the state-of-the-art models in NLP were LSTMs: long short-term memory networks were the widely used architecture. (27:24) - With Sapiens, our goal is to make biomedical data accessible and useful for scientific inquiry, so that the average person in industry, say a wet-lab or dry-lab scientist, a VP of R&D, a CSO, or a director of research, can ask and answer complex biological questions and better frame hypotheses to understand very complex, multifactorial diseases. A lot of the insights that Sapiens extracts from publicly available data sources are proprietary to the company. You can then upload and map your own internal data and begin to contextualize all that information within Sapiens. (31:34) - We are definitely looking for early adopters. This includes biotech companies, pharma, and academic research labs that would like to test out Sapiens and make it part of their biomedical R&D journey. We're also, as I said, looking for investors who would like to partner with us as we continue building what is probably one of the most sophisticated natural language based platforms, or, as we call it, an explainable AI platform. Advertising Inquiries: https://redcircle.com/brands Privacy & Opt-Out: https://redcircle.com/privacy
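Since the excerpt contrasts LSTMs with the transformer architecture that displaced them, a minimal sketch of scaled dot-product self-attention, the transformer's core operation, may help; the NumPy code below uses toy shapes and random weights and is purely illustrative.

```python
# Minimal sketch of scaled dot-product self-attention, the operation the episode
# credits with displacing LSTMs in NLP. Shapes and weights are toy values.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) token embeddings; returns contextualized vectors."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # project to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])         # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ v                               # each token attends to all others at once

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))                          # 5 tokens; unlike an LSTM, no recurrence needed
out = self_attention(x, rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(out.shape)                                     # (5, 8)
```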
I show how to use car listing data to predict price --- Send in a voice message: https://anchor.fm/david-nishimoto/message
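The episode itself walks through the prediction; as a purely hypothetical sketch of the general approach (not the author's code, with invented column names and data), predicting price from listing features could look like this:

```python
# Hypothetical sketch of predicting car prices from listing features
# (not the code from the episode; columns and values are invented).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

listings = pd.DataFrame({
    "year":    [2015, 2018, 2012, 2020, 2016, 2019],
    "mileage": [60000, 30000, 110000, 15000, 70000, 25000],
    "engine":  [1.6, 2.0, 1.4, 2.5, 1.8, 2.0],
    "price":   [9500, 17800, 5200, 26500, 11200, 21000],
})

X_train, X_test, y_train, y_test = train_test_split(
    listings[["year", "mileage", "engine"]], listings["price"],
    test_size=0.33, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print(model.predict(X_test))   # predicted prices for the held-out listings
```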
I explain how to combine multiple rule-based features into a signal and use it to track the accumulated value of an initial investment --- Send in a voice message: https://anchor.fm/david-nishimoto/message
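As with the previous entry, this is a hypothetical sketch of the idea rather than the episode's code: a few rule-based features are combined into a signal, and the accumulated value of an initial investment is tracked over synthetic prices.

```python
# Hypothetical sketch: combine rule-based features into a buy signal and
# track the accumulated value of an initial investment (synthetic data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
prices = pd.Series(100 * np.cumprod(1 + rng.normal(0, 0.01, 252)))  # one year of synthetic prices

# Rule-based features (each rule yields a 0/1 vote).
above_ma = (prices > prices.rolling(20).mean()).astype(int)
momentum = (prices.pct_change(5) > 0).astype(int)
low_vol  = (prices.pct_change().rolling(10).std() < 0.012).astype(int)

# Aggregate rule: invested only when at least two rules agree (lagged by one day).
signal = ((above_ma + momentum + low_vol) >= 2).astype(int).shift(1).fillna(0)

initial = 10_000
daily_return = prices.pct_change().fillna(0)
accumulation = initial * (1 + signal * daily_return).cumprod()
print(f"final value of the initial investment: {accumulation.iloc[-1]:.2f}")
```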
Predicting with LSTM (long short-term memory) networks --- Send in a voice message: https://anchor.fm/david-nishimoto/message
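For readers who want to try the technique named in the title, here is a minimal, self-contained sketch of next-value prediction with an LSTM in PyTorch; it is an illustration under assumed names and toy data, not the episode's code.

```python
# Minimal sketch (assumptions only): predicting the next value of a sequence
# with an LSTM in PyTorch, trained on sliding windows over a sine wave.
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):               # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])    # predict the value following the window

# Toy data: sliding windows over a sine wave.
series = torch.sin(torch.linspace(0, 20, 500))
window = 30
X = torch.stack([series[i:i + window] for i in range(len(series) - window)]).unsqueeze(-1)
y = series[window:].unsqueeze(-1)

model = LSTMForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    loss = nn.functional.mse_loss(model(X), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(model(X[-1:]).item(), y[-1].item())   # predicted vs. actual next value
```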
Sepp Hochreiter heads the Institute for Machine Learning, the LIT AI Lab and the AUDI.JKU deep learning center at the Johannes Kepler University of Linz in Austria, and is director of the Institute of Advanced Research in Artificial Intelligence (IARAI). Sepp Hochreiter is a pioneer of Deep Learning. His contributions to the Long Short-Term Memory (LSTM) and the analysis of the vanishing gradient are viewed as milestones and key moments in the history of both machine learning and Deep Learning. Sepp laid the foundations for Deep Learning in two ways: his seminal works on the vanishing gradient and on the Long Short-Term Memory (LSTM) were the starting points for what later became known as Deep Learning. LSTM has been overwhelmingly successful in handwriting recognition and generation, language modeling and identification, automatic language translation, speech recognition, analysis of audio data, as well as analysis, annotation, and description of video data.
LSTM --- Send in a voice message: https://anchor.fm/david-nishimoto/message
Neurons predictors --- Send in a voice message: https://anchor.fm/david-nishimoto/message
Thoughts on data center stock --- Send in a voice message: https://anchor.fm/david-nishimoto/message
Topics discussed: the book "Introduction to Natural Language Processing with BERT: Practical Programming with Transformers"; What is natural language processing?; What are its application areas? (machine translation, email filtering, chatbots); How has NLP developed?; What is a neural network (NN)?; AlexNet, a major breakthrough for NNs; End-to-end processing; What does "deep" mean in a deep NN?; What does training actually involve?; What is an RNN (Recurrent Neural Network)?; What problems were RNNs meant to solve?; How an RNN processes input; The limitations of RNNs; LSTM (Long Short-Term Memory); Characteristics of LSTMs and how they take context into account; How is text fed into a neural network?; Representing words as vectors; How is ELMo related to LSTMs?; Why is using bidirectional information effective?; Sesame Street; The arrival of Transformers; What is self-attention?; The computational-efficiency advantage of self-attention; How are attention and the Transformer related?; What is BERT (Bidirectional Encoder Representations from Transformers)?; What are encoders and decoders?; What are BERT's distinguishing features?; Structuring news articles with BERT: extracting company names; Where is the NLP field headed?; GitHub Copilot; Recruit; Stockmark Inc.
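One of the listed topics is structuring news articles with BERT by extracting company names; a small illustrative example using the Hugging Face transformers library is sketched below. The model name is simply a publicly available NER checkpoint chosen for illustration, not necessarily anything used in the episode, and the sample sentence is invented.

```python
# Illustrative sketch of the kind of task discussed in the episode: extracting
# organization names from news text with a BERT-based NER model.
# The checkpoint below is just a public example, not the episode's model.
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",        # a BERT model fine-tuned for NER
               aggregation_strategy="simple")      # merge word pieces into whole entities

text = "Stockmark and Recruit announced a joint research project on language models."
for entity in ner(text):
    if entity["entity_group"] == "ORG":            # keep only organization names
        print(entity["word"], round(entity["score"], 3))
```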