Podcast appearances and mentions of George Hotz

  • 76 podcasts
  • 113 episodes
  • 1h 6m average duration
  • Infrequent episodes
  • Latest episode: Mar 20, 2025
George Hotz

POPULARITY

(popularity chart, 2017-2024)


Latest podcast episodes about George Hotz

Bitcoin Takeover Podcast
S16 E13: Corbin Fraser on Bitcoin.com & BTC vs BCH

Bitcoin Takeover Podcast

Mar 20, 2025 · 155:15


Corbin Fraser is the new CEO of Bitcoin.com. In this episode, he talks about the Scaling Wars, his perspective as a BCH advocate, and the lessons the market taught him over the years. Also, he explains why the news section of Bitcoin.com improved. Introducing Corbin Fraser (00:00:50) Bitcoin.com's Evolution (00:01:52) Support for Multiple Cryptocurrencies (00:03:05) Challenges of Tribalism in Crypto (00:06:51) Origins of Bitcoin Cash (00:09:07) The Scaling Wars (00:12:14) Misconceptions About BCH Leadership (00:15:23) The Meme Warfare (00:18:31) Promoting Token Swaps (00:20:09) Bitcoin.com News and Geopolitics (00:21:16) Respect for Peter Schiff (00:22:01) Occupy Wall Street Insights (00:22:12) Bitcoin Cash vs Bitcoin Debate (00:24:25) Vitalik's Perspective on Forks (00:25:04) Nomenclature and Propaganda (00:26:17) Cola Wars Analogy (00:28:31) Future of Bitcoin Block Size (00:29:56) Lightning Network Adoption Challenges (00:30:04) Alternative Layer 2 Solutions (00:31:27) Lightning Network's Technical Limits (00:32:33) Bitcoin SV and Centralization Concerns (00:33:42) Impact of Community Divisions (00:34:44) Chaos of Forks in 2017 (00:36:58) Bitcoin.com's Business Approach (00:40:04) Lightning Network Whitepaper Block Size (00:41:06) Discussion on AI and Block Size (00:42:39) Satoshi's Misjudgments (00:43:46) Censorship and Community Chaos (00:45:34) Personal Losses During the Block Size Wars (00:47:08) Bitcoin Cash's Price Surge to $3000 (00:47:38) Miner Hesitation (00:48:30) Current Features of BCH (00:51:55) Emergence of Cash Tokens & Failure of SLP (00:56:02) Building on UTXO (01:00:18) Ethereum's Rise Post-Scaling Wars (01:02:10) SushiSwap & DeFi Summer (01:03:53) Vampire Liquidity Attack (01:05:08) Speculation Around Chef Nomi (01:06:27) Development of Aave & Compound (01:07:30) Community Dynamics in BCH (01:08:32) Debate on Development Tax (01:09:43) Amaury Sechet's Influence and Community Split (01:10:14) Forking and Community Fragmentation (01:12:12) Transition to Ethereum and DeFi (01:14:39) The Role of Multi-Chain Strategies (01:19:12) Privacy and Zano (01:20:41) Operational Security and Privacy in Crypto (01:23:09) Government Regulation and Crypto Development (01:24:31) Zano's Hybrid PoW/PoS Consensus (01:25:56) Fact-Checking from Bitcoin Cash Podcast (01:27:13) Bitcoin Cash Upgrade Schedule (01:28:14) Twitter Spaces Collaboration (01:28:47) Privacy Solutions in BCH (01:29:29) Mike Hearn's Prediction on Forks (01:30:05) Misleading Media Influence (01:30:57) Dollar Cost Averaging Strategy (01:31:50) Self-Custody Solutions (01:32:02) Ownership of Bitcoin.com (01:33:54) Roger Ver's Contributions (01:34:22) Support for Roger Ver (01:36:25) Government's Stance on Crypto (01:38:47) Reading Roger's Defense (01:39:59) Misinterpretations of Roger's Position (01:41:08) Evolution of Roger Ver's Views (01:42:31) Market Humbling Experiences (01:42:54) Elon Musk's Influence (01:43:37) Current Crypto Market Sentiment (01:45:01) Celebrity Influence on Crypto (01:46:05) Kanye West's Coin Speculation (01:47:01) Kanye's Meme Coin Ventures (01:48:05) The Decline of Meme Coin Interest (01:49:16) Overvaluation in Crypto Projects (01:50:50) Block Size Wars Reflection (01:51:48) State Actors and Bitcoin's Challenges (01:52:51) Blockstream's Business Struggles (01:54:41) Developer Commitment and Innovation (01:56:34) Stablecoins vs. 
Bitcoin Adoption (02:00:05) Future of Stablecoins and CBDCs (02:01:33) Block Size Debate and Social Consensus (02:04:08) George Hotz's Perspective on Innovation (02:06:48) Ethereum's Rapid Development (02:09:07) Ethereum's Challenges (02:11:29) Bitcoin's Cautious Approach (02:12:05) Testing Protocols (02:13:15) User Education on Risks (02:15:35) Brock Pierce's Influence (02:16:21) Community Engagement (02:18:17) Listing Criteria for New Coin on Bitcoin dot com Wallet (02:20:34) Integration of Privacy Coins (02:22:58) Zano's Potential (02:24:01) User Confusion with Bitcoin Domains (02:28:21) BSV Integration Decision (02:30:36) Balanced News Coverage (02:32:04) Introduction to New Developments (02:33:07) Embracing a Broader Crypto Perspective (02:34:05) Closing Remarks (02:34:46) Final Thoughts and Future Plans (02:35:03)

Autonocast
#317: How The Tesla Autopilot Cannonball Run Record Was Broken

Autonocast

Oct 22, 2024 · 57:28


George Hotz hasn't won self-driving yet, but Cannonballer Jay Roberts just used a Comma Three and a Toyota Prius to shatter Alex Roy's 55-hour Autopilot record in 43 hours, 18 minutes. What's the difference between a semi-automated and driverless Cannonball record? Is the future of the Cannonball Run human-driven or autonomous? Is Comma AI's technology better than Tesla's Full Self-Driving? What did the police think? Why are records being shattered in every category this year? Will Alex go again? All this and more on one of our most technology-packed episodes of all time!

The Jim Rutt Show
EP 263 Evan McMullen on Self-Driving Cars

The Jim Rutt Show

Oct 9, 2024 · 77:09


Jim talks with Evan McMullen about the state of self-driving car technology, with a special focus on simulators. They discuss the purpose of simulators, levels of simulation, how the world is modeled, gradually ramping up the complexity of the testing world, Tesla's approach, hardware-in-the-loop testing, Waymo's first-mover advantage, simulating the availability of a human intervener, driverless solutions vs driver aid, simultaneous localization and mapping (SLAM), the question of which theories of ethics to use, international standards for functional safety, a liability shield equilibrium, tool-sharing between companies, open source simulators, NVIDIA's DRIVE Sim & other players, standards for interoperability, incentives for cooperation between companies, hardware accuracy, edge case generation, evaluating current offerings for consumers, vibrational tactile feedback vs heads-up displays, when we'll be able to read a book in a self-driving car, and much more. Episode Transcript JRS EP 94 - Shahin Farshchi on Self-Driving Tech JRS EP 124 - Jim Hackett on Ford, Electric Cars & More JRS EP 221 - George Hotz on Open-Source Driving Assistance Evan McMullen is a mechatronics engineer at dSPACE, a leading provider of hardware and software simulation tools for the auto industry.

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Thank you for 1m downloads of the podcast and 2m readers of the Substack!

The Jim Rutt Show
EP 244 Samo Burja on Lessons from the Russo-Ukrainian War

The Jim Rutt Show

Jun 27, 2024 · 87:03


Jim talks with Samo Burja about lessons military strategists should take from the Russo-Ukrainian War so far. They discuss why military stockpiles are less useful than previously assumed, the scaling up of drone production, the impossibility of envisioning what tech will be needed, 4 factors that caused Russian miscalculation, offensive vs defensive dominance, the possibility of a U.S. military draft, the changing role of conscription, the high average age in Russia & Ukraine, the rapid evolution of drones, a comparison between drone pilots & snipers, the muted relevance of the air force, empty symbols of military strength, the progress of autonomous drones, the reevaluation of civilian casualties with changing tech, the information complexity of drone warfare, the importance of artillery, the need for a new George Marshall figure in the U.S., a war of production, how the Ukraine War can inform the Taiwan situation, the idea of an amphibious assault, autonomous submersible vehicles, and much more. JRS EP 243 - Yaroslav Trofimov on Ukraine's War of Independence JRS EP 221 - George Hotz on Open-Source Driving Assistance Samo Burja is the founder and President of Bismarck Analysis, a consulting firm that specializes in institutional analysis for clients in North America and Europe. Bismarck uses the foundational sociological research that Samo and his team have conducted over the past decade to deliver unique insights to clients about institutional design and strategy. Samo's studies focus on the social and material technologies that provide the foundation for healthy human societies, with an eye to engineering and restoring the structures that produce functional institutions. He has authored articles and papers on his findings. His manuscript, Great Founder Theory, is available online. He is also a Research Fellow at the Long Now Foundation and Senior Research Fellow in Political Science at the Foresight Institute. Samo has spoken about his findings at the World Economic Forum at Davos, Y Combinator's YC 120 conference, the Reboot American Innovation conference in Washington, D.C., and elsewhere. He spends most of his time in California and his native Slovenia.

LessWrong Curated Podcast
The Best Tacit Knowledge Videos on Every Subject

LessWrong Curated Podcast

Apr 1, 2024 · 14:44


TL;DR: Tacit knowledge is extremely valuable. Unfortunately, developing tacit knowledge is usually bottlenecked by apprentice-master relationships. Tacit Knowledge Videos could widen this bottleneck. This post is a Schelling point for aggregating these videos—aiming to be The Best Textbooks on Every Subject for Tacit Knowledge Videos. Scroll down to the list if that's what you're here for. Post videos that highlight tacit knowledge in the comments and I'll add them to the post. Experts in the videos include Stephen Wolfram, Holden Karnofsky, Andy Matuschak, Jonathan Blow, George Hotz, and others. What are Tacit Knowledge Videos? Samo Burja claims YouTube has opened the gates for a revolution in tacit knowledge transfer. Burja defines tacit knowledge as follows: Tacit knowledge is knowledge that can't properly be transmitted via verbal or written instruction, like the ability to create great art or assess a startup. This tacit knowledge is a form of intellectual [...] The original text contained 1 footnote which was omitted from this narration. --- First published: March 31st, 2024 Source: https://www.lesswrong.com/posts/SXJGSPeQWbACveJhs/the-best-tacit-knowledge-videos-on-every-subject --- Narrated by TYPE III AUDIO.

The Nonlinear Library
LW - The Best Tacit Knowledge Videos on Every Subject by Parker Conley

The Nonlinear Library

Mar 31, 2024 · 12:18


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Best Tacit Knowledge Videos on Every Subject, published by Parker Conley on March 31, 2024 on LessWrong. TL;DR Tacit knowledge is extremely valuable. Unfortunately, developing tacit knowledge is usually bottlenecked by apprentice-master relationships. Tacit Knowledge Videos could widen this bottleneck. This post is a Schelling point for aggregating these videos - aiming to be The Best Textbooks on Every Subject for Tacit Knowledge Videos. Scroll down to the list if that's what you're here for. Post videos that highlight tacit knowledge in the comments and I'll add them to the post. Experts in the videos include Stephen Wolfram, Holden Karnofsky, Andy Matuschak, Jonathan Blow, George Hotz, and others. What are Tacit Knowledge Videos? Samo Burja claims YouTube has opened the gates for a revolution in tacit knowledge transfer. Burja defines tacit knowledge as follows: Tacit knowledge is knowledge that can't properly be transmitted via verbal or written instruction, like the ability to create great art or assess a startup. This tacit knowledge is a form of intellectual dark matter, pervading society in a million ways, some of them trivial, some of them vital. Examples include woodworking, metalworking, housekeeping, cooking, dancing, amateur public speaking, assembly line oversight, rapid problem-solving, and heart surgery. In my observation, domains like housekeeping and cooking have already seen many benefits from this revolution. Could tacit knowledge in domains like research, programming, mathematics, and business be next? I'm not sure, but maybe this post will help push the needle forward. For the purpose of this post, Tacit Knowledge Videos are any video that communicates "knowledge that can't properly be transmitted via verbal or written instruction". Here are some examples: Neel Nanda, who leads the Google DeepMind mechanistic interpretability team, has a playlist of "Research Walkthroughs". AI Safety research is discussed a lot around here. Watching research videos could help instantiate what AI research really looks and feels like. GiveWell has public audio recordings of its Board Meetings from 2007-2020. Participants include Elie Hassenfeld, Holden Karnofsky, Timothy Ogden, Rob Reich, Tom Rutledge, Brigid Slipka, Cari Tuna, Julia Wise, and others. Influential business meetings are not usually made public. I feel I have learned some about business communication and business operations, among other things, by listening to these recordings. Andy Matuschak recorded himself studying Quantum Mechanics with Dwarkesh Patel and doing research. Andy Matushak "helped build iOS at Apple and led R&D at Khan Academy". I found it interesting to have a peek into Matushak's spaced repetition practice and various studying heuristics and habits, as well as his process of digesting and taking notes on papers. Call to Action Share links to Tacit Knowledge Videos below! Share them frivolously! These videos are uncommon - the bottleneck to the YouTube knowledge transfer revolution is quantity, not quality. I will add the shared videos to the post. Here are the loose rules: Recall a video that you've seen that communicates tacit knowledge - "knowledge that can't properly be transmitted via verbal or written instruction". A rule of thumb for sharing: could a reader find this video through one or two YouTube searches? If not, share it. 
Post the title and the URL of the video. Provide information indicating why the expert in the video is credible. (However, don't let this last rule stop you from sharing a video! Again - quantity, not quality.)[1] For information on how to best use these videos, Cedric Chin and Jacob Steinhardt have some potentially relevant practical advice. Andy Matushak also has some working notes about this idea generally. Additionally, DM or email me (email in L...


The top AI news from the past week, every ThursdAI

Hello hello everyone, happy spring! Can you believe it? It's already spring! We have tons of AI news for you to cover, starting with the most impactful one, did you already use Claude 3? Anthropic decided to celebrate Claude 1's birthday early (which btw is also ThursdAI's birthday and GPT-4 release date, March 14th, 2023) and gave us 3 new Claudes! Opus, Sonnet and Haiku. TL;DR of all topics covered: * Big CO LLMs + APIs*

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Speaker CFPs and Sponsor Guides are now available for AIE World's Fair — join us on June 25-27 for the biggest AI Engineer conference of 2024!

Soumith Chintala needs no introduction in the ML world — his insights are incredibly accessible across Twitter, LinkedIn, podcasts, and conference talks (in this pod we'll assume you'll have caught up on the History of PyTorch pod from last year and cover different topics). He's well known as the creator of PyTorch, but he's more broadly the Engineering Lead on AI Infra, PyTorch, and Generative AI at Meta. Soumith was one of the earliest supporters of Latent Space (and more recently AI News), and we were overjoyed to catch up with him on his latest SF visit for a braindump of the latest AI topics, reactions to some of our past guests, and why Open Source AI is personally so important to him.

Life in the GPU-Rich Lane

Back in January, Zuck went on Instagram to announce their GPU wealth: by the end of 2024, Meta will have 350k H100s. By adding all their GPU clusters, you'd get to 600k H100-equivalents of compute. At FP16 precision, that's ~1,200,000 PFLOPS. If we used George Hotz's (previous guest!) "Person of Compute" measure of roughly 20 PFLOPS per person, Meta now has 60k humans of compute in their clusters. Occasionally we get glimpses into the GPU-rich life; on a recent ThursdAI chat, swyx prompted PaLM tech lead Yi Tay to write down what he missed most from Google, and he commented that UL2 20B was trained by accidentally leaving the training job running for a month, because hardware failures are so rare in Google.

Meta AI's Epic LLM Run

Before Llama broke the internet, Meta released an open source LLM in May 2022, OPT-175B, which was notable for how “open” it was - right down to the logbook! It was trained on 992 80GB NVIDIA A100 GPUs, and Soumith agrees that, with hindsight, it was likely under-trained for its parameter size. In Feb 2023 (pre Latent Space pod), Llama was released, with a 7B version trained on 1T tokens alongside 65B and 33B versions trained on 1.4T tokens. The Llama authors included Guillaume Lample and Timothée Lacroix, who went on to start Mistral. July 2023 was Llama2 time (which we covered!): 3 model sizes, 7B, 13B, and 70B, all trained on 2T tokens. The three models accounted for a grand total of 3,311,616 GPU hours for all pre-training work. CodeLlama followed shortly after, a fine-tune of Llama2 specifically focused on code generation use cases. The family had models in the 7B, 13B, 34B, and 70B size, all trained with 500B extra tokens of code and code-related data, except for 70B which is trained on 1T. All of this on top of other open sourced models like Segment Anything (one of our early hits!), Detectron, Detectron 2, DensePose, and Seamless, and in one year, Meta transformed from a company people made fun of for its “metaverse” investments to one of the key players in the AI landscape and its stock has almost tripled since (about $830B in market value created in the past year).

Why Open Source AI

The obvious question is why Meta would spend hundreds of millions on its AI efforts and then release them for free. Zuck has addressed this in public statements. But for Soumith, the motivation is even more personal:

“I'm irrationally interested in open source. I think open source has that fundamental way to distribute opportunity in a way that is very powerful. Like, I grew up in India… And knowledge was very centralized, but I saw that evolution of knowledge slowly getting decentralized. And that ended up helping me learn quicker and faster for like zero dollars.
And I think that was a strong reason why I ended up where I am. So like that, like the open source side of things, I always push regardless of like what I get paid for, like I think I would do that as a passion project on the side……I think at a fundamental level, the most beneficial value of open source is that you make the distribution to be very wide. It's just available with no friction and people can do transformative things in a way that's very accessible. Maybe it's open source, but it has a commercial license and I'm a student in India. I don't care about the license. I just don't even understand the license. But like the fact that I can use it and do something with it is very transformative to me……Like, okay, I again always go back to like I'm a student in India with no money. What is my accessibility to any of these closed source models? At some scale I have to pay money. That makes it a non-starter and stuff. And there's also the control issue: I strongly believe if you want human aligned AI, you want all humans to give feedback. And you want all humans to have access to that technology in the first place. And I actually have seen, living in New York, whenever I come to Silicon Valley, I see a different cultural bubble.We like the way Soumith put it last year: Closed AI “rate-limits against people's imaginations and needs”!What It Takes For Open Source AI to WinHowever Soumith doesn't think Open Source will simply win by popular demand. There is a tremendous coordination problem with the decentralized nature of the open source AI development right now: nobody is collecting the valuable human feedback in the way that OpenAI or Midjourney are doing.“Open source in general always has a coordination problem. If there's a vertically integrated provider with more resources, they will just be better coordinated than open source. And so now open source has to figure out how to have coordinated benefits. And the reason you want coordinated benefits is because these models are getting better based on human feedback. And if you see with open source models, like if you go to the /r/localllama subreddit, like there's so many variations of models that are being produced from, say, Nous research. I mean, like there's like so many variations built by so many people. And one common theme is they're all using these fine-tuning or human preferences datasets that are very limited and they're not sufficiently diverse. And you look at the other side, say front-ends like Oobabooga or like Hugging Chat or Ollama, they don't really have feedback buttons. All the people using all these front-ends, they probably want to give feedback, but there's no way for them to give feedback… So we're just losing all of this feedback. Maybe open source models are being as used as GPT is at this point in like all kinds of, in a very fragmented way, like in aggregate all the open source models together are probably being used as much as GPT is, maybe close to that. But the amount of feedback that is driving back into the open source ecosystem is like negligible, maybe less than 1% of like the usage. 
So I think like some, like the blueprint here I think is you'd want someone to create a sinkhole for the feedback… I think if we do that, if that actually happens, I think that probably has a real chance of the open source models having a runaway effect against OpenAI, I think like there's a clear chance we can take at truly winning open source.”

If you're working on solving open source coordination, please get in touch!

Show Notes

* Soumith Chintala Twitter
* History of PyTorch episode on Gradient Podcast
* The Llama Ecosystem
* Apple's MLX
* Neural ODEs (Ordinary Differential Equations)
* AlphaGo
* LMSys arena
* Dan Pink's "Drive"
* Robotics projects:
* Dobb-E
* OK Robot
* Yann LeCun
* Yangqing Jia of Lepton AI
* Ed Catmull
* George Hotz on Latent Space
* Chris Lattner on Latent Space
* Guillaume Lample
* Yannic Kilcher of OpenAssistant
* LMSys
* Alex Atallah of OpenRouter
* Carlo Sferrazza's 3D tactile research
* Alex Wiltschko of Osmo
* Tangent by Alex Wiltschko
* Lerrel Pinto - Robotics

Timestamps

* [00:00:00] Introductions
* [00:00:51] Extrinsic vs Intrinsic Success
* [00:02:40] Importance of Open Source and Its Impact
* [00:03:46] PyTorch vs TinyGrad
* [00:08:33] Why PyTorch is the Switzerland of frameworks
* [00:10:27] Modular's Mojo + PyTorch?
* [00:13:32] PyTorch vs Apple's MLX
* [00:16:27] FAIR / PyTorch Alumni
* [00:18:50] How can AI inference providers differentiate?
* [00:21:41] How to build good benchmarks and learnings from AnyScale's
* [00:25:28] Most interesting unexplored ideas
* [00:28:18] What people get wrong about synthetic data
* [00:35:57] Meta AI's evolution
* [00:38:42] How do you allocate 600,000 GPUs?
* [00:42:05] Even the GPU Rich are GPU Poor
* [00:47:31] Meta's MTIA silicon
* [00:50:09] Why we need open source
* [00:59:00] Open source's coordination problem for feedback gathering
* [01:08:59] Beyond text generation
* [01:15:37] Osmo and the Future of Smell Recognition Technology

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:15]: Hey, and today we have in the studio Soumith Chintala, welcome. Soumith [00:00:17]: Thanks for having me. Swyx [00:00:18]: On one of your rare visits from New York where you live. You got your start in computer vision at NYU with Yann LeCun. That was a very fortuitous start. I was actually listening to your interview on the Gradient podcast. So if people want to know more about the history of Soumith, history of PyTorch, they can go to that podcast. We won't spend that much time there, but I just was marveling at your luck, or I don't know if it's your luck or your drive to find AI early and then find the right quality mentor because I guess Yann really sort of introduced you to that world. Soumith [00:00:51]: Yeah, I think you're talking about extrinsic success, right? A lot of people just have drive to do things that they think is fun, and a lot of those things might or might not be extrinsically perceived as good and successful. I think I just happened to like something that is now one of the coolest things in the world or whatever. But if I happen, the first thing I tried to become was a 3D VFX artist, and I was really interested in doing that, but I turned out to be very bad at it. So I ended up not doing that further. But even if I was good at that, whatever, and I ended up going down that path, I probably would have been equally happy.
It's just like maybe like the perception of, oh, is this person successful or not might be different. I think like after a baseline, like your happiness is probably more correlated with your intrinsic stuff.Swyx [00:01:44]: Yes. I think Dan Pink has this book on drive that I often refer to about the power of intrinsic motivation versus extrinsic and how long extrinsic lasts. It's not very long at all. But anyway, now you are an investor in Runway, so in a way you're working on VFX. Yes.Soumith [00:02:01]: I mean, in a very convoluted way.Swyx [00:02:03]: It reminds me of Ed Catmull. I don't know if you guys know, but he actually tried to become an animator in his early years and failed or didn't get accepted by Disney and then went and created Pixar and then got bought by Disney and created Toy Story. So you joined Facebook in 2014 and eventually became a creator and maintainer of PyTorch. And there's this long story there you can refer to on the gradient. I think maybe people don't know that you also involved in more sort of hardware and cluster decision affair. And we can dive into more details there because we're all about hardware this month. Yeah. And then finally, I don't know what else, like what else should people know about you on a personal side or professional side?Soumith [00:02:40]: I think open source is definitely a big passion of mine and probably forms a little bit of my identity at this point. I'm irrationally interested in open source. I think open source has that fundamental way to distribute opportunity in a way that is very powerful. Like, I grew up in India. I didn't have internet for a while. In college, actually, I didn't have internet except for GPRS or whatever. And knowledge was very centralized, but I saw that evolution of knowledge slowly getting decentralized. And that ended up helping me learn quicker and faster for zero dollars. And I think that was a strong reason why I ended up where I am. So the open source side of things, I always push regardless of what I get paid for, like I think I would do that as a passion project on the side.Swyx [00:03:35]: Yeah, that's wonderful. Well, we'll talk about the challenges as well that open source has, open models versus closed models. Maybe you want to touch a little bit on PyTorch before we move on to the sort of Meta AI in general.PyTorch vs Tinygrad tradeoffsAlessio [00:03:46]: Yeah, we kind of touched on PyTorch in a lot of episodes. So we had George Hotz from TinyGrad. He called PyTorch a CISC and TinyGrad a RISC. I would love to get your thoughts on PyTorch design direction as far as, I know you talk a lot about kind of having a happy path to start with and then making complexity hidden away but then available to the end user. One of the things that George mentioned is I think you have like 250 primitive operators in PyTorch, I think TinyGrad is four. So how do you think about some of the learnings that maybe he's going to run into that you already had in the past seven, eight years almost of running PyTorch?Soumith [00:04:24]: Yeah, I think there's different models here, but I think it's two different models that people generally start with. Either they go like, I have a grand vision and I'm going to build a giant system that achieves this grand vision and maybe one is super feature complete or whatever. Or other people say they will get incrementally ambitious, right? 
And they say, oh, we'll start with something simple and then we'll slowly layer out complexity in a way that optimally applies Huffman coding or whatever. Like where the density of users are and what they're using, I would want to keep it in the easy, happy path and where the more niche advanced use cases, I'll still want people to try them, but they need to take additional frictional steps. George, I think just like we started with PyTorch, George started with the incrementally ambitious thing. I remember TinyGrad used to be, like we would be limited to a thousand lines of code and I think now it's at 5,000. So I think there is no real magic to why PyTorch has the kind of complexity. I think it's probably partly necessitated and partly because we built with the technology available under us at that time, PyTorch is like 190,000 lines of code or something at this point. I think if you had to rewrite it, we would probably think about ways to rewrite it in a vastly simplified way for sure. But a lot of that complexity comes from the fact that in a very simple, explainable way, you have memory hierarchies. You have CPU has three levels of caches and then you have DRAM and SSD and then you have network. Similarly, GPU has several levels of memory and then you have different levels of network hierarchies, NVLink plus InfiniBand or RoCE or something like that, right? And the way the flops are available on your hardware, they are available in a certain way and your computation is in a certain way and you have to retrofit your computation onto both the memory hierarchy and like the flops available. When you're doing this, it is actually a fairly hard mathematical problem to do this setup, like you find the optimal thing. And finding the optimal thing is, what is optimal depends on the input variables themselves. So like, okay, what is the shape of your input tensors and what is the operation you're trying to do and various things like that. Finding that optimal configuration and writing it down in code is not the same for every input configuration you have. Like for example, just as the shape of the tensors change, let's say you have three input tensors into a Sparstar product or something like that. The shape of each of these input tensors will vastly change how you do this optimally placing this operation onto the hardware in a way that will get you maximal throughput. So a lot of our complexity comes from writing out hundreds of configurations for each single PyTorch operator and templatizing these things and symbolically generating the final CUDA code or CPU code. There's no way to avoid it because mathematically we haven't found symbolic ways to do this that also keep compile time near zero. You can write a very simple framework, but then you also should be willing to eat the long compile time. So you're searching for that optimal performance at runtime, but that's the trade off. There's no, like, I don't think, unless we have great breakthroughs, George's vision is achievable; he should be thinking about a narrower problem such as I'm only going to make this work for self-driving car convnets or I'm only going to make this work for LLM transformers of the llama style. Like if you start narrowing the problem down, you can make a vastly simpler framework.
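As a rough illustration of the shape-dependent dispatch Soumith describes above, here is a toy Python sketch: the config names, tile sizes, and thresholds are made up for illustration and this is not PyTorch's actual dispatcher, just the general idea that the "best" kernel configuration is a function of the input shapes.

```python
# Toy illustration (not PyTorch internals): the "optimal" kernel config
# depends on input shapes, so a general framework ends up carrying many
# configurations per operator and picking one at dispatch time.

from dataclasses import dataclass

@dataclass(frozen=True)
class KernelConfig:
    tile_m: int
    tile_n: int
    use_shared_memory: bool

# A handful of hand-tuned configurations; a real framework has hundreds
# per operator, often templatized and code-generated.
CONFIGS = {
    "small": KernelConfig(tile_m=16, tile_n=16, use_shared_memory=False),
    "tall_skinny": KernelConfig(tile_m=128, tile_n=16, use_shared_memory=True),
    "square_large": KernelConfig(tile_m=64, tile_n=64, use_shared_memory=True),
}

def pick_matmul_config(m: int, n: int, k: int) -> KernelConfig:
    """Pick a config from the input shapes (a stand-in for the
    shape/dtype/layout-dependent heuristics a real dispatcher uses)."""
    if m * n * k < 1 << 18:   # tiny problem: launch overhead dominates
        return CONFIGS["small"]
    if m >= 8 * n:            # tall-skinny matmul
        return CONFIGS["tall_skinny"]
    return CONFIGS["square_large"]

if __name__ == "__main__":
    for shape in [(32, 32, 32), (4096, 64, 1024), (2048, 2048, 2048)]:
        print(shape, "->", pick_matmul_config(*shape))
```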
But if you don't, if you need the generality to power all of the AI research that is happening and keep zero compile time and in all these other factors, I think it's not easy to avoid the complexity.

Pytorch vs Mojo

Alessio [00:08:33]: That's interesting. And we kind of touched on this with Chris Lattner when he was on the podcast. If you think about frameworks, they have the model target. They have the hardware target. They have different things to think about. He mentioned when he was at Google, TensorFlow trying to be optimized to make TPUs go brr, you know, and go as fast. I think George is trying to make especially AMD stack be better than ROCm. How come PyTorch has been such a Switzerland versus just making Meta hardware go brr? Soumith [00:09:00]: First, Meta is not in the business of selling hardware. Meta is not in the business of cloud compute. The way Meta thinks about funding PyTorch is we're funding it because it's net good for Meta to fund PyTorch because PyTorch has become a standard and a big open source project. And generally it gives us a timeline edge. It gives us leverage and all that within our own work. So why is PyTorch more of a Switzerland rather than being opinionated? I think the way we think about it is not in terms of Switzerland or not. We actually the way we articulate it to all hardware vendors and software vendors and all who come to us being we want to build a backend in core for PyTorch and ship it by default is we just only look at our user side of things. Like if users are using a particular piece of hardware, then we want to support it. We very much don't want to king make the hardware side of things. So as the MacBooks have GPUs and as that stuff started getting increasingly interesting, we pushed Apple to push some engineers and work on the MPS support and we spend significant time from Meta funded engineers on that as well because a lot of people are using the Apple GPUs and there's demand. So we kind of mostly look at it from the demand side. We never look at it from like oh which hardware should we start taking opinions on. Swyx [00:10:27]: Is there a future in which, because Mojo or Modular Mojo is kind of a superset of Python, is there a future in which PyTorch might use Mojo features optionally? Soumith [00:10:36]: I think it depends on how well integrated it is into the Python ecosystem. So if Mojo is like a pip install and it's readily available and users feel like they can use Mojo so smoothly within their workflows in a way that just is low friction, we would definitely look into that. Like in the same way PyTorch now depends on Triton, OpenAI Triton, and we never had a conversation that was like huh, that's like a dependency. Should we just build a Triton of our own or should we use Triton? It almost doesn't, like those conversations don't really come up for us. The conversations are more well does Triton have 10,000 dependencies and is it hard to install? We almost don't look at these things from a strategic leverage point of view. We look at these things from a user experience point of view, like is it easy to install? Is it smoothly integrated and does it give enough benefits for us to start depending on it? If so, yeah, we should consider it. That's how we think about it. Swyx [00:11:37]: You're inclusive by default as long as it meets the minimum bar of, yeah, but like maybe I phrased it wrongly.
Maybe it's more like what problems would you look to solve that you have right now? Soumith [00:11:48]: I think it depends on what problems Mojo will be useful at. Swyx [00:11:52]: Mainly a performance pitch, some amount of cross compiling pitch. Soumith [00:11:56]: Yeah, I think the performance pitch for Mojo was like, we're going to be performant even if you have a lot of custom stuff, you're going to write arbitrary custom things and we will be performant. And that value proposition is not clear to us from the PyTorch side to consider it for PyTorch. So PyTorch, it's actually not 250 operators, it's like a thousand operators. PyTorch exposes about a thousand operators and people kind of write their ideas in the thousand operators of PyTorch. Mojo is like, well, maybe it's okay to completely sidestep those thousand operators of PyTorch and just write it in a more natural form. Just write raw Python, write for loops or whatever, right? So from the consideration of how do we intersect PyTorch with Mojo, I can see one use case where you have custom stuff for some parts of your program, but mostly it's PyTorch. And so we can probably figure out how to make it easier for say Torch.compile to smoothly also consume Mojo subgraphs and like, you know, the interoperability being actually usable, that I think is valuable. But Mojo as a fundamental front end would be replacing PyTorch, not augmenting PyTorch. So in that sense, I don't see a synergy in more deeply integrating Mojo.

Pytorch vs MLX

Swyx [00:13:21]: So call out to Mojo whenever they have written something in Mojo and there's some performance related thing going on. And then since you mentioned Apple, what should people think of PyTorch versus MLX? Soumith [00:13:32]: I mean, MLX is early and I know the folks well, Awni used to work at FAIR and I used to chat with him all the time. He used to be based out of New York as well. The way I think about MLX is that MLX is specialized for Apple right now. It has a happy path because it's defined its product in a narrow way. At some point MLX either says we will only be supporting Apple and we will just focus on enabling, you know, there's a framework if you use your MacBook, but once you like go server side or whatever, that's not my problem and I don't care. Or MLX enters the server side set of things as well. Like one of these two things will happen, right? If the first thing will happen, like MLX's overall addressable market will be small, but it'll probably do well within that addressable market. If it enters the second phase, they're going to run into all the same complexities that we have to deal with. They will not have any magic wand and they will have more complex work to do. They probably wouldn't be able to move as fast. Swyx [00:14:44]: Like having to deal with distributed compute? Soumith [00:14:48]: Distributed, NVIDIA and AMD GPUs, like just like having a generalization of the concept of a backend, how they treat compilation with plus overheads. Right now they're deeply assumed like the whole MPS graph thing. So they need to think about all these additional things if they end up expanding onto the server side and they'll probably build something like PyTorch as well, right? Like eventually that's where it will land. And I think there they will kind of fail on the lack of differentiation. Like it wouldn't be obvious to people why they would want to use it. Swyx [00:15:24]: I mean, there are some cloud companies offering M1 and M2 chips on servers.
I feel like it might be interesting for Apple to pursue that market, but it's not their core strength. Soumith [00:15:33]: Yeah. If Apple can figure out their interconnect story, maybe, like then it can become a thing. Swyx [00:15:40]: Honestly, that's more interesting than the cars. Yes. Soumith [00:15:43]: I think the moat that NVIDIA has right now, I feel is that they have the interconnect that no one else has, like AMD GPUs are pretty good. I'm sure there's various silicon that is not bad at all, but the interconnect, like NVLink is uniquely awesome. I'm sure the other hardware providers are working on it, but- Swyx [00:16:04]: I feel like when you say it's uniquely awesome, you have some appreciation of it that the rest of us don't. I mean, the rest of us just like, you know, we hear marketing lines, but what do you mean when you say NVIDIA is very good at networking? Obviously they made the acquisition maybe like 15 years ago. Soumith [00:16:15]: Just the bandwidth it offers and the latency it offers. I mean, TPUs also have a good interconnect, but you can't buy them. So you have to go to Google to use it.

PyTorch Mafia

Alessio [00:16:27]: Who are some of the other FAIR PyTorch alumni that are building cool companies? I know you have Fireworks AI, Lightning AI, Lepton, and Yangqing, you knew since college when he was building Caffe? Soumith [00:16:40]: Yeah, so Yangqing and I used to be framework rivals, PyTorch, I mean, we were all a very small close-knit community back then. Caffe, Torch, Theano, Chainer, Keras, various frameworks. I mean, it used to be more like 20 frameworks. I can't remember all the names. CCV by Liu Liu, who is also based out of SF. And I would actually like, you know, one of the ways it was interesting is you went into the framework guts and saw if someone wrote their own convolution kernel or they were just copying someone else's. There were four or five convolution kernels that were unique and interesting. There was one from this guy out of Russia, I forgot the name, but I remembered who was awesome enough to have written their own kernel. And at some point there, I built out these benchmarks called convnet-benchmarks. They're just benchmarking all the convolution kernels that are available at that time. It hilariously became big enough that at that time AI was getting important, but not important enough that industrial strength players came in to do these kinds of benchmarking and standardization. Like we have MLPerf today. So a lot of the startups were using convnet-benchmarks in their pitch decks as like, oh, you know, on convnet-benchmarks, this is how we fare, so you should fund us. I remember Nervana actually was at the top of the pack because Scott Gray wrote amazingly fast convolution kernels at that time. Very interesting, but separate times. But to answer your question, Alessio, I think mainly Lepton, Fireworks are the two most obvious ones, but I'm sure the fingerprints are a lot wider. They're just people who worked within the PyTorch/Caffe2 cohort of things and now end up at various other places. Swyx [00:18:50]: I think as a, both as an investor and as people looking to build on top of their services, it's an uncomfortable slash like, I don't know what I don't know pitch. Because I've met Yangqing and I've met Lin Qiao. Yeah, I've met these folks and they're like, you know, we are deep in the PyTorch ecosystem and we serve billions of inferences a day or whatever at Facebook and now we can do it for you. And I'm like, okay, that's great.
Like, what should I be wary of or cautious of when these things happen? Because I'm like, obviously this experience is extremely powerful and valuable. I just don't know what I don't know. Like, what should people know about like these sort of new inference as a service companies? Soumith [00:19:32]: I think at that point you would be investing in them for their expertise of one kind. So if they've been at a large company, but they've been doing amazing work, you would be thinking about it as what these people bring to the table is that they're really good at like GPU programming or understanding the complexity of serving models once it hits a certain scale. You know, various expertise like from the infra and AI and GPUs point of view. What you would obviously want to figure out is whether their understanding of the external markets is clear, whether they know and understand how to think about running a business, understanding how to be disciplined about making money or, you know, various things like that. Swyx [00:20:23]: Maybe I'll put it like, actually I will de-emphasize the investing bit and just more as a potential customer. Oh, okay. Like, it's more okay, you know, you have PyTorch gods, of course. Like, what else should I know? Soumith [00:20:37]: I mean, I would not care about who's building something. If I'm trying to be a customer, I would care about whether... Swyx [00:20:44]: Benchmarks. Soumith [00:20:44]: Yeah, I use it and it's usability and reliability and speed, right? Swyx [00:20:51]: Quality as well. Soumith [00:20:51]: Yeah, if someone from some random unknown place came to me and say, user stuff is great. Like, and I have the bandwidth, I probably will give it a shot. And if it turns out to be great, like I'll just use it.

Benchmark drama

Swyx [00:21:07]: Okay, great. And then maybe one more thing about benchmarks, since we already brought it up and you brought up convnet-benchmarks. There was some recent drama around AnyScale. AnyScale released their own benchmarks and obviously they look great on their own benchmarks, but maybe didn't give the other... I feel there are two lines of criticism. One, which is they didn't test apples to apples on the kind of endpoints that the other providers, that they are competitors with, on their benchmarks and that is due diligence baseline. And then the second would be more just optimizing for the right thing. You had some commentary on it. I'll just kind of let you riff. Soumith [00:21:41]: Yeah, I mean, in summary, basically my criticism of that was AnyScale built these benchmarks for end users to just understand what they should pick, right? And that's a very good thing to do. I think what they didn't do a good job of is give that end user a full understanding of what they should pick. Like they just gave them a very narrow slice of understanding. I think they just gave them latency numbers and that's not sufficient, right? You need to understand your total cost of ownership at some reasonable scale. Not oh, one API call is one cent, but a thousand API calls are 10 cents. Like people can misprice to cheat on those benchmarks. So you want to understand, okay, like how much is it going to cost me if I actually subscribe to you and do like a million API calls a month or something? And then you want to understand the latency and reliability, not just from one call you made, but an aggregate of calls you've made over several various times of the day and times of the week.
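A minimal sketch of the kind of aggregate measurement Soumith is arguing for here: many calls spread over time, percentile latencies, and a rough total cost, rather than a single cherry-picked number. The endpoint URL, workload, and price below are placeholders, not any real provider's API or pricing.

```python
# Rough sketch of an "aggregate" LLM API benchmark: many calls, percentile
# latencies, and an estimated cost for the run. Endpoint and price are
# placeholders; swap in a real provider client and its real pricing.

import statistics
import time
import urllib.request

ENDPOINT = "https://example.com/v1/completions"   # placeholder endpoint
PRICE_PER_1K_TOKENS = 0.0010                      # placeholder price, USD

def one_call(prompt: str) -> float:
    """Time a single request; replace with the provider's real client."""
    start = time.monotonic()
    req = urllib.request.Request(ENDPOINT, data=prompt.encode("utf-8"))
    try:
        urllib.request.urlopen(req, timeout=30).read()
    except Exception:
        pass  # a real benchmark would track error rates separately
    return time.monotonic() - start

def run_benchmark(prompts: list[str], rounds: int = 10) -> None:
    # Aggregate many calls (ideally spread across times of day and week),
    # then report percentiles and cost, not one measurement.
    latencies = [one_call(p) for _ in range(rounds) for p in prompts]
    est_tokens = sum(len(p.split()) for p in prompts) * rounds
    print(f"p50 latency: {statistics.median(latencies):.3f}s")
    print(f"p95 latency: {statistics.quantiles(latencies, n=20)[18]:.3f}s")
    print(f"estimated cost of this run: ${est_tokens / 1000 * PRICE_PER_1K_TOKENS:.4f}")
```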
And the nature of the workloads, is it just some generic single paragraph that you're sending that is cacheable? Or is it like testing of real world workload? I think that kind of rigor, like in presenting that benchmark wasn't there. It was a much more narrow sliver of what should have been a good benchmark. That was my main criticism. And I'm pretty sure if before they released it, they showed it to their other stakeholders who would be caring about this benchmark because they are present in it, they would have easily just pointed out these gaps. And I think they didn't do that and they just released it. So I think those were the two main criticisms. I think they were fair and Robert took it well. Swyx [00:23:40]: And he took it very well. And we'll have him on at some point and we'll discuss it. But I think it's important for, I think the market being maturing enough that people start caring and competing on these kinds of things means that we need to establish what best practice is because otherwise everyone's going to play dirty. Soumith [00:23:55]: Yeah, absolutely. My view of the LLM inference market in general is that it's the laundromat model. Like the margins are going to drive down towards the bare minimum. It's going to be all kinds of arbitrage between how much you can get the hardware for and then how much you sell the API and how much latency your customers are willing to let go. You need to figure out how to squeeze your margins. Like what is your unique thing here? Like I think Together and Fireworks and all these people are trying to build some faster CUDA kernels and faster, you know, hardware kernels in general. But those moats only last for a month or two. These ideas quickly propagate. Swyx [00:24:38]: Even if they're not published? Soumith [00:24:39]: Even if they're not published, the idea space is small. So even if they're not published, the discovery rate is going to be pretty high. It's not like we're talking about a combinatorial thing that is really large. You're talking about Llama style LLM models. And we're going to beat those to death on a few different hardware SKUs, right? Like it's not even we have a huge diversity of hardware you're going to aim to run it on. Now when you have such a narrow problem and you have a lot of people working on it, the rate at which these ideas are going to get figured out is going to be pretty rapid. Swyx [00:25:15]: Is it a standard bag of tricks? Like the standard one that I know of is, you know, fusing operators and- Soumith [00:25:22]: Yeah, it's the standard bag of tricks on figuring out how to improve your memory bandwidth and all that, yeah. Alessio [00:25:28]: Any ideas instead of things that are not being beaten to death that people should be paying more attention to?

Novel PyTorch Applications

Swyx [00:25:34]: One thing I was like, you know, you have a thousand operators, right? Like what's the most interesting usage of PyTorch that you're seeing maybe outside of this little bubble? Soumith [00:25:41]: So PyTorch, it's very interesting and scary at the same time, but basically it's used in a lot of exotic ways, like from the ML angle, what kind of models are being built? And you get all the way from state space models and all of these things to stuff like nth-order differentiable models, like neural ODEs and stuff like that. I think there's one set of interestingness factor from the ML side of things. And then there's the other set of interesting factor from the applications point of view.
It's used in everything from Mars Rover simulations to drug discovery to Tesla cars. And there's a huge diversity of applications in which it is used. So in terms of the most interesting application side of things, I think I'm scared at how many interesting things that are also very critical and really important it is used in. I think the scariest was when I went to visit CERN at some point and they said they were using PyTorch and they were using GANs at the same time for particle physics research. And I was scared more about the fact that they were using GANs than they were using PyTorch, because at that time I was a researcher focusing on GANs. But the diversity is probably the most interesting. How many different things it is being used in. I think that's the most interesting to me from the applications perspective. From the models perspective, I think I've seen a lot of them. Like the really interesting ones to me are where we're starting to combine search and symbolic stuff with differentiable models, like the whole AlphaGo style models is one example. And then I think we're attempting to do it for LLMs as well, with various reward models and search. I mean, I don't think PyTorch is being used in this, but the whole AlphaGeometry thing was interesting because again, it's an example of combining the symbolic models with the gradient based ones. But there are stuff like AlphaGeometry that PyTorch is used at, especially when you intersect biology and chemistry with ML. In those areas, you want stronger guarantees on the output. So yeah, maybe from the ML side, those things to me are very interesting right now. Swyx [00:28:03]: Yeah. People are very excited about the AlphaGeometry thing. And it's kind of like, for me, it's theoretical. It's great. You can solve some Olympiad questions. I'm not sure how to make that bridge over into the real world applications, but I'm sure people smarter than me will figure it out.

Synthetic Data vs Symbolic Models

Soumith [00:28:18]: Let me give you an example of it. You know how the whole thing about synthetic data will be the next rage in LLMs is a thing? Swyx [00:28:27]: Already is a rage. Soumith [00:28:28]: Which I think is fairly misplaced in how people perceive it. People think synthetic data is some kind of magic wand that you wave and it's going to be amazing. Synthetic data is useful in neural networks right now because we as humans have figured out a bunch of symbolic models of the world or made up certain symbolic models because of human innate biases. So we've figured out how to ground particle physics in a 30 parameter model. And it's just very hard to compute as in it takes a lot of flops to compute, but it only has 30 parameters or so. I mean, I'm not a physics expert, but it's a very low rank model. We built mathematics as a field that basically is very low rank. Language, a deep understanding of language, like the whole syntactic parse trees and just understanding how language can be broken down and into a formal symbolism is something that we figured out. So we basically as humans have accumulated all this knowledge on these subjects, either synthetic, we created those subjects in our heads, or we grounded some real world phenomenon into a set of symbols. But we haven't figured out how to teach neural networks symbolic world models directly. The only way we have to teach them is generating a bunch of inputs and outputs and doing gradient descent over them.
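A minimal, made-up sketch of the recipe Soumith just described, in PyTorch: a symbolic model we can evaluate exactly generates input/output pairs, and an over-parameterized network is then fit to those pairs by gradient descent. The formula, ranges, and network here are illustrative only, not anything from the episode.

```python
# Sketch: distill a known symbolic model into a neural net via synthetic data.
import torch
import torch.nn as nn

def symbolic_model(x: torch.Tensor) -> torch.Tensor:
    # A low-rank "ground truth" we can evaluate exactly, e.g. projectile
    # range ~ v^2 * sin(2*theta) / g for inputs x = (v, theta).
    v, theta = x[:, 0], x[:, 1]
    return (v ** 2 * torch.sin(2 * theta) / 9.81).unsqueeze(1)

# 1) Generate synthetic input/output pairs from the symbolic model.
inputs = torch.rand(4096, 2) * torch.tensor([50.0, 1.5])   # v in [0, 50], theta in [0, 1.5] rad
targets = symbolic_model(inputs)

# 2) Gradient-descend an over-parameterized net over those pairs.
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(inputs), targets)
    loss.backward()
    opt.step()
print("final MSE:", loss.item())
```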
So in areas where we have the symbolic models and we need to teach all the knowledge we have that is better encoded in the symbolic models, what we're doing is we're generating a bunch of synthetic data, a bunch of input output pairs, and then giving that to the neural network and asking it to learn the same thing that we already have a better low rank model of in gradient descent in a much more over-parameterized way. Outside of this, like where we don't have good symbolic models, like synthetic data obviously doesn't make any sense. So synthetic data is not a magic wand where it'll work in all cases in every case or whatever. It's just where we as humans already have good symbolic models of. We need to impart that knowledge to neural networks and we figured out that synthetic data is a vehicle to impart this knowledge. So, but people, because maybe they don't know enough about synthetic data as a notion, but they hear, you know, the next wave of data revolution is synthetic data. They think it's some kind of magic where we just create a bunch of random data somehow. They don't think about how, and then they think that's just a revolution. And I think that's maybe a gap in understanding most people have in this hype cycle.Swyx [00:31:23]: Yeah, well, it's a relatively new concept, so. Oh, there's two more that I'll put in front of you and then you can see what you respond. One is, you know, I have this joke that it's, you know, it's only synthetic data if it's from the Mistral region of France, otherwise it's just a sparkling distillation, which is what Nous Research is doing. Like they're distilling GPT-4 by creating synthetic data from GPT-4, creating mock textbooks inspired by Phi 2 and then fine tuning open source models like Llama. And so I don't know, I mean, I think that's, should we call that synthetic data? Should we call it something else? I don't know.Soumith [00:31:57]: Yeah, I mean, the outputs of LLMs, are they synthetic data? They probably are, but I think it depends on the goal you have. If your goal is you're creating synthetic data with the goal of trying to distill GPT-4's superiority into another model, I guess you can call it synthetic data, but it also feels like disingenuous because your goal is I need to copy the behavior of GPT-4 and-Swyx [00:32:25]: It's also not just behavior, but data set. So I've often thought of this as data set washing. Like you need one model at the top of the chain, you know, unnamed French company that has that, you know, makes a model that has all the data in it that we don't know where it's from, but it's open source, hey, and then we distill from that and it's great. To be fair, they also use larger models as judges for preference ranking, right? So that is, I think, a very, very accepted use of synthetic.Soumith [00:32:53]: Correct. I think it's a very interesting time where we don't really have good social models of what is acceptable depending on how many bits of information you use from someone else, right? It's like, okay, you use one bit. Is that okay? Yeah, let's accept it to be okay. Okay, what about if you use 20 bits? Is that okay? I don't know. What if you use 200 bits? I don't think we as society have ever been in this conundrum where we have to be like, where is the boundary of copyright or where is the boundary of socially accepted understanding of copying someone else? We haven't been tested this mathematically before,Swyx [00:33:38]: in my opinion. Whether it's transformative use. Yes. 
So yeah, I think this New York Times OpenAI case is gonna go to the Supreme Court and we'll have to decide it because I think we never had to deal with it before. And then finally, for synthetic data, the thing that I'm personally exploring is solving this great stark paradigm difference between RAG and fine-tuning, where you can kind of create synthetic data off of your retrieved documents and then fine tune on that. That's kind of synthetic. All you need is variation or diversity of samples for you to fine tune on. And then you can fine tune new knowledge into your model. I don't know if you've seen that as a direction for synthetic data.Soumith [00:34:13]: I think you're basically trying to, what you're doing is you're saying, well, language, I know how to parametrize language to an extent. And I need to teach my model variations of this input data so that it's resilient or invariant to language uses of that data.Swyx [00:34:32]: Yeah, it doesn't overfit on the wrong source documents.Soumith [00:34:33]: So I think that's 100% synthetic. You understand, the key is you create variations of your documents and you know how to do that because you have a symbolic model or like some implicit symbolic model of language.Swyx [00:34:48]: Okay.Alessio [00:34:49]: Do you think the issue with symbolic models is just the architecture of the language models that we're building? I think maybe the thing that people grasp is the inability of transformers to deal with numbers because of the tokenizer. Is it a fundamental issue there too? And do you see alternative architectures that will be better with symbolic understanding?Soumith [00:35:09]: I am not sure if it's a fundamental issue or not. I think we just don't understand transformers enough. I don't even mean transformers as an architecture. I mean the use of transformers today, like combining the tokenizer and transformers and the dynamics of training, when you show math heavy questions versus not. I don't have a good calibration of whether I know the answer or not. I, you know, there's common criticisms that are, you know, transformers will just fail at X. But then when you scale them up to sufficient scale, they actually don't fail at that X. I think there's this entire subfield where they're trying to figure out these answers called like the science of deep learning or something. So we'll get to know more. I don't know the answer.Meta AI and Llama 2/3Swyx [00:35:57]: Got it. Let's touch a little bit on just Meta AI and you know, stuff that's going on there. Maybe, I don't know how deeply you're personally involved in it, but you're our first guest with Meta AI, which is really fantastic. And Llama 1 was, you know, you are such a believer in open source. Llama 1 was more or less the real breakthrough in open source AI. The most interesting thing for us covering on this, in this podcast was the death of Chinchilla, as people say. Any interesting insights there around the scaling laws for open source models or smaller models or whatever that design decision was when you guys were doing it?Soumith [00:36:31]: So Llama 1 was Guillaume Lample and team. There was OPT before, which I think I'm also very proud of because we bridged the gap in understanding of how complex it is to train these models to the world. Like until then, no one really in gory detail published.Swyx [00:36:50]: The logs.Soumith [00:36:51]: Yeah. Like, why is it complex? And everyone says, oh, it's complex. But no one really talked about why it's complex. 
I think OPT was cool.Swyx [00:37:02]: I met Susan and she's very, very outspoken. Yeah.Soumith [00:37:05]: We probably, I think, didn't train it for long enough, right? That's kind of obvious in retrospect.Swyx [00:37:12]: For a 175B. Yeah. You trained it according to Chinchilla at the time or?Soumith [00:37:17]: I can't remember the details, but I think it's a commonly held belief at this point that if we trained OPT longer, it would actually end up being better. Llama 1, I think, was Guillaume Lample and team. Guillaume is fantastic and went on to build Mistral. I wasn't too involved in that side of things. So I don't know what you're asking me, which is how did they think about scaling laws and all of that? Llama 2, I was more closely involved in. I helped them a reasonable amount with their infrastructure needs and stuff. And Llama 2, I think, was more like, let's get to the evolution. At that point, we kind of understood what we were missing from the industry's understanding of LLMs. And we needed more data and we needed to train the models for longer. And we made, I think, a few tweaks to the architecture and we scaled up more. And that was Llama 2. I think Llama 2, you can think of it as after Guillaume left, the team kind of rebuilt their muscle around Llama 2. And Hugo, I think, who's the first author is fantastic. And I think he did play a reasonably big role in Llama 1 as well.Soumith [00:38:35]: And he overlaps between Llama 1 and 2. So in Llama 3, obviously, hopefully, it'll be awesome.Alessio [00:38:42]: Just one question on Llama 2, and then we'll try and fish Llama 3 spoilers out of you. In the Llama 2 paper, the loss curves of the 34B and 70B parameter models, they still seem kind of steep. Like they could go lower. How, from an infrastructure level, how do you allocate resources? Could they have just gone longer or were you just, hey, this is all the GPUs that we can burn and let's just move on to Llama 3 and then make that one better?Soumith [00:39:07]: Instead of answering specifically about that Llama 2 situation or whatever, I'll tell you how we think about things. Generally, we're, I mean, Mark released some numbers, right?Swyx [00:39:20]: So let's cite those things again. All I remember is like 600K GPUs.Soumith [00:39:24]: That is by the end of this year and 600K H100 equivalents. With 250K H100s, including all of our other GPU or accelerator stuff, it would be 600-and-something-K aggregate capacity.Swyx [00:39:38]: That's a lot of GPUs.Soumith [00:39:39]: We'll talk about that separately. But the way we think about it is we have a train of models, right? Llama 1, 2, 3, 4. And we have a bunch of GPUs. I don't think we're short of GPUs. Like-Swyx [00:39:54]: Yeah, no, I wouldn't say so. Yeah, so it's all a matter of time.Soumith [00:39:56]: I think time is the biggest bottleneck. It's like, when do you stop training the previous one and when do you start training the next one? And how do you make those decisions? The data, do you have net new data, better clean data for the next one in a way that it's not worth really focusing on the previous one? It's just a standard iterative product. You're like, when is the iPhone 1? When do you start working on iPhone 2? Where is the iPhone? And so on, right? So mostly the considerations are time and generation, rather than GPUs, in my opinion.Alessio [00:40:31]: So one of the things with the scaling laws, like Chinchilla is optimal to balance training and inference costs. 
I think at Meta's scale, you would rather pay a lot more maybe at training and then save on inference. How do you think about that from infrastructure perspective? I think in your tweet, you say you can try and guess on like how we're using these GPUs. Can you just give people a bit of understanding? It's like, because I've already seen a lot of VCs say, Llama 3 has been trained on 600,000 GPUs and that's obviously not true, I'm sure. How do you allocate between the research, FAIR and the Llama training, the inference on Instagram suggestions that get me to scroll, like AI-generated stickers on WhatsApp and all of that?Soumith [00:41:11]: Yeah, we haven't talked about any of this publicly, but as a broad stroke, it's like how we would allocate resources of any other kinds at any company. You run a VC portfolio, how do you allocate your investments between different companies or whatever? You kind of make various trade-offs and you kind of decide, should I invest in this project or this other project, or how much should I invest in this project? It's very much a zero sum of trade-offs. And it also comes into play, how are your clusters configured, like overall, what you can fit of what size and what cluster and so on. So broadly, there's no magic sauce here. I mean, I think the details would add more spice, but also wouldn't add more understanding. It's just gonna be like, oh, okay, I mean, this looks like they just think about this as I would normally do.Alessio [00:42:05]: So even the GPU rich run through the same struggles of having to decide where to allocate things.Soumith [00:42:11]: Yeah, I mean, at some point I forgot who said it, but you kind of fit your models to the amount of compute you have. If you don't have enough compute, you figure out how to make do with smaller models. But no one as of today, I think would feel like they have enough compute. I don't think I've heard any company within the AI space be like, oh yeah, like we feel like we have sufficient compute and we couldn't have done better. So that conversation, I don't think I've heard from any of my friends at other companies.EleutherSwyx [00:42:47]: Stella from Eleuther sometimes says that because she has a lot of donated compute. She's trying to put it to interesting uses, but for some reason she's decided to stop making large models.Soumith [00:42:57]: I mean, that's a cool, high conviction opinion that might pay out.Swyx [00:43:01]: Why?Soumith [00:43:02]: I mean, she's taking a path that most people don't care to take about in this climate and she probably will have very differentiated ideas. I mean, think about the correlation of ideas in AI right now. It's so bad, right? So everyone's fighting for the same pie. In some weird sense, that's partly why I don't really directly work on LLMs. I used to do image models and stuff and I actually stopped doing GANs because GANs were getting so hot that I didn't have any calibration of whether my work would be useful or not because, oh yeah, someone else did the same thing you did. It's like, there's so much to do, I don't understand why I need to fight for the same pie. So I think Stella's decision is very smart.Making BetsAlessio [00:43:53]: And how do you reconcile that with how we started the discussion about intrinsic versus extrinsic kind of like accomplishment or success? How should people think about that especially when they're doing a PhD or early in their career? 
I think in Europe, I walked through a lot of the posters and whatnot, there seems to be mode collapse in a way in the research, a lot of people working on the same things. Is it worth it for a PhD to not take a bet on something that is maybe not as interesting just because of funding and visibility and whatnot? Or yeah, what suggestions would you give?Soumith [00:44:28]: I think there's a baseline level of compatibility you need to have with the field. Basically, you need to figure out if you will get paid enough to eat, right? Like whatever reasonable normal lifestyle you want to have as a baseline. So you at least have to pick a problem within the neighborhood of fundable. Like you wouldn't wanna be doing something so obscure that people are like, I don't know, like you can work on it.Swyx [00:44:59]: Would a limit on fundability be, I'm just observing, something like three months of compute, right? That's the top line, that's the like max that you can spend on any one project.Soumith [00:45:09]: But like, I think that's very ill specified, like how much compute, right? I think that the notion of fundability is broader. It's more like, hey, is this family of models within the acceptable set of, you're not crazy or something, right? Even something like neural ODEs, which is a very boundary pushing thing, or state space models or whatever. Like all of these things I think are still in fundable territory. When you're talking about, I'm gonna do one of the neuromorphic models and then apply image classification to them or something, then it becomes a bit questionable. Again, it depends on your motivation. Maybe if you're a neuroscientist, it actually is feasible. But if you're an AI engineer, like the audience of these podcasts, then it's more questionable. The way I think about it is, you need to figure out how you can be in the baseline level of fundability just so that you can just live. And then after that, really focus on intrinsic motivation and, depending on your strengths, like how you can play to your strengths and your interests at the same time. Like I try to look at a bunch of ideas that are interesting to me, but also try to play to my strengths. I'm not gonna go work on theoretical ML. I'm interested in it, but when I want to work on something like that, I try to partner with someone who is actually a good theoretical ML person and see if I actually have any value to provide. And if they think I do, then I come in. So I think you'd want to find that intersection of ideas you like, and that also play to your strengths. And I'd go from there. Everything else, like actually finding extrinsic success and all of that, the way I think about it is like somewhat immaterial. When you're talking about building ecosystems and stuff, slightly different considerations come into play, but that's a different conversation.Swyx [00:47:06]: We're gonna pivot a little bit to just talking about open source AI. But one more thing I wanted to establish for Meta is this 600K number, just kind of rounding out the discussion, that's for all Meta. So including your own inference needs, right? It's not just about training.Soumith [00:47:19]: It's gonna be the number in our data centers for all of Meta, yeah.Swyx [00:47:23]: Yeah, so there's a decent amount of workload serving Facebook and Instagram and whatever. And then is there interest in like your own hardware?MTIASoumith [00:47:31]: We already talked about our own hardware. It's called MTIA. 
Our own silicon, I think we've even showed the standard photograph of you holding the chip that doesn't work. Like as in the chip that you basically just get like-Swyx [00:47:51]: As a test, right?Soumith [00:47:52]: Yeah, a test chip or whatever. So we are working on our silicon and we'll probably talk more about it when the time is right, but-Swyx [00:48:00]: Like what gaps do you have that the market doesn't offer?Soumith [00:48:04]: Okay, I mean, this is easy to answer. So basically, remember how I told you about there's this memory hierarchy and like sweet spots and all of that? Fundamentally, when you build a hardware, you make it general enough that a wide set of customers and a wide set of workloads can use it effectively while trying to get the maximum level of performance they can. The more specialized you make the chip, the more hardware efficient it's going to be, the more power efficient it's gonna be, the more easier it's going to be to find the software, like the kernel's right to just map that one or two workloads to that hardware and so on. So it's pretty well understood across the industry that if you have a sufficiently large volume, enough workload, you can specialize it and get some efficiency gains, like power gains and so on. So the way you can think about everyone building, every large company building silicon, I think a bunch of the other large companies are building their own silicon as well, is they, each large company has a sufficient enough set of verticalized workloads that can be specialized that have a pattern to them that say a more generic accelerator like an NVIDIA or an AMD GPU does not exploit. So there is some level of power efficiency that you're leaving on the table by not exploiting that. And you have sufficient scale and you have sufficient forecasted stability that those workloads will exist in the same form, that it's worth spending the time to build out a chip to exploit that sweet spot. Like obviously something like this is only useful if you hit a certain scale and that your forecasted prediction of those kind of workloads being in the same kind of specializable exploitable way is true. So yeah, that's why we're building our own chips.Swyx [00:50:08]: Awesome.Open Source AIAlessio [00:50:09]: Yeah, I know we've been talking a lot on a lot of different topics and going back to open source, you had a very good tweet. You said that a single company's closed source effort rate limits against people's imaginations and needs. How do you think about all the impact that some of the Meta AI work in open source has been doing and maybe directions of the whole open source AI space?Soumith [00:50:32]: Yeah, in general, I think first, I think it's worth talking about this in terms of open and not just open source, because like with the whole notion of model weights, no one even knows what source means for these things. But just for the discussion, when I say open source, you can assume it's just I'm talking about open. And then there's the whole notion of licensing and all that, commercial, non-commercial, commercial with clauses and all that. I think at a fundamental level, the most benefited value of open source is that you make the distribution to be very wide. It's just available with no friction and people can do transformative things in a way that's very accessible. Maybe it's open source, but it has a commercial license and I'm a student in India. I don't care about the license. I just don't even understand the license. 
But like the fact that I can use it and do something with it is very transformative to me. Like I got this thing in a very accessible way. And then it's various degrees, right? And then if it's open source, but it's actually a commercial license, then a lot of companies are gonna benefit from gaining value that they didn't previously have, that they maybe had to pay a closed source company for it. So open source is just a very interesting tool that you can use in various ways. So there's, again, two kinds of open source. One is some large company doing a lot of work and then open sourcing it. And that kind of effort is not really feasible by say a band of volunteers doing it the same way. So there's both a capital and operational expenditure that the large company just decided to ignore and give it away to the world for some benefits of some kind. They're not as tangible as direct revenue. So in that part, Meta has been doing incredibly good things. They fund a huge amount of the PyTorch development. They've open sourced Llama and those family of models and several other fairly transformative projects. FAISS is one, Segment Anything, Detectron, Detectron 2, DensePose. I mean, it's-Swyx [00:52:52]: Seamless. Yeah, seamless.Soumith [00:52:53]: Like it's just the list is so long that we're not gonna cover it. So I think Meta comes into that category where we spend a lot of CapEx and OpEx and we have a high talent density of great AI people and we open source our stuff. And the thesis for that, I remember when FAIR was started, the common thing was like, wait, why would Meta wanna start an open AI lab? Like what exactly is the benefit from a commercial perspective? And back then the thesis was very simple. It was AI is currently rate limiting Meta's ability to do things. Our ability to build various product integrations, moderation, various other factors. Like AI was the limiting factor and we just wanted AI to advance more and we didn't care if the IP of the AI was uniquely in our possession or not. However the field advances, that accelerates Meta's ability to build a better product. So we just built an open AI lab and we said, if this helps accelerate the progress of AI, that's strictly great for us. But very easy, rational, right? Still the same to a large extent with the Llama stuff. And it's the same values, but the argument, it's a bit more nuanced. And then there's a second kind of open source, which is, oh, we built this project, nights and weekends and we're very smart people and we open sourced it and then we built a community around it. This is the Linux kernel and various software projects like that. So I think about open source, like both of these things being beneficial and both of these things being different. They're different and beneficial in their own ways. The second one is really useful when there's an active arbitrage to be done. If someone's not really looking at a particular space because it's not commercially viable or whatever, like a band of volunteers can just coordinate online and do something and then make that happen. And that's great.Open Source LLMsI wanna cover a little bit about open source LLMs maybe. So open source LLMs have been very interesting because I think we were trending towards an increase in open source in AI from 2010 all the way to 2017 or something. Like where more and more pressure within the community was to open source their stuff so that their methods and stuff get adopted. 
And then the LLM revolution kind of took the opposite effect. OpenAI stopped open sourcing their stuff and DeepMind kind of didn't, like all the other cloud and all these other providers, they didn't open source their stuff. And it was not good in the sense that first science done in isolation probably will just form its own bubble where people believe their own b******t or whatever. So there's that problem. And then there was the other problem which was the accessibility part. Like, okay, I again always go back to I'm a student in India with no money. What is my accessibility to any of these closed source models? At some scale I have to pay money. That makes it a non-starter and stuff. And there's also the control thing. I strongly believe if you want human aligned stuff, you want all humans to give feedback. And you want all humans to have access to that technology in the first place. And I actually have seen, living in New York, whenever I come to Silicon Valley, I see a different cultural bubble. Like all the friends I hang out with talk about some random thing like Dyson Spheres or whatever, that's a thing. And most of the world doesn't know or care about any of this stuff. It's definitely a bubble and bubbles can form very easily. And when you make a lot of decisions because you're in a bubble, they're probably not globally optimal decisions. So I think open source, the distribution of open source powers a certain kind of non-falsifiability that I think is very important. I think on the open source models, like it's going great in the fact that LoRA I think came out of the necessity of open source models needing to be fine-tunable in some way. Yeah, and I think DPO also came out of the academic open source side of things. So do any of the closed source labs, did any of them already have LoRA or DPO internally? Maybe, but that does not advance humanity in any way. It advances some companies' probability of doing the winner takes all that I talked about earlier in the podcast.Open Source and TrustI don't know, it just feels fundamentally good. Like when people try to, you know, people are like, well, what are the ways in which it is not okay? I find most of these arguments, and this might be a little controversial, but I find a lot of arguments based on whether closed source models are safer or open source models are safer very much related to what kind of culture they grew up in, what kind of society they grew up in. If they grew up in a society that they trusted, then I think they take the closed source argument. And if they grew up in a society that they couldn't trust, where the norm was that you didn't trust your government, obviously it's corrupt or whatever, then I think the open source argument is what they take. I think there's a deep connection to like people's innate biases from their childhood and their trust in society and governmental aspects that push them towards one opinion or the other. And I'm definitely in the camp of open source is definitely going to actually have better outcomes for society. Closed source to me just means that centralization of power, which, you know, is really hard to trust. So I think it's going well.
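Since LoRA comes up above as something the open ecosystem produced out of necessity, here is a minimal, self-contained sketch of the idea: a frozen pretrained weight plus a small trainable low-rank update. It follows the published method in spirit only, not any particular library's API, and the rank and scaling values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        # A starts small and random, B starts at zero, so the wrapped layer
        # initially behaves exactly like the frozen base layer.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Wrap one projection of a pretrained model; only A and B get gradients,
# so the fine-tune touches a tiny fraction of the parameters.
layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")
```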

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

We're writing this one day after the monster release of OpenAI's Sora and Gemini 1.5. We covered this on ‘s ThursdAI space, so head over there for our takes.IRL: We're ONE WEEK away from Latent Space: Final Frontiers, the second edition and anniversary of our first ever Latent Space event! Also: join us on June 25-27 for the biggest AI Engineer conference of the year!Online: All three Discord clubs are thriving. Join us every Wednesday/Friday!Almost 12 years ago, while working at Spotify, Erik Bernhardsson built one of the first open source vector databases, Annoy, based on ANN search. He also built Luigi, one of the predecessors to Airflow, which helps data teams orchestrate and execute data-intensive and long-running jobs. Surprisingly, he didn't start yet another vector database company, but instead in 2021 founded Modal, the “high-performance cloud for developers”. In 2022 they opened doors to developers after their seed round, and in 2023 announced their GA with a $16m Series A.More importantly, they have won fans among both household names like Ramp, Scale AI, Substack, and Cohere, and newer startups like (upcoming guest!) Suno.ai and individual hackers (Modal was the top tool of choice in the Vercel AI Accelerator):We've covered the nuances of GPU workloads, and how we need new developer tooling and runtimes for them (see our episodes with Chris Lattner of Modular and George Hotz of tiny to start). In this episode, we run through the major limitations of the actual infrastructure behind the clouds that run these models, and how Erik envisions the “postmodern data stack”. In his 2021 blog post “Software infrastructure 2.0: a wishlist”, Erik had “Truly serverless” as one of his points:* The word cluster is an anachronism to an end-user in the cloud! I'm already running things in the cloud where there's elastic resources available at any time. Why do I have to think about the underlying pool of resources? Just maintain it for me.* I don't ever want to provision anything in advance of load.* I don't want to pay for idle resources. Just let me pay for whatever resources I'm actually using.* Serverless doesn't mean it's a burstable VM that saves its instance state to disk during periods of idle.Swyx called this Self Provisioning Runtimes back in the day. Modal doesn't put you in YAML hell, preferring to colocate infra provisioning right next to the code that utilizes it, so you can just add GPU (and disk, and retries…):After 3 years, we finally have a big market push for this: running inference on generative models is going to be the killer app for serverless, for a few reasons:* AI models are stateless: even in conversational interfaces, each message generation is a fully-contained request to the LLM. There's no knowledge that is stored in the model itself between messages, which means that tear down / spin up of resources doesn't create any headaches with maintaining state.* Token-based pricing is better aligned with serverless infrastructure than fixed monthly costs of traditional software.* GPU scarcity makes it really expensive to have reserved instances that are available to you 24/7. It's much more convenient to build with a serverless-like infrastructure.In the episode we covered a lot more topics like maximizing GPU utilization, why Oracle Cloud rocks, and how Erik has never owned a TV in his life. 
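For readers who haven't tried it, this is roughly what the "just add GPU (and disk, and retries)" point looks like in code. It's a hand-written sketch rather than a copy from Modal's docs, and decorator names and parameters have shifted across SDK versions, so check the current API before relying on it.

```python
import modal

stub = modal.Stub("sketch")  # newer SDK versions call this modal.App
image = modal.Image.debian_slim().pip_install("torch")

@stub.function(
    image=image,
    gpu="A100",   # attach a GPU by declaring it next to the code that needs it
    retries=3,    # retry transient failures
    timeout=600,  # seconds
)
def embed(texts: list[str]) -> list[list[float]]:
    import torch  # available inside the container image, not necessarily locally
    # ... run the actual model here; placeholder output below ...
    return [[0.0] * 768 for _ in texts]

@stub.local_entrypoint()
def main():
    # Invoked locally; the function body executes remotely in its own container.
    print(len(embed.remote(["hello", "world"])))
```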
Enjoy!Show Notes* Modal* ErikBot* Erik's Blog* Software Infra 2.0 Wishlist* Luigi* Annoy* Hetzner* CoreWeave* Cloudflare FaaS* Poolside AI* Modular Inference EngineChapters* [00:00:00] Introductions* [00:02:00] Erik's OSS work at Spotify: Annoy and Luigi* [00:06:22] Starting Modal* [00:07:54] Vision for a "postmodern data stack"* [00:10:43] Solving container cold start problems* [00:12:57] Designing Modal's Python SDK* [00:15:18] Self-Revisioning Runtime* [00:19:14] Truly Serverless Infrastructure* [00:20:52] Beyond model inference* [00:22:09] Tricks to maximize GPU utilization* [00:26:27] Differences in AI and data science workloads* [00:28:08] Modal vs Replicate vs Modular and lessons from Heroku's "graduation problem"* [00:34:12] Creating Erik's clone "ErikBot"* [00:37:43] Enabling massive parallelism across thousands of GPUs* [00:39:45] The Modal Sandbox for agents* [00:43:51] Thoughts on the AI Inference War* [00:49:18] Erik's best tweets* [00:51:57] Why buying hardware is a waste of money* [00:54:18] Erik's competitive programming backgrounds* [00:59:02] Why does Sweden have the best Counter Strike players?* [00:59:53] Never owning a car or TV* [01:00:21] Advice for infrastructure startupsTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:14]: Hey, and today we have in the studio Erik Bernhardsson from Modal. Welcome.Erik [00:00:19]: Hi. It's awesome being here.Swyx [00:00:20]: Yeah. Awesome seeing you in person. I've seen you online for a number of years as you were building on Modal and I think you're just making a San Francisco trip just to see people here, right? I've been to like two Modal events in San Francisco here.Erik [00:00:34]: Yeah, that's right. We're based in New York, so I figured sometimes I have to come out to capital of AI and make a presence.Swyx [00:00:40]: What do you think is the pros and cons of building in New York?Erik [00:00:45]: I mean, I never built anything elsewhere. I lived in New York the last 12 years. I love the city. Obviously, there's a lot more stuff going on here and there's a lot more customers and that's why I'm out here. I do feel like for me, where I am in life, I'm a very boring person. I kind of work hard and then I go home and hang out with my kids. I don't have time to go to events and meetups and stuff anyway. In that sense, New York is kind of nice. I walk to work every morning. It's like five minutes away from my apartment. It's very time efficient in that sense. Yeah.Swyx [00:01:10]: Yeah. It's also a good life. So we'll do a brief bio and then we'll talk about anything else that people should know about you. Actually, I was surprised to find out you're from Sweden. You went to college in KTH and your master's was in implementing a scalable music recommender system. Yeah.Erik [00:01:27]: I had no idea. Yeah. So I actually studied physics, but I grew up coding and I did a lot of programming competition and then as I was thinking about graduating, I got in touch with an obscure music streaming startup called Spotify, which was then like 30 people. And for some reason, I convinced them, why don't I just come and write a master's thesis with you and I'll do some cool collaborative filtering, despite not knowing anything about collaborative filtering really. But no one knew anything back then. 
So I spent six months at Spotify basically building a prototype of a music recommendation system and then turned that into a master's thesis. And then later when I graduated, I joined Spotify full time.Swyx [00:02:00]: So that was the start of your data career. You also wrote a couple of popular open source tooling while you were there. Is that correct?Erik [00:02:09]: No, that's right. I mean, I was at Spotify for seven years, so this is a long stint. And Spotify was a wild place early on and I mean, data space is also a wild place. I mean, it was like Hadoop cluster in the like foosball room on the floor. It was a lot of crude, like very basic infrastructure and I didn't know anything about it. And like I was hired to kind of figure out data stuff. And I started hacking on a recommendation system and then, you know, got sidetracked in a bunch of other stuff. I fixed a bunch of reporting things and set up A-B testing and started doing like business analytics and later got back to music recommendation system. And a lot of the infrastructure didn't really exist. Like there was like Hadoop back then, which is kind of bad and I don't miss it. But I spent a lot of time with that. As a part of that, I ended up building a workflow engine called Luigi, which is like briefly like somewhat like widely ended up being used by a bunch of companies. Sort of like, you know, kind of like Airflow, but like before Airflow. I think it did some things better, some things worse. I also built a vector database called Annoy, which is like for a while, it was actually quite widely used. In 2012, so it was like way before like all this like vector database stuff ended up happening. And funny enough, I was actually obsessed with like vectors back then. Like I was like, this is going to be huge. Like just give it like a few years. I didn't know it was going to take like nine years and then there's going to suddenly be like 20 startups doing vector databases in one year. So it did happen. In that sense, I was right. I'm glad I didn't start a startup in the vector database space. I would have started way too early. But yeah, that was, yeah, it was a fun seven years as part of it. It was a great culture, a great company.Swyx [00:03:32]: Yeah. Just to take a quick tangent on this vector database thing, because we probably won't revisit it but like, has anything architecturally changed in the last nine years?Erik [00:03:41]: I'm actually not following it like super closely. I think, you know, some of the best algorithms are still the same as like hierarchical navigable small world.Swyx [00:03:51]: Yeah. HNSW.Erik [00:03:52]: Exactly. I think now there's like product quantization, there's like some other stuff that I haven't really followed super closely. I mean, obviously, like back then it was like, you know, it's always like very simple. It's like a C++ library with Python bindings and you could mmap big files and into memory and like they had some lookups. I used like this kind of recursive, like hyperspace splitting strategy, which is not that good, but it sort of was good enough at that time. But I think a lot of like HNSW is still like what people generally use. Now of course, like databases are much better in the sense like to support like inserts and updates and stuff like that. I know I never supported that. Yeah, it's sort of exciting to finally see like vector databases becoming a thing.Swyx [00:04:30]: Yeah. Yeah. 
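For readers who haven't seen it, this is roughly what using Annoy looks like; the calls below match the library as we remember it, but check the README for current details.

```python
import random
from annoy import AnnoyIndex

dim = 64                          # embedding dimensionality
index = AnnoyIndex(dim, "angular")  # angular = cosine-like distance

# Index some vectors (think track embeddings from a recommender).
for item_id in range(10_000):
    index.add_item(item_id, [random.gauss(0, 1) for _ in range(dim)])

index.build(10)                   # 10 trees: more trees, better recall, bigger index
index.save("tracks.ann")          # the file is mmap-ed on load, so startup is cheap

# Query: approximate nearest neighbours for an arbitrary vector.
query = [random.gauss(0, 1) for _ in range(dim)]
neighbours = index.get_nns_by_vector(query, 10)
print(neighbours)
```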
And then maybe one takeaway on most interesting lesson from Daniel Ek?Erik [00:04:36]: I mean, I think Daniel Ek, you know, he started Spotify very young. Like he was like 25, something like that. And that was like a good lesson. But like he, in a way, like I think he was a very good leader. Like there was never anything like, no scandals or like no, he wasn't very eccentric at all. It was just kind of like very like level headed, like just like ran the company very well, like never made any like obvious mistakes or I think it was like a few bets that maybe like in hindsight were like a little, you know, like took us, you know, too far in one direction or another. But overall, I mean, I think he was a great CEO, like definitely, you know, up there, like generational CEO, at least for like Swedish startups.Swyx [00:05:09]: Yeah, yeah, for sure. Okay, we should probably move to make our way towards Modal. So then you spent six years as CTO of Better. You were an early engineer and then you scaled up to like 300 engineers.Erik [00:05:21]: I joined as a CTO when there was like no tech team. And yeah, that was a wild chapter in my life. Like the company did very well for a while. And then like during the pandemic, yeah, it was kind of a weird story, but yeah, it kind of collapsed.Swyx [00:05:32]: Yeah, laid off people poorly.Erik [00:05:34]: Yeah, yeah. It was like a bunch of stories. Yeah. I mean, the company like grew from like 10 people when I joined at 10,000, now it's back to a thousand. But yeah, they actually went public a few months ago, kind of crazy. They're still around, like, you know, they're still, you know, doing stuff. So yeah, very kind of interesting six years of my life for non-technical reasons, like I managed like three, four hundred, but yeah, like learning a lot of that, like recruiting. I spent all my time recruiting and stuff like that. And so managing at scale, it's like nice, like now in a way, like when I'm building my own startup. It's actually something I like, don't feel nervous about at all. Like I've managed a scale, like I feel like I can do it again. It's like very different things that I'm nervous about as a startup founder. But yeah, I started Modal three years ago after sort of, after leaving Better, I took a little bit of time off during the pandemic and, but yeah, pretty quickly I was like, I got to build something. I just want to, you know. Yeah. And then yeah, Modal took form in my head, took shape.Swyx [00:06:22]: And as far as I understand, and maybe we can sort of trade off questions. So the quick history is started Modal in 2021, got your seed with Sarah from Amplify in 2022. You just announced your Series A with Redpoint. That's right. And that brings us up to mostly today. Yeah. Most people, I think, were expecting you to build for the data space.Erik: But it is the data space.Swyx:: When I think of data space, I come from like, you know, Snowflake, BigQuery, you know, Fivetran, Nearby, that kind of stuff. And what Modal became is more general purpose than that. Yeah.Erik [00:06:53]: Yeah. I don't know. It was like fun. I actually ran into like Edo Liberty, the CEO of Pinecone, like a few weeks ago. And he was like, I was so afraid you were building a vector database. No, I started Modal because, you know, like in a way, like I work with data, like throughout my most of my career, like every different part of the stack, right? 
Like I touched everything from like business analytics to like deep learning, you know, like building, you know, training neural networks, the scale, like everything in between. And so one of the thoughts, like, and one of the observations I had when I started Modal or like why I started was like, I just wanted to make, build better tools for data teams. And like very, like sort of abstract thing, but like, I find that the data stack is, you know, full of like point solutions that don't integrate well. And still, when you look at like data teams today, you know, like every startup ends up building their own internal Kubernetes wrapper or whatever. And you know, all the different data engineers and machine learning engineers end up kind of struggling with the same things. So I started thinking about like, how do I build a new data stack, which is kind of a megalomaniac project, like, because you kind of want to like throw out everything and start over.Swyx [00:07:54]: It's almost a modern data stack.Erik [00:07:55]: Yeah, like a postmodern data stack. And so I started thinking about that. And a lot of it came from like, like more focused on like the human side of like, how do I make data teams more productive? And like, what are the technology tools that they need? And like, you know, drew out a lot of charts of like, how the data stack looks, you know, what are different components. And it was actually very interesting, like workflow scheduling, because it kind of sits in like a nice sort of, you know, it's like a hub in the graph of like data products. But it was kind of hard to like, kind of do that in a vacuum, and also to monetize it to some extent. I got very interested in like the layers below at some point. And like, at the end of the day, like most people have code to have to run somewhere. So I think about like, okay, well, how do you make that nice? Like how do you make that? And in particular, like the thing I always like thought about, like developer productivity is like, I think the best way to measure developer productivity is like in terms of the feedback loops, like how quickly when you iterate, like when you write code, like how quickly can you get feedback. And at the innermost loop, it's like writing code and then running it. And like, as soon as you start working with the cloud, like it's like takes minutes suddenly, because you have to build a Docker container and push it to the cloud and like run it, you know. So that was like the initial focus for me was like, I just want to solve that problem. Like I want to, you know, build something that lets you run things in the cloud and like retain the sort of, you know, the joy of productivity as when you're running things locally. And in particular, I was quite focused on data teams, because I think they had a couple unique needs that weren't well served by the infrastructure at that time, or like still is in like, in particular, like Kubernetes, I feel like it's like kind of worked okay for back end teams, but not so well for data teams. And very quickly, I got sucked into like a very deep like rabbit hole of like...Swyx [00:09:24]: Not well for data teams because of burstiness. Yeah, for sure.Erik [00:09:26]: So like burstiness is like one thing, right? Like, you know, like you often have this like fan out, you want to like apply some function over very large data sets. 
Another thing tends to be like hardware requirements, like you need like GPUs and like, I've seen this in many companies, like you go, you know, data scientists go to a platform team and they're like, can we add GPUs to the Kubernetes? And they're like, no, like, that's, you know, complex, and we're not gonna, so like just getting GPU access. And then like, I mean, I also like data code, like frankly, or like machine learning code like tends to be like, super annoying in terms of like environments, like you end up having like a lot of like custom, like containers and like environment conflicts. And like, it's very hard to set up like a unified container that like can serve like a data scientist, because like, there's always like packages that break. And so I think there's a lot of different reasons why the technology wasn't well suited for back end. And I think the attitude at that time is often like, you know, like you had friction between the data team and the platform team, like, well, it works for the back end stuff, you know, why don't you just like, you know, make it work. But like, I actually felt like data teams, you know, or at this point now, like there's so much, so many people working with data, and like they, to some extent, like deserve their own tools and their own tool chains, and like optimizing for that is not something people have done. So that's, that's sort of like very abstract philosophical reason why I started Model. And then, and then I got sucked into this like rabbit hole of like container cold start and, you know, like whatever, Linux, page cache, you know, file system optimizations.Swyx [00:10:43]: Yeah, tell people, I think the first time I met you, I think you told me some numbers, but I don't remember, like, what are the main achievements that you were unhappy with the status quo? And then you built your own container stack?Erik [00:10:52]: Yeah, I mean, like, in particular, it was like, in order to have that loop, right? You want to be able to start, like take code on your laptop, whatever, and like run in the cloud very quickly, and like running in custom containers, and maybe like spin up like 100 containers, 1000, you know, things like that. And so container cold start was the initial like, from like a developer productivity point of view, it was like, really, what I was focusing on is, I want to take code, I want to stick it in container, I want to execute in the cloud, and like, you know, make it feel like fast. And when you look at like, how Docker works, for instance, like Docker, you have this like, fairly convoluted, like very resource inefficient way, they, you know, you build a container, you upload the whole container, and then you download it, and you run it. And Kubernetes is also like, not very fast at like starting containers. So like, I started kind of like, you know, going a layer deeper, like Docker is actually like, you know, there's like a couple of different primitives, but like a lower level primitive is run C, which is like a container runner. And I was like, what if I just take the container runner, like run C, and I point it to like my own root file system, and then I built like my own virtual file system that exposes files over a network instead. 
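A rough sketch of the mechanic Erik describes, not Modal's actual code: skip unpacking a multi-gigabyte image and instead point runc at a root filesystem that is served lazily (here a pre-mounted directory stands in for their network-backed file system). Paths and the workload are illustrative assumptions, and running this requires runc installed and sufficient privileges.

```python
import json
import subprocess
from pathlib import Path

bundle = Path("/tmp/bundle")
bundle.mkdir(parents=True, exist_ok=True)

# `runc spec` writes a default OCI config.json into the bundle directory.
subprocess.run(["runc", "spec"], cwd=bundle, check=True)

# Point the container root at a lazily served filesystem mount instead of an
# unpacked image; /mnt/lazy-rootfs is a placeholder for the network-backed FS.
config = json.loads((bundle / "config.json").read_text())
config["root"] = {"path": "/mnt/lazy-rootfs", "readonly": True}
config["process"]["args"] = ["python3", "-c", "print('hello from the container')"]
(bundle / "config.json").write_text(json.dumps(config))

# Start the container directly with runc: no image pull, no daemon in the loop.
subprocess.run(["runc", "run", "--bundle", str(bundle), "demo-container"], check=True)
```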
And that was like the sort of very crude version of model, it's like now I can actually start containers very quickly, because it turns out like when you start a Docker container, like, first of all, like most Docker images are like several gigabytes, and like 99% of that is never going to be consumed, like there's a bunch of like, you know, like timezone information for like Uzbekistan, like no one's going to read it. And then there's a very high overlap between the files are going to be read, there's going to be like lib torch or whatever, like it's going to be read. So you can also cache it very well. So that was like the first sort of stuff we started working on was like, let's build this like container file system. And you know, coupled with like, you know, just using run C directly. And that actually enabled us to like, get to this point of like, you write code, and then you can launch it in the cloud within like a second or two, like something like that. And you know, there's been many optimizations since then, but that was sort of starting point.Alessio [00:12:33]: Can we talk about the developer experience as well, I think one of the magic things about Modal is at the very basic layers, like a Python function decorator, it's just like stub and whatnot. But then you also have a way to define a full container, what were kind of the design decisions that went into it? Where did you start? How easy did you want it to be? And then maybe how much complexity did you then add on to make sure that every use case fit?Erik [00:12:57]: I mean, Modal, I almost feel like it's like almost like two products kind of glued together. Like there's like the low level like container runtime, like file system, all that stuff like in Rust. And then there's like the Python SDK, right? Like how do you express applications? And I think, I mean, Swix, like I think your blog was like the self-provisioning runtime was like, to me, always like to sort of, for me, like an eye-opening thing. It's like, so I didn't think about like...Swyx [00:13:15]: You wrote your post four months before me. Yeah? The software 2.0, Infra 2.0. Yeah.Erik [00:13:19]: Well, I don't know, like convergence of minds. I guess we were like both thinking. Maybe you put, I think, better words than like, you know, maybe something I was like thinking about for a long time. Yeah.Swyx [00:13:29]: And I can tell you how I was thinking about it on my end, but I want to hear you say it.Erik [00:13:32]: Yeah, yeah, I would love to. So to me, like what I always wanted to build was like, I don't know, like, I don't know if you use like Pulumi. Like Pulumi is like nice, like in the sense, like it's like Pulumi is like you describe infrastructure in code, right? And to me, that was like so nice. Like finally I can like, you know, put a for loop that creates S3 buckets or whatever. And I think like Modal sort of goes one step further in the sense that like, what if you also put the app code inside the infrastructure code and like glue it all together and then like you only have one single place that defines everything and it's all programmable. You don't have any config files. Like Modal has like zero config. There's no config. It's all code. And so that was like the goal that I wanted, like part of that. And then the other part was like, I often find that so much of like my time was spent on like the plumbing between containers. 
And so my thing was like, well, if I just build this like Python SDK and make it possible to like bridge like different containers, just like a function call, like, and I can say, oh, this function runs in this container and this other function runs in this container and I can just call it just like a normal function, then, you know, I can build these applications that may span a lot of different environments. Maybe they fan out, start other containers, but it's all just like inside Python. You just like have this beautiful kind of nice like DSL almost for like, you know, how to control infrastructure in the cloud. So that was sort of like how we ended up with the Python SDK as it is, which is still evolving all the time, by the way. We keep changing syntax quite a lot because I think it's still somewhat exploratory, but we're starting to converge on something that feels like reasonably good now.Swyx [00:14:54]: Yeah. And along the way you, with this expressiveness, you enabled the ability to, for example, attach a GPU to a function. Totally.Erik [00:15:02]: Yeah. It's like you just like say, you know, on the function decorator, you're like GPU equals, you know, A100 and then or like GPU equals, you know, A10 or T4 or something like that. And then you get that GPU and like, you know, you just run the code and it runs like you don't have to, you know, go through hoops to, you know, start an EC2 instance or whatever.Swyx [00:15:18]: Yeah. So it's all code. Yeah. So one of the reasons I wrote Self-Provisioning Runtimes was I was working at AWS and we had AWS CDK, which is kind of like, you know, the Amazon basics Pulumi. Yeah, totally. And then, and then like it creates, it compiles the CloudFormation. Yeah. And then on the other side, you have to like get all the config stuff and then put it into your application code and make sure that they line up. So then you're writing code to define your infrastructure, then you're writing code to define your application. And I was just like, this is like obvious that it's going to converge, right? Yeah, totally.Erik [00:15:48]: But isn't there like, it might be wrong, but like, was it like SAM or Chalice or one of those? Like, isn't that like an AWS thing that where actually they kind of did that? I feel like there's like one.Swyx [00:15:57]: SAM. Yeah. Still very clunky. It's not, not as elegant as Modal.Erik [00:16:03]: I love AWS for like the stuff it's built, you know, like historically in order for me to like, you know, what it enables me to build, but like AWS is always like struggle with developer experience.Swyx [00:16:11]: I mean, they have to not break things.Erik [00:16:15]: Yeah. Yeah. And totally. And they have to build products for a very wide range of use cases. And I think that's hard.Swyx [00:16:21]: Yeah. Yeah. So it's, it's easier to design for. Yeah. So anyway, I was, I was pretty convinced that this, this would happen. I wrote, wrote that thing. And then, you know, imagine my surprise that you guys had it on your landing page at some point. I think, I think Akshat was just like, just throw that in there.Erik [00:16:34]: Did you trademark it?Swyx [00:16:35]: No, I didn't. But I definitely got sent a few pitch decks with my post on there and it was like really interesting. This is my first time like kind of putting a name to a phenomenon. And I think this is a useful skill for people to just communicate what they're trying to do.Erik [00:16:48]: Yeah. No, I think it's a beautiful concept.Swyx [00:16:50]: Yeah. Yeah. Yeah. 
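And to make the "bridge different containers with a plain function call" idea concrete, a second illustrative sketch (again ours, and again version-sensitive: .remote was .call in earlier SDK versions): two functions pinned to different images, one fanning out over the other.

```python
import modal

stub = modal.Stub("pipeline-sketch")

cpu_image = modal.Image.debian_slim().pip_install("beautifulsoup4")
gpu_image = modal.Image.debian_slim().pip_install("torch", "transformers")

@stub.function(image=cpu_image)
def scrape(url: str) -> str:
    import urllib.request
    return urllib.request.urlopen(url).read().decode()

@stub.function(image=gpu_image, gpu="A10G")
def summarize(text: str) -> str:
    # Placeholder for a real model call; runs in a different container than
    # scrape(), but is invoked like an ordinary Python function.
    return text[:200]

@stub.local_entrypoint()
def main():
    urls = ["https://example.com", "https://example.org"]
    pages = list(scrape.map(urls))     # fan out across many containers at once
    for page in pages:
        print(summarize.remote(page))  # cross-container call, plain function syntax
```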
But I mean, obviously you implemented it. What became more clear in your explanation today is that actually you're not that tied to Python.Erik [00:16:57]: No. I mean, I, I think that all the like lower level stuff is, you know, just running containers and like scheduling things and, you know, serving container data and stuff. So like one of the benefits of data teams is obviously like they're all like using Python, right? And so that made it a lot easier. I think, you know, if we had focused on other workloads, like, you know, for various reasons, we've like been kind of like half thinking about like CI or like things like that. But like, in a way that's like harder because like you also, then you have to be like, you know, multiple SDKs, whereas, you know, focusing on data teams, you can only, you know, Python like covers like 95% of all teams. That made it a lot easier. But like, I mean, like definitely like in the future, we're going to have others support, like supporting other languages. JavaScript for sure is the obvious next language. But you know, who knows, like, you know, Rust, Go, R, whatever, PHP, Haskell, I don't know.Swyx [00:17:42]: You know, I think for me, I actually am a person who like kind of liked the idea of programming language advancements being improvements in developer experience. But all I saw out of the academic sort of PLT type people is just type level improvements. And I always think like, for me, like one of the core reasons for self-provisioning runtimes and then why I like Modal is like, this is actually a productivity increase, right? Like, it's a language level thing, you know, you managed to stick it on top of an existing language, but it is your own language, a DSL on top of Python. And so language level increase on the order of like automatic memory management. You know, you could sort of make that analogy that like, maybe you lose some level of control, but most of the time you're okay with whatever Modal gives you. And like, that's fine. Yeah.Erik [00:18:26]: Yeah. Yeah. I mean, that's how I look at about it too. Like, you know, you look at developer productivity over the last number of decades, like, you know, it's come in like small increments of like, you know, dynamic typing or like is like one thing because not suddenly like for a lot of use cases, you don't need to care about type systems or better compiler technology or like, you know, the cloud or like, you know, relational databases. And, you know, I think, you know, you look at like that, you know, history, it's a steadily, you know, it's like, you know, you look at the developers have been getting like probably 10X more productive every decade for the last four decades or something that was kind of crazy. Like on an exponential scale, we're talking about 10X or is there a 10,000X like, you know, improvement in developer productivity. What we can build today, you know, is arguably like, you know, a fraction of the cost of what it took to build it in the eighties. Maybe it wasn't even possible in the eighties. So that to me, like, that's like so fascinating. I think it's going to keep going for the next few decades. Yeah.Alessio [00:19:14]: Yeah. Another big thing in the infra 2.0 wishlist was truly serverless infrastructure. The other on your landing page, you called them native cloud functions, something like that. I think the issue I've seen with serverless has always been people really wanted it to be stateful, even though stateless was much easier to do. 
And I think now with AI, most model inference is like stateless, you know, outside of the context. So that's kind of made it a lot easier to just put a model, like an AI model on Modal to run. How do you think about how that changes how people think about infrastructure too? Yeah.Erik [00:19:48]: I mean, I think Modal is definitely going in the direction of like doing more stateful things and working with data and like high IO use cases. I do think one like massive serendipitous thing that happened like halfway, you know, a year and a half into like the, you know, building Modal was like Gen AI started exploding and the IO pattern of Gen AI is like fits the serverless model like so well, because it's like, you know, you send this tiny piece of information, like a prompt, right, or something like that. And then like you have this GPU that does like trillions of flops, and then it sends back like a tiny piece of information, right. And that turns out to be something like, you know, if you can get serverless working with GPU, that just like works really well, right. So I think from that point of view, like serverless always to me felt like a little bit of like a solution looking for a problem. I don't actually like don't think like backend is like the problem that needs serverless, or like not as much. But I look at data and in particular, like things like Gen AI, like model inference, like it's like clearly a good fit. So I think that is, you know, to a large extent explains like why we saw, you know, the initial sort of like killer app for Modal being model inference, which actually wasn't like necessarily what we're focused on. But that's where we've seen like by far the most usage. Yeah.Swyx [00:20:52]: And this was before you started offering like fine tuning of language models, it was mostly stable diffusion. Yeah.Erik [00:20:59]: Yeah. I mean, like Modal, like I always built it to be a very general purpose compute platform, like something where you can run everything. And I used to call Modal like a better Kubernetes for data teams for a long time. What we realized was like, yeah, that's like, you know, a year and a half in, like we barely had any users or any revenue. And like we were like, well, maybe we should look at like some use case, trying to think of use case. And that was around the same time stable diffusion came out. And the beauty of Modal is like you can run almost anything on Modal, right? Like model inference turned out to be like the place where we found initially, well, like clearly this has like 10x like better ergonomics than anything else. But we're also like, you know, going back to my original vision, like we're thinking a lot about, you know, now, okay, now we do inference really well. Like what about training? What about fine tuning? What about, you know, end-to-end lifecycle deployment? What about data pre-processing? What about, you know, I don't know, real-time streaming? What about, you know, large data munging, like there's just data observability. I think there's so many things, like kind of going back to what I said about like redefining the data stack, like starting with the foundation of compute.
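The "tiny input, huge compute, tiny output" shape Erik describes is also why the load-weights-once-per-container pattern matters so much for serverless inference. A framework-agnostic sketch of that pattern, with a placeholder standing in for the actual model, is below; nothing here is Modal-specific:

```python
import time

class InferenceWorker:
    """One instance per container: pay the cold-start cost once,
    then every request is small-in / small-out while the heavy state
    (model weights on a GPU) stays resident in the container."""

    def __init__(self) -> None:
        # Stand-in for loading multi-GB weights onto the GPU at container start.
        time.sleep(0.1)
        self.model = lambda prompt: prompt[::-1]  # placeholder "model"

    def handle(self, prompt: str) -> str:
        # Tiny request in, tiny response out; the expensive compute happens here.
        return self.model(prompt)

if __name__ == "__main__":
    worker = InferenceWorker()          # cold start, once per container
    for p in ["hello", "serverless"]:   # many cheap requests afterwards
        print(worker.handle(p))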
Like one of the exciting things about model is like we've sort of, you know, we've been working on that for three years and it's maturing, but like this is so many things you can do like with just like a better compute primitive and also go up to stack and like do all this other stuff on top of it.Alessio [00:22:09]: How do you think about or rather like I would love to learn more about the underlying infrastructure and like how you make that happen because with fine tuning and training, it's a static memory. Like you exactly know what you're going to load in memory one and it's kind of like a set amount of compute versus inference, just like data is like very bursty. How do you make batches work with a serverless developer experience? You know, like what are like some fun technical challenge you solve to make sure you get max utilization on these GPUs? What we hear from people is like, we have GPUs, but we can really only get like, you know, 30, 40, 50% maybe utilization. What's some of the fun stuff you're working on to get a higher number there?Erik [00:22:48]: Yeah, I think on the inference side, like that's where we like, you know, like from a cost perspective, like utilization perspective, we've seen, you know, like very good numbers and in particular, like it's our ability to start containers and stop containers very quickly. And that means that we can auto scale extremely fast and scale down very quickly, which means like we can always adjust the sort of capacity, the number of GPUs running to the exact traffic volume. And so in many cases, like that actually leads to a sort of interesting thing where like we obviously run our things on like the public cloud, like AWS GCP, we run on Oracle, but in many cases, like users who do inference on those platforms or those clouds, even though we charge a slightly higher price per GPU hour, a lot of users like moving their large scale inference use cases to model, they end up saving a lot of money because we only charge for like with the time the GPU is actually running. And that's a hard problem, right? Like, you know, if you have to constantly adjust the number of machines, if you have to start containers, stop containers, like that's a very hard problem. Starting containers quickly is a very difficult thing. I mentioned we had to build our own file system for this. We also, you know, built our own container scheduler for that. We've implemented recently CPU memory checkpointing so we can take running containers and snapshot the entire CPU, like including registers and everything, and restore it from that point, which means we can restore it from an initialized state. We're looking at GPU checkpointing next, it's like a very interesting thing. So I think with inference stuff, that's where serverless really shines because you can drive, you know, you can push the frontier of latency versus utilization quite substantially, you know, which either ends up being a latency advantage or a cost advantage or both, right? On training, it's probably arguably like less of an advantage doing serverless, frankly, because you know, you can just like spin up a bunch of machines and try to satisfy, like, you know, train as much as you can on each machine. For that area, like we've seen, like, you know, arguably like less usage, like for modal, but there are always like some interesting use case. 
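The scale-to-traffic behavior Erik describes boils down to continuously recomputing how many containers are needed for the requests currently in flight, then starting or stopping containers to match. A toy version of that control-loop arithmetic is below; the real scheduler obviously also handles placement, cold starts, and billing granularity, so this is only meant to show the shape of the calculation:

```python
import math

def desired_containers(in_flight_requests: int,
                       concurrency_per_container: int,
                       min_containers: int = 0,
                       max_containers: int = 1000) -> int:
    """How many containers (and GPUs) should be running right now?"""
    if in_flight_requests <= 0:
        return min_containers
    needed = math.ceil(in_flight_requests / concurrency_per_container)
    return max(min_containers, min(needed, max_containers))

# Traffic spikes up and back down; capacity (and therefore cost) follows it.
for load in [0, 3, 40, 250, 12, 0]:
    print(load, "->", desired_containers(load, concurrency_per_container=8))
```

The faster containers can start and stop, the tighter this loop can track real traffic, which is where the latency-versus-utilization advantage comes from.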
Like we do have a couple of customers, like Ramp, for instance, like they do fine tuning with modal and they basically like one of the patterns they have is like very bursty type fine tuning where they fine tune 100 models in parallel. And that's like a separate thing that modal does really well, right? Like you can, we can start up 100 containers very quickly, run a fine tuning training job on each one of them that only runs for, I don't know, 10, 20 minutes. And then, you know, you can do hyperparameter tuning in that sense, like just pick the best model and things like that. So there are like interesting training use cases. I think when you get to like training, like very large foundational models, that's a use case we don't support super well, because that's very high IO, you know, you need to have like InfiniBand and all these things. And those are things we haven't supported yet and might take a while to get to that. So that's like probably like an area where like we're relatively weak in. Yeah.Alessio [00:25:12]: Have you cared at all about lower level model optimization? There's other cloud providers that do custom kernels to get better performance, or are you just, given that you're not just an AI compute company? Yeah.Erik [00:25:24]: I mean, I think like we want to support like a generic, like general workloads in a sense that like we want users to give us a container essentially, or code. And then we want to run that. So I think, you know, we benefit from those things in the sense that like we can tell our users, you know, to use those things. But I don't know if we want to like poke into users containers and like do those things automatically. That's sort of, I think a little bit tricky from the outside to do, because we want to be able to take like arbitrary code and execute it. But certainly like, you know, we can tell our users to like use those things. Yeah.Swyx [00:25:53]: I may have betrayed my own biases because I don't really think about modal as for data teams anymore. I think you started, I think you're much more for AI engineers. My favorite anecdote, which I think, you know, but I don't know if you directly experienced it. I went to the Vercel AI Accelerator, which you supported. And in the Vercel AI Accelerator, a bunch of startups gave like free credits and like signups and talks and all that stuff. The only ones that stuck are the ones that actually appealed to engineers. And the top usage, the top tool used by far was modal.Erik [00:26:24]: That's awesome.Swyx [00:26:25]: For people building with AI apps. Yeah.Erik [00:26:27]: I mean, it might be also like a terminology question, like the AI versus data, right? Like I've, you know, maybe I'm just like old and jaded, but like, I've seen so many like different titles, like for a while it was like, you know, I was a data scientist and a machine learning engineer and then, you know, there was like analytics engineers and there was like an AI engineer, you know? So like, to me, it's like, I just like in my head, that's to me just like, just data, like, or like engineer, you know, like I don't really, so that's why I've been like, you know, just calling it data teams. But like, of course, like, you know, AI is like, you know, like such a massive fraction of our like workloads.Swyx [00:26:59]: It's a different Venn diagram of things you do, right?
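As an aside on the Ramp-style bursty fine-tuning Erik mentions: the pattern is a fan-out over hyperparameter configs, then keep the best run. A local stand-in using a process pool is sketched below; on a serverless platform the same shape would typically be expressed as a mapped remote function call, with each config getting its own container and GPU. The loss function here is a synthetic placeholder, not a real training job:

```python
from concurrent.futures import ProcessPoolExecutor

def fine_tune(config: dict) -> tuple[float, dict]:
    # Stand-in for a 10-20 minute fine-tuning job; returns (validation loss, config).
    loss = abs(config["lr"] - 3e-4) + config["batch_size"] / 1e4
    return loss, config

configs = [{"lr": lr, "batch_size": bs}
           for lr in (1e-4, 3e-4, 1e-3)
           for bs in (16, 32, 64)]

if __name__ == "__main__":
    # Fan out one job per config; gather all results, then pick the winner.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(fine_tune, configs))
    best_loss, best_config = min(results, key=lambda r: r[0])
    print("best:", best_config, "loss:", round(best_loss, 5))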
So the stuff that you're talking about where you need like infinite bands for like highly parallel training, that's not, that's more of the ML engineer, that's more of the research scientist and less of the AI engineer, which is more sort of trying to put, work at the application.Erik [00:27:16]: Yeah. I mean, to be fair to it, like we have a lot of users that are like doing stuff that I don't think fits neatly into like AI. Like we have a lot of people using like modal for web scraping, like it's kind of nice. You can just like, you know, fire up like a hundred or a thousand containers running Chromium and just like render a bunch of webpages and it takes, you know, whatever. Or like, you know, protein folding is that, I mean, maybe that's, I don't know, like, but like, you know, we have a bunch of users doing that or, or like, you know, in terms of, in the realm of biotech, like sequence alignment, like people using, or like a couple of people using like modal to run like large, like mixed integer programming problems, like, you know, using Gurobi or like things like that. So video processing is another thing that keeps coming up, like, you know, let's say you have like petabytes of video and you want to just like transcode it, like, or you can fire up a lot of containers and just run FFmpeg or like, so there are those things too. Like, I mean, like that being said, like AI is by far our biggest use case, but you know, like, again, like modal is kind of general purpose in that sense.Swyx [00:28:08]: Yeah. Well, maybe I'll stick to the stable diffusion thing and then we'll move on to the other use cases for AI that you want to highlight. The other big player in my mind is replicate. Yeah. In this, in this era, they're much more, I guess, custom built for that purpose, whereas you're more general purpose. How do you position yourself with them? Are they just for like different audiences or are you just heads on competing?Erik [00:28:29]: I think there's like a tiny sliver of the Venn diagram where we're competitive. And then like 99% of the area we're not competitive. I mean, I think for people who, if you look at like front-end engineers, I think that's where like really they found good fit is like, you know, people who built some cool web app and they want some sort of AI capability and they just, you know, an off the shelf model is like perfect for them. That's like, I like use replicate. That's great. I think where we shine is like custom models or custom workflows, you know, running things at very large scale. We need to care about utilization, care about costs. You know, we have much lower prices because we spend a lot more time optimizing our infrastructure, you know, and that's where we're competitive, right? Like, you know, and you look at some of the use cases, like Suno is a big user, like they're running like large scale, like AI. Oh, we're talking with Mikey.Swyx [00:29:12]: Oh, that's great. Cool.Erik [00:29:14]: In a month. Yeah. So, I mean, they're, they're using model for like production infrastructure. Like they have their own like custom model, like custom code and custom weights, you know, for AI generated music, Suno.AI, you know, that, that, those are the types of use cases that we like, you know, things that are like very custom or like, it's like, you know, and those are the things like it's very hard to run and replicate, right? And that's fine. 
Like I think they, they focus on a very different part of the stack in that sense.Swyx [00:29:35]: And then the other company pattern that I pattern match you to is Modular. I don't know.Erik [00:29:40]: Because of the names?Swyx [00:29:41]: No, no. Wow. No, but yeah, yes, the name is very similar. I think there's something that might be insightful there from a linguistics point of view. Oh no, they have Mojo, the sort of Python SDK. And they have the Modular Inference Engine, which is their sort of their cloud stack, their sort of compute inference stack. I don't know if anyone's made that comparison to you before, but like I see you evolving a little bit in parallel there.Erik [00:30:01]: No, I mean, maybe. Yeah. Like it's not a company I'm like super like familiar, like, I mean, I know the basics, but like, I guess they're similar in the sense like they want to like do a lot of, you know, they have sort of big picture vision.Swyx [00:30:12]: Yes. They also want to build very general purpose. Yeah. So they're marketing themselves as like, if you want to do off the shelf stuff, go out, go somewhere else. If you want to do custom stuff, we're the best place to do it. Yeah. Yeah. There is some overlap there. There's not overlap in the sense that you are a closed source platform. People have to host their code on you. That's true. Whereas for them, they're very insistent on not running their own cloud service. They're a box software. Yeah. They're licensed software.Erik [00:30:37]: I'm sure their VCs at some point going to force them to reconsider. No, no.Swyx [00:30:40]: Chris is very, very insistent and very convincing. So anyway, I would just make that comparison, let people make the links if they want to. But it's an interesting way to see the cloud market develop from my point of view, because I came up in this field thinking cloud is one thing, and I think your vision is like something slightly different, and I see the different takes on it.Erik [00:31:00]: Yeah. And like one thing I've, you know, like I've written a bit about it in my blog too, it's like I think of us as like a second layer of cloud provider in the sense that like I think Snowflake is like kind of a good analogy. Like Snowflake, you know, is infrastructure as a service, right? But they actually run on the like major clouds, right? And I mean, like you can like analyze this very deeply, but like one of the things I always thought about is like, why does Snowflake arbitrarily like win over Redshift? And I think Snowflake, you know, to me, one, because like, I mean, in the end, like AWS makes all the money anyway, like and like Snowflake just had the ability to like focus on like developer experience or like, you know, user experience. And to me, like really proved that you can build a cloud provider, a layer up from, you know, the traditional like public clouds. And in that layer, that's also where I would put Modal, it's like, you know, we're building a cloud provider, like we're, you know, we're like a multi-tenant environment that runs the user code. But we're also building on top of the public cloud. So I think there's a lot of room in that space, I think is very sort of interesting direction.Alessio [00:31:55]: How do you think of that compared to the traditional past history, like, you know, you had AWS, then you had Heroku, then you had Render, Railway.Erik [00:32:04]: Yeah, I mean, I think those are all like great. I think the problem that they all faced was like the graduation problem, right? 
Like, you know, Heroku or like, I mean, like also like Heroku, there's like a counterfactual future of like, what would have happened if Salesforce didn't buy them, right? Like, that's a sort of separate thing. But like, I think what Heroku, I think always struggled with was like, eventually companies would get big enough that you couldn't really justify running in Heroku. So they would just go and like move it to, you know, whatever AWS or, you know, in particular. And you know, that's something that keeps me up at night too, like, what does that graduation risk like look like for modal? I always think like the only way to build a successful infrastructure company in the long run in the cloud today is you have to appeal to the entire spectrum, right? Or at least like the enterprise, like you have to capture the enterprise market. But the truly good companies capture the whole spectrum, right? Like I think of companies like, I don't like Datadog or Mongo or something that were like, they both captured like the hobbyists and acquire them, but also like, you know, have very large enterprise customers. I think that arguably was like where I, in my opinion, like Heroku struggle was like, how do you maintain the customers as they get more and more advanced? I don't know what the solution is, but I think there's, you know, that's something I would have thought deeply if I was at Heroku at that time.Alessio [00:33:14]: What's the AI graduation problem? Is it, I need to fine tune the model, I need better economics, any insights from customer discussions?Erik [00:33:22]: Yeah, I mean, better economics, certainly. But although like, I would say like, even for people who like, you know, needs like thousands of GPUs, just because we can drive utilization so much better, like we, there's actually like a cost advantage of staying on modal. But yeah, I mean, certainly like, you know, and like the fact that VCs like love, you know, throwing money at least used to, you know, add companies who need it to buy GPUs. I think that didn't help the problem. And in training, I think, you know, there's less software differentiation. So in training, I think there's certainly like better economics of like buying big clusters. But I mean, my hope it's going to change, right? Like I think, you know, we're still pretty early in the cycle of like building AI infrastructure. And I think a lot of these companies over in the long run, like, you know, they're, except it may be super big ones, like, you know, on Facebook and Google, they're always going to build their own ones. But like everyone else, like some extent, you know, I think they're better off like buying platforms. And, you know, someone's going to have to build those platforms.Swyx [00:34:12]: Yeah. Cool. Let's move on to language models and just specifically that workload just to flesh it out a little bit. You already said that RAMP is like fine tuning 100 models at once simultaneously on modal. Closer to home, my favorite example is ErikBot. Maybe you want to tell that story.Erik [00:34:30]: Yeah. I mean, it was a prototype thing we built for fun, but it's pretty cool. Like we basically built this thing that hooks up to Slack. It like downloads all the Slack history and, you know, fine-tunes a model based on a person. And then you can chat with that. And so you can like, you know, clone yourself and like talk to yourself on Slack. I mean, it's like nice like demo and it's just like, I think like it's like fully contained modal. 
Like there's a modal app that does everything, right? Like it downloads Slack, you know, integrates with the Slack API, like downloads the stuff, the data, like just runs the fine-tuning and then like creates like dynamically an inference endpoint. And it's all like self-contained and like, you know, a few hundred lines of code. So I think it's sort of a good kind of use case for, or like it kind of demonstrates a lot of the capabilities of modal.Alessio [00:35:08]: Yeah. On a more personal side, how close did you feel ErikBot was to you?Erik [00:35:13]: It definitely captured the like the language. Yeah. I mean, I don't know, like the content, I always feel this way about like AI and it's gotten better. Like when you look at like AI output of text, like, and it's like, when you glance at it, it's like, yeah, this seems really smart, you know, but then you actually like look a little bit deeper. It's like, what does this mean?Swyx [00:35:32]: What does this person say?Erik [00:35:33]: It's like kind of vacuous, right? And that's like kind of what I felt like, you know, talking to like my clone version, like it's like says like things like the grammar is correct. Like some of the sentences make a lot of sense, but like, what are you trying to say? Like there's no content here. I don't know. I mean, it's like, I got that feeling also with ChatGPT in the like early versions, right now it's like better, but.Alessio [00:35:51]: That's funny. So I built this thing called Smol Podcaster to automate a lot of our back office work, so to speak. And it's great at transcripts. It's great at doing chapters. And then I was like, okay, how about you come up with a short summary? And it's like, it sounds good, but it's like, it's not even the same ballpark as like what we end up writing, right. And it's hard to see how it's going to get there.Swyx [00:36:11]: Oh, I have ideas.Erik [00:36:13]: I'm certain it's going to get there, but like, I agree with you. Right. And like, I have the same thing. I don't know if you've read like AI generated books. Like they just like kind of seem funny, right? Like they're off, right? But like you glance at it and it's like, oh, it's kind of cool. Like looks correct, but then it's like very weird when you actually read them.Swyx [00:36:30]: Yeah. Well, so for what it's worth, I think anyone can join the modal slack. Is it open to the public? Yeah, totally.Erik [00:36:35]: If you go to modal.com, there's a button in the footer.Swyx [00:36:38]: Yeah. And then you can talk to Erik Bot. And then sometimes I really like pinging Erik Bot and then you answer afterwards, but then you're like, yeah, mostly correct or whatever. Any other broader lessons, you know, just broadening out from like the single use case of fine tuning, like what are you seeing people do with fine tuning or just language models on modal in general? Yeah.Erik [00:36:59]: I mean, I think language models is interesting because so many people get started with APIs and that's just, you know, they're just dominating a space, in particular OpenAI, right? And that's not necessarily like a place where we aim to compete. I mean, maybe at some point, but like, it's just not like a core focus for us. And I think sort of separately, it's sort of a question of like, there's economics in that long term. But like, so we tend to focus on more like the areas like around it, right? Like fine tuning, like another use case we have is a bunch of people, Ramp included, is doing batch embeddings on modal.
So let's say, you know, you have like a, actually we're like writing a blog post, like we take all of Wikipedia and like parallelize embeddings in 15 minutes and produce vectors for each article. So those types of use cases, I think modal suits really well for. I think also a lot of like custom inference, like yeah, I love that.Swyx [00:37:43]: Yeah. I think you should give people an idea of the order of magnitude of parallelism, because I think people don't understand how parallel. So like, I think your classic hello world with modal is like some kind of Fibonacci function, right? Yeah, we have a bunch of different ones. Some recursive function. Yeah.Erik [00:37:59]: Yeah. I mean, like, yeah, I mean, it's like pretty easy in modal, like fan out to like, you know, at least like 100 GPUs, like in a few seconds. And you know, if you give it like a couple of minutes, like we can, you know, you can fan out to like thousands of GPUs. Like we run at relatively large scale. And yeah, we've run, you know, many thousands of GPUs at certain points when we needed, you know, big backfills or some customers had very large compute needs.Swyx [00:38:21]: Yeah. Yeah. And I mean, that's super useful for a number of things. So one of my early interactions with modal as well was with Smol Developer, which is my sort of coding agent. The reason I chose modal was a number of things. One, I just wanted to try it out. I just had an excuse to try it. Akshat offered to onboard me personally. But the most interesting thing was that you could have that sort of local development experience as it was running on my laptop, but then it would seamlessly translate to a cloud service or like a cloud hosted environment. And then it could fan out with concurrency controls. So I could say like, because like, you know, the number of times I hit the GPT-3 API at the time was going to be subject to the rate limit. But I wanted to fan out without worrying about that kind of stuff. With modal, I can just kind of declare that in my config and that's it. Oh, like a concurrency limit?Erik [00:39:07]: Yeah. Yeah.Swyx [00:39:09]: Yeah. There's a lot of control. And that's why it's like, yeah, this is a pretty good use case for like writing this kind of LLM application code inside of this environment that just understands fan out and rate limiting natively. You don't actually have an exposed queue system, but you have it under the hood, you know, that kind of stuff. Totally.Erik [00:39:28]: It's a self-provisioning cloud.Swyx [00:39:30]: So the last part of modal I wanted to touch on, and obviously feel free, I know you're working on new features, was the sandbox that was introduced last year. And this is something that I think was inspired by Code Interpreter. You can tell me the longer history behind that.Erik [00:39:45]: Yeah. Like we originally built it for the use case, like there was a bunch of customers who looked into code generation applications and then they came to us and asked us, is there a safe way to execute code? And yeah, we spent a lot of time on like container security. We used gVisor, for instance, which is a Google product that provides pretty strong isolation of code. So we built a product where you can basically like run arbitrary code inside a container and monitor its output or like get it back in a safe way.
I mean, over time it's like evolved into more of like, I think the long-term direction is actually I think more interesting, which is that I think modal as a platform where like I think the core like container infrastructure we offer could actually be like, you know, unbundled from like the client SDK and offered to like other, you know, like we're talking to a couple of like other companies that want to run, you know, through their packages, like run, execute jobs on modal, like kind of programmatically. So that's actually the direction like Sandbox is going. It's like turning into more like a platform for platforms is kind of what I've been thinking about it as.Swyx [00:40:45]: Oh boy. Platform. That's the old Kubernetes line.Erik [00:40:48]: Yeah. Yeah. Yeah. But it's like, you know, like having that ability to like programmatically, you know, create containers and execute them, I think, I think is really cool. And I think it opens up a lot of interesting capabilities that are sort of separate from the like core Python SDK in modal. So I'm really excited about it. It's like one of those features that we kind of released and like, you know, then we kind of look at like what users actually build with it and people are starting to build like kind of crazy things. And then, you know, we double down on some of those things because when we see like, you know, potential new product features and so Sandbox, I think in that sense, it's like kind of in that direction. We found a lot of like interesting use cases in the direction of like platformized container runner.Swyx [00:41:27]: Can you be more specific about what you're doubling down on after seeing users in action?Erik [00:41:32]: I mean, we're working with like some companies that, I mean, without getting into specifics like that, need the ability to take their users' code and then launch containers on modal. And it's not about security necessarily, like they just want to use modal as a back end, right? Like they may already provide like Kubernetes as a back end, Lambda as a back end, and now they want to add modal as a back end, right? And so, you know, they need a way to programmatically define jobs on behalf of their users and execute them. And so, I don't know, that's kind of abstract, but does that make sense? I totally get it.Swyx [00:42:03]: It's sort of one level of recursion to sort of be the Modal for their customers.Erik [00:42:09]: Exactly.Swyx [00:42:10]: Yeah, exactly. And Cloudflare has done this, you know, Kenton Varda from Cloudflare, who's like the tech lead on this thing, called it sort of functions as a service as a service.Erik [00:42:17]: Yeah, that's exactly right. FaaSaaS.Swyx [00:42:21]: FaaSaaS. Yeah, like, I mean, like that, I think any base layer, second layer cloud provider like yourself, compute provider like yourself should provide, you know, it's a mark of maturity and success that people just trust you to do that. They'd rather build on top of you than compete with you. The more interesting thing for me is like, what does it mean to serve a computer like an LLM developer, rather than a human developer, right? Like, that's what a sandbox is to me, that you have to redefine modal to serve a different non-human audience.Erik [00:42:51]: Yeah. Yeah, and I think there's some really interesting people, you know, building very cool things.Swyx [00:42:55]: Yeah. So I don't have an answer, but, you know, I imagine things like, hey, the way you give feedback is different.
Maybe you have to like stream errors, log errors differently. I don't really know. Yeah. Obviously, there's like safety considerations. Maybe you have an API to like restrict access to the web. Yeah. I don't think anyone would use it, but it's there if you want it.Erik [00:43:17]: Yeah.Swyx [00:43:18]: Yeah. Any other sort of design considerations? I have no idea.Erik [00:43:21]: With sandboxes?Swyx [00:43:22]: Yeah. Yeah.Erik [00:43:24]: Open-ended question here. Yeah. I mean, no, I think, yeah, the network restrictions, I think, make a lot of sense. Yeah. I mean, I think, you know, long-term, like, I think there's a lot of interesting use cases where like the LLM, in itself, can like decide, I want to install these packages and like run this thing. And like, obviously, for a lot of those use cases, like you want to have some sort of control that it doesn't like install malicious stuff and steal your secrets and things like that. But I think that's what's exciting about the sandbox primitive, is like it lets you do that in a relatively safe way.Alessio [00:43:51]: Do you have any thoughts on the inference wars? A lot of providers are just rushing to the bottom to get the lowest price per million tokens. Some of them, you know, the Sean Randomat, they're just losing money and there's like the physics of it just don't work out for them to make any money on it. How do you think about your pricing and like how much premium you can get and you can kind of command versus using lower prices as kind of like a wedge into getting there, especially once you have model instrumented? What are the tradeoffs and any thoughts on strategies that work?Erik [00:44:23]: I mean, we focus more on like custom models and custom code. And I think in that space, there's like less competition and I think we can have a pricing markup, right? Like, you know, people will always compare our prices to like, you know, the GPU power they can get elsewhere. And so how big can that markup be? Like it never can be, you know, we can never charge like 10x more, but we can certainly charge a premium. And like, you know, for that reason, like we can have pretty good margins. The LLM space is like the opposite, like the switching cost of LLMs is zero. If all you're doing is like straight up, like at least like open source, right? Like if all you're doing is like, you know, using some, you know, inference endpoint that serves an open source model and, you know, some other provider comes along and like offers a lower price, you're just going to switch, right? So I don't know, to me that reminds me a lot of like all this like 15 minute delivery wars or like, you know, like Uber versus Lyft, you know, and like maybe going back even further, like I think a lot about like sort of, you know, flip side of this is like, it's actually a positive side, which is like, I thought a lot about like fiber optics boom of like 98, 99, like the other day, or like, you know, and also like the overinvestment in GPU today. Like, like, yeah, like, you know, I don't know, like in the end, like, I don't think VCs will have the return they expected, like, you know, in these things, but guess who's going to benefit, like, you know, is the consumers, like someone's like reaping the value of this. 
And that's, I think an amazing flip side is that, you know, we should be very grateful, the fact that like VCs want to subsidize these things, which is, you know, like you go back to fiber optics, like there was an extreme, like overinvestment in fiber optics network in like 98. And no one made money who did that. But consumers, you know, got tremendous benefits of all the fiber optics cables that were laid, you know, throughout the country in the decades after. I feel something similar about...

The top AI news from the past week, every ThursdAI

Holy SH*T, these two words have been said on this episode multiple times, way more than ever before I want to say, and it's because we got 2 incredibly exciting breaking news announcements in a very very short amount of time (in the span of 3 hours) and the OpenAI announcement came as we were recording the space, so you'll get to hear a live reaction of ours to this insanity. We also had 3 deep-dives, which I am posting on this week's episode, we chatted with Yi Tay and Max Bane from Reka, who trained and released a few new foundational multimodal models this week, and with Dome and Pablo from Stability who released a new diffusion model called Stable Cascade, and finally had a great time hanging with Swyx (from Latent Space) and finally got a chance to turn the microphone back on him, and had a conversation about Swyx's background, Latent Space, and AI Engineer. I was also very happy to be in SF today of all days, as my day is not over yet, there's still an event which we co-host together with A16Z, folks from Nous Research, Ollama and a bunch of other great folks, just look at all these logos! Open Source FTW

The Jim Rutt Show
EP 221 George Hotz on Open-Source Driving Assistance

The Jim Rutt Show

Play Episode Listen Later Feb 6, 2024 58:36


Jim talks with George Hotz about running Comma, an open-source driving assistance company. They discuss breaking the carrier lock on the iPhone at seventeen, Google's Project Zero, zero days, Mobileye & proprietary perception algorithms, cameras vs lidar, 6 levels of self-driving automation, the reliability of human driving, self-driving cars as "demo complete," why corner cases aren't the issue, integrated world models, the challenge of defining lane lines, recognizing the right part of the road, behavioral cloning, the hugging test, Comma's data set, the small offset simulator, how to install Comma in a car, what it does, why high-precision maps aren't useful, problems with Waymo's approach, "trackless monorails," why current systems still use remote-control driving, hyper-fragile centralized systems, Tesla's approach, against magical inflection points, self-driving as a stepping stone to artificial life, why Comma doesn't do marketing, the regulatory environment, eyes off vs hands off, why self-driving cars are easier than general robotics, liability, functional safety, the Tinygrad machine learning framework, who's using it, and much more. Episode Transcript Comma Tinygrad George Hotz is the founder of comma.ai and the tiny corp. He is working on self driving, robotics, and ML infrastructure with the goal of creating an operating system for silicon-stack life.

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
The "Normsky" architecture for AI coding agents — with Beyang Liu + Steve Yegge of SourceGraph

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Dec 14, 2023 79:37


We are running an end of year survey for our listeners. Let us know any feedback you have for us, what episodes resonated with you the most, and guest requests for 2024! RAG has emerged as one of the key pieces of the AI Engineer stack. Jerry from LlamaIndex called it a “hack”, Bryan from Hex compared it to “a recommendation system from LLMs”, and even LangChain started with it. RAG is crucial in any AI coding workflow. We talked about context quality for code in our Phind episode. Today's guests, Beyang Liu and Steve Yegge from SourceGraph, have been focused on code indexing and retrieval for over 15 years. We locked them in our new studio to record a 1.5 hour masterclass on the history of code search, retrieval interfaces for code, and how they get a SOTA 30% completion acceptance rate in their Cody product by being better at the “bin packing problem” of LLM context generation.
Google Grok → SourceGraph → Cody
While at Google in 2008, Steve built Grok, which lives on today as Google Kythe. It allowed engineers to do code parsing and searching across different codebases and programming languages. (You might remember this blog post from Steve's time at Google) Beyang was an intern at Google at the same time, and Grok became the inspiration to start SourceGraph in 2013. The two didn't know each other personally until Beyang brought Steve out of retirement 9 years later to join him as VP Engineering. Fast forward 10 years, SourceGraph has become the best code search tool out there and raised $223M along the way. Nine months ago, they open sourced SourceGraph Cody, their AI coding assistant. All their code indexing and search infrastructure allows them to get SOTA results by having better RAG than competitors:
* Code completions as you type that achieve an industry-best Completion Acceptance Rate (CAR) as high as 30% using a context-enhanced open-source LLM (StarCoder)
* Context-aware chat that provides the option of using GPT-4 Turbo, Claude 2, GPT-3.5 Turbo, Mixtral 8x7B, or Claude Instant, with more model integrations planned
* Doc and unit test generation, along with AI quick fixes for common coding errors
* AI-enhanced natural language code search, powered by a hybrid dense/sparse vector search engine
There are a few pieces of infrastructure that helped Cody achieve these results:
Dense-sparse vector retrieval system
For many people, RAG = vector similarity search, but there's a lot more that you can do to get the best possible results. From their release: "Sparse vector search" is a fancy name for keyword search that potentially incorporates LLMs for things like ranking and term expansion (e.g., "k8s" expands to "Kubernetes container orchestration", possibly weighted as in SPLADE):
* Dense vector retrieval makes use of embeddings, the internal representation that LLMs use to represent text. Dense vector retrieval provides recall over a broader set of results that may have no exact keyword matches but are still semantically similar.
* Sparse vector retrieval is very fast, human-understandable, and yields high recall of results that closely match the user query.
* We've found the approaches to be complementary.
There's a very good blog post by Pinecone on SPLADE for sparse vector search if you're interested in diving in. If you're building RAG applications in areas that have a lot of industry-specific nomenclature, acronyms, etc, this is a good approach to getting better results.
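To make the "complementary" point concrete, here is a toy hybrid retriever that fuses a keyword (sparse) ranking with an embedding (dense) ranking via reciprocal rank fusion. The hashing-based "embedding" and the word-overlap scorer are deliberate stand-ins so the example runs with no dependencies; a real system would use BM25 or SPLADE on the sparse side and a learned embedding model on the dense side, and this is not Sourcegraph's actual implementation:

```python
from collections import Counter
import hashlib
import math

DOCS = {
    "d1": "k8s pod scheduling and container orchestration",
    "d2": "kubernetes container orchestration on bare metal",
    "d3": "recipe for sourdough bread",
}

def sparse_scores(query: str) -> dict[str, float]:
    # Keyword overlap as a stand-in for BM25 / SPLADE term matching.
    q = Counter(query.lower().split())
    return {doc_id: float(sum(min(q[t], Counter(text.lower().split())[t]) for t in q))
            for doc_id, text in DOCS.items()}

def embed(text: str, dim: int = 16) -> list[float]:
    # Toy "embedding": hash tokens into a fixed-size vector (stand-in for a real model).
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dense_scores(query: str) -> dict[str, float]:
    q = embed(query)
    return {doc_id: sum(a * b for a, b in zip(q, embed(text)))
            for doc_id, text in DOCS.items()}

def reciprocal_rank_fusion(*rankings: dict[str, float], k: int = 60) -> dict[str, float]:
    # Combine rankings by rank position rather than raw score, so the two
    # retrieval methods don't need to be calibrated against each other.
    fused: dict[str, float] = {}
    for scores in rankings:
        ordered = sorted(scores.items(), key=lambda x: -x[1])
        for rank, (doc_id, _) in enumerate(ordered, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused

query = "k8s container orchestration"
fused = reciprocal_rank_fusion(sparse_scores(query), dense_scores(query))
print(sorted(fused.items(), key=lambda x: -x[1]))
```

The sparse side nails exact terms like "k8s", the dense side pulls in "kubernetes" even though the query never says the word, and the fusion step lets each contribute without hand-tuning score weights.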
SCIP
In 2016, Microsoft announced the Language Server Protocol (LSP) and the Language Server Index Format (LSIF). This protocol makes it easy for IDEs to get all the context they need from a codebase to get things like file search, references, “go to definition”, etc. SourceGraph developed SCIP, “a better code indexing format than LSIF”:
* Simpler and More Efficient Format: SCIP utilizes Protobuf instead of JSON, which is used by LSIF. Protobuf is more space-efficient, simpler, and more suitable for systems programming.
* Better Performance and Smaller Index Sizes: SCIP indexers, such as scip-clang, show enhanced performance and reduced index file sizes compared to LSIF indexers (10%-20% smaller).
* Easier to Develop and Debug: SCIP's design, centered around human-readable string IDs for symbols, makes it faster and more straightforward to develop new language indexers.
Having more efficient indexing is key to more performant RAG on code.
Show Notes
* Sourcegraph
* Cody
* Copilot vs Cody
* Steve's Stanford seminar on Grok
* Steve's blog
* Grab
* Fireworks
* Peter Norvig
* Noam Chomsky
* Code search
* Kelly Norton
* Zoekt
* v0.dev
See also our past episodes on Cursor, Phind, Codeium and Codium as well as the GitHub Copilot keynote at AI Engineer Summit.
Timestamps
* [00:00:00] Intros & Backgrounds
* [00:05:20] How Steve's work on Grok inspired SourceGraph for Beyang
* [00:08:10] What's Cody?
* [00:11:22] Comparison of coding assistants and the capabilities of Cody
* [00:16:00] The importance of context (RAG) in AI coding tools
* [00:21:33] The debate between Chomsky and Norvig approaches in AI
* [00:30:06] Normsky: the Norvig + Chomsky models collision
* [00:36:00] The death of the DSL?
* [00:40:00] LSP, SCIP, Kythe, BFG, and all that fun stuff
* [00:53:00] The SourceGraph internal stack
* [00:58:46] Building on open source models
* [01:02:00] SourceGraph for engineering managers?
* [01:12:00] Lightning Round
Transcript
Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:16]Swyx: Hey, and today we're christening our new podcast studio in the Newton, and we have Beyang and Steve from Sourcegraph. Welcome. [00:00:25]Beyang: Hey, thanks for having us. [00:00:26]Swyx: So this has been a long time coming. I'm very excited to have you. We also are just celebrating the one year anniversary of ChatGPT yesterday, but also we'll be talking about the GA of Cody later on today. We'll just do quick intros of both of you. Obviously, people can research you and check the show notes for more. Beyang, you worked in computer vision at Stanford and then you worked at Palantir. I did, yeah. You also interned at Google. [00:00:48]Beyang: I did, back in the day where I got to use Steve's system, DevTool. [00:00:53]Swyx: Right. What was it called? [00:00:55]Beyang: It was called Grok. Well, the end user thing was Google Code Search. That's what everyone called it, or just like CS. But the brains of it were really the kind of like Trigram index and then Grok, which provided the reference graph. [00:01:07]Steve: Today it's called Kythe, the open source Google one. It's sort of like Grok v3. [00:01:11]Swyx: On your podcast, which you've had me on, you've interviewed a bunch of other code search developers, including the current developer of Kythe, right? [00:01:19]Beyang: No, we didn't have any Kythe people on, although we would love to if they're up for it. We had Kelly Norton, who built a similar system at Etsy, it's an open source project called Hound.
We also had Han-Wen Nienhuys, who created Zoekt, which is, I think, heavily inspired by the Trigram index that powered Google's original code search and that we also now use at Sourcegraph. Yeah. [00:01:45]Swyx: So you teamed up with Quinn over 10 years ago to start Sourcegraph and you were indexing all code on the internet. And now you're in a perfect spot to create a code intelligence startup. Yeah, yeah. [00:01:56]Beyang: I guess the backstory was, I used Google Code Search while I was an intern. And then after I left that internship and worked elsewhere, it was the single dev tool that I missed the most. I felt like my job was just a lot more tedious and much more of a hassle without it. And so when Quinn and I started working together at Palantir, he had also used various code search engines in open source over the years. And it was just a pain point that we both felt, both working on code at Palantir and also working within Palantir's clients, which were a lot of Fortune 500 companies, large financial institutions, folks like that. And if anything, the pains they felt in dealing with large complex code bases made our pain points feel small by comparison. So that was really the impetus for starting Sourcegraph. [00:02:42]Swyx: Yeah, excellent. Steve, you famously worked at Amazon. And you've told many, many stories. I want every single listener of Latent Space to check out Steve's YouTube because he effectively had a podcast that you didn't tell anyone about or something. You just hit record and just went on a few rants. I'm always here for your Stevie rants. And then you moved to Google, where you also had some interesting thoughts on just the overall Google culture versus Amazon. You joined Grab as head of eng for a couple of years. I'm from Singapore, so I have actually personally used a lot of Grab's features. And it was very interesting to see you talk so highly of Grab's engineering and sort of overall prospects. [00:03:21]Steve: Because as a customer, it sucked? [00:03:22]Swyx: Yeah, no, it's just like, being from a smaller country, you never see anyone from our home country being on a global stage or talked about as a startup that people admire or look up to, like on the league that you, with all your legendary experience, would consider equivalent. Yeah. [00:03:41]Steve: Yeah, no, absolutely. They actually, they didn't even know that they were as good as they were, in a sense. They started hiring a bunch of people from Silicon Valley to come in and sort of like fix it. And we came in and we were like, Oh, we could have been a little better operational excellence and stuff. But by and large, they're really sharp. The only thing about Grab is that they get criticized a lot for being too westernized. Oh, by who? By Singaporeans who don't want to work there. [00:04:06]Swyx: Okay. I guess I'm biased because I'm here, but I don't see that as a problem. If anything, they've had their success because they were more westernized than the Sanders Singaporean tech company. [00:04:15]Steve: I mean, they had their success because they are laser focused. They copy to Amazon. I mean, they're executing really, really, really well for a giant. I was on a slack with 2,500 engineers. It was like this giant waterfall that you could dip your toe into. You'd never catch up. Actually, the AI summarizers would have been really helpful there. But yeah, no, I think Grab is successful because they're just out there with their sleeves rolled up, just making it happen. 
[00:04:43]Swyx: And for those who don't know, it's not just like Uber of Southeast Asia, it's also a super app. PayPal Plus. [00:04:48]Steve: Yeah. [00:04:49]Swyx: In the way that super apps don't exist in the West. It's one of the enduring mysteries of B2C that super apps work in the East and don't work in the West. We just don't understand it. [00:04:57]Beyang: Yeah. [00:04:58]Steve: It's just kind of curious. They didn't work in India either. And it was primarily because of bandwidth reasons and smaller phones. [00:05:03]Swyx: That should change now. It should. [00:05:05]Steve: And maybe we'll see a super app here. [00:05:08]Swyx: You retired-ish? I did. You retired-ish on your own video game? Mm-hmm. Any fun stories about that? And that's also where you discovered some need for code search, right? Mm-hmm. [00:05:16]Steve: Sure. A need for a lot of stuff. Better programming languages, better databases. Better everything. I mean, I started in like 95, right? Where there was kind of nothing. Yeah. Yeah. [00:05:24]Beyang: I just want to say, I remember when you first went to Grab because you wrote that blog post talking about why you were excited about it, about like the expanding Asian market. And our reaction was like, oh, man, how did we miss stealing it with you? [00:05:36]Swyx: Hiring you. [00:05:37]Beyang: Yeah. [00:05:38]Steve: I was like, miss that. [00:05:39]Swyx: Tell that story. So how did this happen? Right? So you were inspired by Grok. [00:05:44]Beyang: I guess the backstory from my point of view is I had used code search and Grok while at Google, but I didn't actually know that it was connected to you, Steve. I knew you from your blog posts, which were always excellent, kind of like inside, very thoughtful takes from an engineer's perspective on some of the challenges facing tech companies and tech culture and that sort of thing. But my first introduction to you within the context of code intelligence, code understanding was I watched a talk that you gave, I think at Stanford, about Grok when you're first building it. And that was very eye opening. I was like, oh, like that guy, like the guy who, you know, writes the extremely thoughtful ranty like blog posts also built that system. And so that's how I knew, you know, you were involved in that. And then, you know, we always wanted to hire you, but never knew quite how to approach you or, you know, get that conversation started. [00:06:34]Steve: Well, we got introduced by Max, right? Yeah. It was temporal. Yeah. Yeah. I mean, it was a no brainer. They called me up and I had noticed when Sourcegraph had come out. Of course, when they first came out, I had this dagger of jealousy stabbed through me piercingly, which I remember because I am not a jealous person by any means, ever. But boy, I was like, but I was kind of busy, right? And just one thing led to another. I got sucked back into the ads vortex and whatever. So thank God Sourcegraph actually kind of rescued me. [00:07:05]Swyx: Here's a chance to build DevTools. Yeah. [00:07:08]Steve: That's the best. DevTools are the best. [00:07:10]Swyx: Cool. Well, so that's the overall intro. I guess we can get into Cody. Is there anything else that like people should know about you before we get started? [00:07:18]Steve: I mean, everybody knows I'm a musician. I can juggle five balls. [00:07:24]Swyx: Five is good. Five is good. I've only ever managed three. [00:07:27]Steve: Five is hard. Yeah. And six, a little bit. [00:07:30]Swyx: Wow. [00:07:31]Beyang: That's impressive. 
[00:07:32]Alessio: So yeah, to jump into Sourcegraph, this has been a company 10 years in the making. And as Sean said, now you're at the right place. Phase two. Now, exactly. You spent 10 years collecting all this code, indexing, making it easy to surface it. Yeah. [00:07:47]Swyx: And also learning how to work with enterprises and having them trust you with their code bases. Yeah. [00:07:52]Alessio: Because initially you were only doing on-prem, right? Like a lot of like VPC deployments. [00:07:55]Beyang: So in the very early days, we're cloud only. But the first major customers we landed were all on-prem, self-hosted. And that was, I think, related to the nature of the problem that we're solving, which becomes just like a critical, unignorable pain point once you're above like 100 devs or so. [00:08:11]Alessio: Yeah. And now Cody is going to be GA by the time this releases. So congrats to your future self for launching this in two weeks. Can you give a quick overview of just what Cody is? I think everybody understands that it's a AI coding agent, but a lot of companies say they have a AI coding agent. So yeah, what does Cody do? How do people interface with it? [00:08:32]Beyang: Yeah. So how is it different from the like several dozen other AI coding agents that exist in the market now? When we thought about building a coding assistant that would do things like code generation and question answering about your code base, I think we came at it from the perspective of, you know, we've spent the past decade building the world's best code understanding engine for human developers, right? So like it's kind of your guide as a human dev if you want to go and dive into a large complex code base. And so our intuition was that a lot of the context that we're providing to human developers would also be useful context for AI developers to consume. And so in terms of the feature set, Cody is very similar to a lot of other assistants. It does inline autocompletion. It does code base aware chat. It does specific commands that automate, you know, tasks that you might rather not want to do like generating unit tests or adding detailed documentation. But we think the core differentiator is really the quality of the context, which is hard to kind of describe succinctly. It's a bit like saying, you know, what's the difference between Google and Alta Vista? There's not like a quick checkbox list of features that you can rattle off, but it really just comes down to all the attention and detail that we've paid to making that context work well and be high quality and fast for human devs. We're now kind of plugging into the AI coding assistant as well. Yeah. [00:09:53]Steve: I mean, just to add my own perspective on to what Beyang just described, RAG is kind of like a consultant that the LLM has available, right, that knows about your code. RAG provides basically a bridge to a lookup system for the LLM, right? Whereas fine tuning would be more like on the job training for somebody. If the LLM is a person, you know, and you send them to a new job and you do on the job training, that's what fine tuning is like, right? So tuned to our specific task. You're always going to need that expert, even if you get the on the job training, because the expert knows your particular code base, your task, right? That expert has to know your code. And there's a chicken and egg problem because, right, you know, we're like, well, I'm going to ask the LLM about my code, but first I have to explain it, right? 
It's this chicken and egg problem. That's where RAG comes in. And we have the best consultants, right? The best assistant who knows your code. And so when you sit down with Cody, right, what Beyang said earlier about going to Google and using code search and then starting to feel like without it, his job was super tedious. Once you start using these, do you guys use coding assistants? [00:10:53]Swyx: Yeah, right. [00:10:54]Steve: I mean, like we're getting to the point very quickly, right? Where you feel like almost like you're programming without the internet, right? Or something, you know, it's like you're programming back in the nineties without the coding assistant. Yeah. Hopefully that helps for people who have like no idea about coding systems, what they are. [00:11:09]Swyx: Yeah. [00:11:10]Alessio: I mean, going back to using them, we had a lot of them on the podcast already. We had Cursor, we have Codeium and Codium, very similar names. [00:11:18]Swyx: Yeah. Phind, and then of course there's Copilot. [00:11:22]Alessio: You had a Copilot versus Cody blog post, and I think it really shows the context improvement. So you had two examples that stuck with me. One was, what does this application do? And the Copilot answer was like, oh, it uses JavaScript and NPM and this. And it's like, but that's not what it does. You know, that's what it's built with. Versus Cody was like, oh, these are like the major functions. And like, these are the functionalities and things like that. And then the other one was, how do I start this up? And Copilot just said NPM start, even though there was like no start command in the package JSON, but you know, most do, right? Most projects use NPM start. So maybe this does too. How do you think about open source models? Because Copilot has their own private thing. And I think you guys use Starcoder, if I remember right. Yeah, that's correct. [00:12:09]Beyang: I think Copilot uses some variant of Codex. They're kind of cagey about it. I don't think they've like officially announced what model they use. [00:12:16]Swyx: And I think they use a range of models based on what you're doing. Yeah. [00:12:19]Beyang: So everyone uses a range of models. Like no one uses the same model for like inline completion versus like chat because the latency requirements for. Oh, okay. Well, there's fill in the middle. There's also like what the model's trained on. So like we actually had completions powered by Claude Instant for a while. But you had to kind of like prompt hack your way to get it to output just the code and not like, hey, you know, here's the code you asked for, like that sort of text. So like everyone uses a range of models. We've kind of designed Cody to be like especially model, not agnostic, but like pluggable. So one of our kind of design considerations was like as the ecosystem evolves, we want to be able to integrate the best in class models, whether they're proprietary or open source into Cody because the pace of innovation in the space is just so quick. And I think that's been to our advantage. Like today, Cody uses Starcoder for inline completions. And with the benefit of the context that we provide, we actually show comparable completion acceptance rate metrics. It's kind of like the standard metric that folks use to evaluate inline completion quality. It's like if I show you a completion, what's the chance that you actually accept the completion versus you reject it? And so we're at par with Copilot, which is at the head of that industry right now.
And we've been able to do that with the Starcoder model, which is open source and the benefit of the context fetching stuff that we provide. And of course, a lot of like prompt engineering and other stuff along the way. [00:13:40]Alessio: And Steve, you wrote a post called Cheating Is All You Need about what you're building. And one of the points you made is that everybody's fighting on the same axis, which is better UI and the IDE, maybe like a better chat response. But data moats are kind of the most important thing. And you guys have like a 10-year-old moat with all the data you've been collecting. How do you kind of think about what other companies are doing wrong, right? Like, why is nobody doing this in terms of like really focusing on RAG? I feel like you see so many people. Oh, we just got a new model. It's like, it beats HumanEval. And it's like, well, but maybe like that's not what we should really be doing, you know? Like, do you think most people underestimate the importance of like the actual RAG in code? [00:14:21]Steve: I think that people weren't doing it much. It wasn't. It's kind of at the edges of AI. It's not in the center. I know that when ChatGPT launched, so within the last year, I've heard a lot of rumblings from inside of Google, right? Because they're undergoing a huge transformation to try to, you know, of course, get into the new world. And I heard that they told, you know, a bunch of teams to go and train their own models or fine tune their own models, right? [00:14:43]Swyx: Both. [00:14:43]Steve: And, you know, it was a s**t show. Nobody knew how to do it. They launched two coding assistants. One was called Codey, with an E-Y. And then there was, I don't know what happened in that one. And then there's Duet, right? Google loves to compete with themselves, right? They do this all the time. And they had a paper on Duet like from a year ago. And they were doing exactly what Copilot was doing, which was just pulling in the local context, right? But fundamentally, I thought of this because we were talking about the splitting of the [00:15:10]Swyx: models. [00:15:10]Steve: In the early days, it was the LLM did everything. And then we realized that for certain use cases, like completions, that a different, smaller, faster model would be better. And that fragmentation of models, actually, we expected to continue and proliferate, right? Because we are fundamentally, we're a recommender engine right now. Yeah, we're recommending code to the LLM. We're saying, may I interest you in this code right here so that you can answer my question? [00:15:34]Swyx: Yeah? [00:15:34]Steve: And being good at recommender engine, I mean, who are the best recommenders, right? There's YouTube and Spotify and, you know, Amazon or whatever, right? Yeah. [00:15:41]Swyx: Yeah. [00:15:41]Steve: And they all have many, many, many, many, many models, right? For all fine-tuned for very specific, you know. And that's where we're heading in code, too. Absolutely. [00:15:50]Swyx: Yeah. [00:15:50]Alessio: We just did an episode we released on Wednesday, which we said RAG is like RecSys for LLMs. You're basically just suggesting good content. [00:15:58]Swyx: It's like what? Recommendations. [00:15:59]Beyang: Recommendations. [00:16:00]Alessio: Oh, got it. [00:16:01]Steve: Yeah, yeah, yeah. [00:16:02]Swyx: So like the naive implementation of RAG is you embed everything, throw it in a vector database, you embed your query, and then you find the nearest neighbors, and that's your RAG.
But actually, you need to rank it. And actually, you need to make sure there's sample diversity and that kind of stuff. And then you're like slowly gradient descending yourself towards rediscovering proper RecSys, which has been traditional ML for a long time. But like approaching it from an LLM perspective. Yeah. [00:16:24]Beyang: I almost think of it as like a generalized search problem because it's a lot of the same things. Like you want your layer one to have high recall and get all the potential things that could be relevant. And then there's typically like a layer two re-ranking mechanism that bumps up the precision and tries to get the relevant stuff to the top of the results list. [00:16:43]Swyx: Have you discovered that ranking matters a lot? Oh, yeah. So the context is that I think a lot of research shows that like one, context utilization matters based on model. Like GPT uses the top of the context window, and then apparently Claude uses the bottom better. And it's lossy in the middle. Yeah. So ranking matters. No, it really does. [00:17:01]Beyang: The skill with which models are able to take advantage of context is always going to be dependent on how that factors into the impact on the training loss. [00:17:10]Swyx: Right? [00:17:10]Beyang: So like if you want long context window models to work well, then you have to have a ton of data where it's like, here's like a billion lines of text. And I'm going to ask a question about like something that's like, you know, embedded deeply into it and like, give me the right answer. And unless you have that training set, then of course, you're going to have variability in terms of like where it attends to. And in most kind of like naturally occurring data, the thing that you're talking about right now, the thing I'm asking you about is going to be something that we talked about recently. [00:17:36]Swyx: Yeah. [00:17:36]Steve: Did you really just say gradient descending yourself? Actually, I love that it's entered the casual lexicon. Yeah, yeah, yeah. [00:17:44]Swyx: My favorite version of that is, you know, how we have to p-hack papers. So, you know, when you throw humans at the problem, that's called graduate student descent. That's great. It's really awesome. [00:17:54]Alessio: I think the other interesting thing that you have is this inline assist UX that I wouldn't say async, but like it works while you can also do work. So you can ask Cody to make changes on a code block and you can still edit the same file at the same time. [00:18:07]Swyx: Yeah. [00:18:07]Alessio: How do you see that in the future? Like, do you see a lot of Codys running together at the same time? Like, how do you validate also that they're not messing each other up as they make changes in the code? And maybe what are the limitations today? And what do you think about where the attack is going? [00:18:21]Steve: I want to start with a little history and then I'm going to turn it over to Beyang, all right? So we actually had this feature in the very first launch back in June. Dominic wrote it. It was called nonstop Cody. And you could have multiple, basically, LLM requests in parallel modifying your source [00:18:37]Swyx: file. [00:18:37]Steve: And he wrote a bunch of code to handle all of the diffing logic. And you could see the regions of code that the LLM was going to change, right? And he was showing me demos of it. And it just felt like it was just a little before its time, you know?
But a bunch of that stuff, that scaffolding was able to be reused for where we're inline [00:18:56]Swyx: sitting today. [00:18:56]Steve: How would you characterize it today? [00:18:58]Beyang: Yeah, so that interface has really evolved from a, like, hey, general purpose, like, request anything inline in the code and have the code update to really, like, targeted features, like, you know, fix the bug that exists at this line or request a very specific [00:19:13]Swyx: change. [00:19:13]Beyang: And the reason for that is, I think, the challenge that we ran into with inline fixes, and we do want to get to the point where you could just fire and forget and have, you know, half a dozen of these running in parallel. But I think we ran into the challenge early on that a lot of people are running into now when they're trying to construct agents, which is the reliability of, you know, working code generation is just not quite there yet in today's language models. And so that kind of constrains you to an interaction where the human is always, like, in the inner loop, like, checking the output of each response. And if you want that to work in a way where you can be asynchronous, you kind of have to constrain it to a domain where today's language models can generate reliable code well enough. So, you know, generating unit tests, that's, like, a well-constrained problem. Or fixing a bug that shows up as, like, a compiler error or a test error, that's a well-constrained problem. But the more general, like, hey, write me this class that does X, Y, and Z using the libraries that I have, that is not quite there yet, even with the benefit of really good context. Like, it definitely moves the needle a lot, but we're not quite there yet to the point where you can just fire and forget. And I actually think that this is something that people don't broadly appreciate yet, because I think that, like, everyone's chasing this dream of agentic execution. And if we're to really define that down, I think it implies a couple things. You have, like, a multi-step process where each step is fully automated. We don't have to have a human in the loop every time. And there's also kind of like an LM call at each stage or nearly every stage in that [00:20:45]Swyx: chain. [00:20:45]Beyang: Based on all the work that we've done, you know, with the inline interactions, with kind of like general Codyfeatures for implementing longer chains of thought, we're actually a little bit more bearish than the average, you know, AI hypefluencer out there on the feasibility of agents with purely kind of like transformer-based models. To your original question, like, the inline interactions with CODI, we actually constrained it to be more targeted, like, you know, fix the current error or make this quick fix. I think that that does differentiate us from a lot of the other tools on the market, because a lot of people are going after this, like, shnazzy, like, inline edit interaction, whereas I think where we've moved, and this is based on the user feedback that we've gotten, it's like that sort of thing, it demos well, but when you're actually coding day to day, you don't want to have, like, a long chat conversation inline with the code base. That's a waste of time. You'd rather just have it write the right thing and then move on with your life or not have to think about it. And that's what we're trying to work towards. [00:21:37]Steve: I mean, yeah, we're not going in the agent direction, right? 
I mean, I'll believe in agents when somebody shows me one that works. Yeah. Instead, we're working on, you know, sort of solidifying our strength, which is bringing the right context in. So new context sources, ways for you to plug in your own context, ways for you to control or influence the context, you know, the mixing that happens before the request goes out, etc. And there's just so much low-hanging fruit left in that space that, you know, agents seems like a little bit of a boondoggle. [00:22:03]Beyang: Just to dive into that a little bit further, like, I think, you know, at a very high level, what do people mean when they say agents? They really mean, like, greater automation, fully automated, like, the dream is, like, here's an issue, go implement that. And I don't have to think about it as a human. And I think we are working towards that. Like, that is the eventual goal. I think it's specifically the approach of, like, hey, can we have a transformer-based LM alone be the kind of, like, backbone or the orchestrator of these agentic flows? Where we're a little bit more bearish today. [00:22:31]Swyx: You want the human in the loop. [00:22:32]Beyang: I mean, you kind of have to. It's just a reality of the behavior of language models that are purely, like, transformer-based. And I think that's just like a reflection of reality. And I don't think people realize that yet. Because if you look at the way that a lot of other AI tools have implemented context fetching, for instance, like, you see this in the Copilot approach, where if you use, like, the at-workspace thing that supposedly provides, like, code-based level context, it has, like, an agentic approach where you kind of look at how it's behaving. And it feels like they're making multiple requests to the LM being like, what would you do in this case? Would you search for stuff? What sort of files would you gather? Go and read those files. And it's like a multi-hop step, so it takes a long while. It's also non-deterministic. Because any sort of, like, LM invocation, it's like a dice roll. And then at the end of the day, the context it fetches is not that good. Whereas our approach is just like, OK, let's do some code searches that make sense. And then maybe, like, crawl through the reference graph a little bit. That is fast. That doesn't require any sort of LM invocation at all. And we can pull in much better context, you know, very quickly. So it's faster. [00:23:37]Swyx: It's more reliable. [00:23:37]Beyang: It's deterministic. And it yields better context quality. And so that's what we think. We just don't think you should cargo cult or naively go like, you know, agents are the [00:23:46]Swyx: future. [00:23:46]Beyang: Let's just try to, like, implement agents on top of the LM that exists today. I think there are a couple of other technologies or approaches that need to be refined first before we can get into these kind of, like, multi-stage, fully automated workflows. [00:24:00]Swyx: It makes sense. You know, we're very much focused on developer inner loop right now. But you do see things eventually moving towards developer outer loop. Yeah. So would you basically say that they're tackling the agent's problem that you don't want to tackle? [00:24:11]Beyang: No, I would say at a high level, we are after maybe, like, the same high level problem, which is like, hey, I want some code written. I want to develop some software and can automate a system. Go build that software for me. I think the approaches might be different. 
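To make the retrieval approach Beyang describes concrete, here is a minimal Python sketch of deterministic, two-stage context fetching: a high-recall first pass, a re-ranking pass, and a one-hop reference-graph expansion, with no LLM call anywhere in the loop. All names and data structures here are invented for illustration; this is not Sourcegraph's actual implementation.

```python
# Sketch of deterministic context fetching: recall -> rerank -> graph expansion.
# Every step is a plain lookup or sort, which is why it is fast and repeatable.
from dataclasses import dataclass

@dataclass
class CodeChunk:
    path: str
    symbol: str
    text: str

def recall(query: str, corpus: list[CodeChunk], k: int = 50) -> list[CodeChunk]:
    """Layer 1: cheap, high-recall lexical match over the code base."""
    terms = set(query.lower().split())
    scored = [(sum(t in c.text.lower() for t in terms), c) for c in corpus]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]

def rerank(query: str, candidates: list[CodeChunk], k: int = 8) -> list[CodeChunk]:
    """Layer 2: a more precise scorer; a real system might use embeddings here."""
    def score(c: CodeChunk) -> float:
        q = query.lower()
        return c.text.lower().count(q) + (2.0 if q in c.symbol.lower() else 0.0)
    return sorted(candidates, key=score, reverse=True)[:k]

def expand(chunks: list[CodeChunk],
           reference_graph: dict[str, list[CodeChunk]]) -> list[CodeChunk]:
    """Crawl one hop outward along the references of the top hits."""
    out = list(chunks)
    for c in chunks:
        out.extend(reference_graph.get(c.symbol, []))
    return out

def build_context(query: str, corpus, reference_graph) -> str:
    top = rerank(query, recall(query, corpus))
    return "\n\n".join(c.text for c in expand(top, reference_graph))
```

The point of the sketch is only that every step is an ordinary, deterministic operation, which is what makes it fast and predictable compared with asking a model to decide what to fetch.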
So I think the analogy in my mind is, I think about, like, the AI chess players. Coding, in some senses, I mean, it's similar and dissimilar to chess. I think one question I ask is, like, do you think producing code is more difficult than playing chess or less difficult than playing chess? More. [00:24:41]Swyx: I think more. [00:24:41]Beyang: Right. And if you look at the best AI chess players, like, yes, you can use an LLM to play chess. Like, people have showed demos where it's like, oh, like, yeah, GPT-4 is actually a pretty decent, like, chess move suggester. Right. But you would never build, like, a best in class chess player off of GPT-4 alone. [00:24:57]Swyx: Right. [00:24:57]Beyang: Like, the way that people design chess players is that you have kind of like a search space and then you have a way to explore that search space efficiently. There's a bunch of search algorithms, essentially. We were doing tree search in various ways. And you can have heuristic functions, which might be powered by an LLM. [00:25:12]Swyx: Right. [00:25:12]Beyang: Like, you might use an LLM to generate proposals in that space that you can efficiently explore. But the backbone is still this kind of more formalized tree search based approach rather than the LLM itself. And so I think my high level intuition is that, like, the way that we get to more reliable multi-step workflows that do things beyond, you know, generate unit tests, it's really going to be like a search based approach where you use an LLM as kind of like an advisor or a proposal function, sort of your heuristic function, like the A* search algorithm. But it's probably not going to be the thing that is the backbone, because I guess it's not the right tool for that. Yeah. [00:25:50]Swyx: I can see yourself kind of thinking through this, but not saying the words, the sort of philosophical Peter Norvig type discussion. Maybe you want to sort of introduce that in software. Yeah, definitely. [00:25:59]Beyang: So your listeners are savvy. They're probably familiar with the classic like Chomsky versus Norvig debate. [00:26:04]Swyx: No, actually, I wanted, I was prompting you to introduce that. Oh, got it. [00:26:08]Beyang: So, I mean, if you look at the history of artificial intelligence, right, you know, it goes way back to, I don't know, it's probably as old as modern computers, like 50s, 60s, 70s. People are debating on like, what is the path to producing a sort of like general human level of intelligence? And kind of two schools of thought that emerged. One is the Norvig school of thought, which roughly speaking includes large language models, you know, regression, SVMs, basically any model that you kind of like learn from data. And it's like data driven. Most of machine learning would fall under this umbrella. And that school of thought says like, you know, just learn from the data. That's the approach to reaching intelligence. And then the Chomsky approach is more things like compilers and parsers and formal systems. So basically like, let's think very carefully about how to construct a formal, precise system. And that will be the approach to how we build a truly intelligent system. I think Lisp was invented so that you could create like rules-based systems that you would call AI. As a language. Yeah. And for a long time, there was like this debate, like there's certain like AI research labs that were more like, you know, in the Chomsky camp and others that were more in the Norvig camp. It's a debate that rages on today.
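Beyang's chess analogy above — a classical search backbone where the LLM acts only as a proposal or heuristic function, as in A* — can be sketched in a few lines. In the toy Python below, `llm_propose` and `llm_score` are stand-ins for model calls, not a real API; everything else is ordinary best-first search.

```python
# Toy sketch: the search loop stays in control, the model only proposes and scores.
import heapq

def llm_propose(state: str) -> list[str]:
    """Placeholder for asking a model to suggest a few candidate next steps."""
    return [state + f"+step{i}" for i in range(3)]

def llm_score(state: str, goal: str) -> float:
    """Placeholder heuristic, like h(n) in A*; lower is better."""
    return abs(len(goal) - len(state))

def solve(start: str, goal: str, is_done, max_expansions: int = 100):
    frontier = [(llm_score(start, goal), start)]
    seen = {start}
    for _ in range(max_expansions):
        if not frontier:
            break
        _, state = heapq.heappop(frontier)
        if is_done(state):
            return state
        for nxt in llm_propose(state):  # proposals come from the model...
            if nxt not in seen:
                seen.add(nxt)
                # ...but expansion order is decided by the search, not the model
                heapq.heappush(frontier, (llm_score(nxt, goal), nxt))
    return None
```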
And I feel like the consensus right now is that, you know, Norvig definitely has the upper hand right now with the advent of LMs and diffusion models and all the other recent progress in machine learning. But the Chomsky-based stuff is still really useful in my view. I mean, it's like parsers, compilers, basically a lot of the stuff that provides really good context. It provides kind of like the knowledge graph backbone that you want to explore with your AI dev tool. Like that will come from kind of like Chomsky-based tools like compilers and parsers. It's a lot of what we've invested in in the past decade at Sourcegraph and what you built with Grok. Basically like these formal systems that construct these very precise knowledge graphs that are great context providers and great kind of guard rails enforcers and kind of like safety checkers for the output of a more kind of like data-driven, fuzzier system that uses like the Norvig-based models. [00:28:03]Steve: Beyang was talking about this stuff like it happened in the middle ages. Like, okay, so when I was in college, I was in college learning Lisp and Prolog and planning and all the deterministic Chomsky approaches to AI. And I was there when Norvig basically declared it dead. I was there 3,000 years ago when Norvig and Chomsky fought on the volcano. When did he declare it dead? [00:28:26]Swyx: What do you mean he declared it dead? [00:28:27]Steve: It was like late 90s. [00:28:29]Swyx: Yeah. [00:28:29]Steve: When I went to Google, Peter Norvig was already there. He had basically like, I forget exactly where. It was some, he's got so many famous short posts, you know, amazing. [00:28:38]Swyx: He had a famous talk, The Unreasonable Effectiveness of Data. Yeah. [00:28:41]Steve: Maybe that was it. But at some point, basically, he basically convinced everybody that deterministic approaches had failed and that heuristic-based, you know, data-driven statistical approaches, stochastic were better. [00:28:52]Swyx: Yeah. [00:28:52]Steve: The primary reason I can tell you this, because I was there, was that, was that, well, the steam-powered engine, no. The reason was that the deterministic stuff didn't scale. [00:29:06]Swyx: Yeah. Right. [00:29:06]Steve: They're using Prolog, man, constraint systems and stuff like that. Well, that was a long time ago, right? Today, actually, these Chomsky-style systems do scale. And that's, in fact, exactly what Sourcegraph has built. Yeah. And so we have a very unique, I love the framing that Beyang's made, that the marriage of the Chomsky and the Norvig, you know, sort of models, you know, conceptual models, because we, you know, we have both of them and they're both really important. And in fact, there, there's this really interesting, like, kind of overlap between them, right? Where like the AI or our graph or our search engine could potentially provide the right context for any given query, which is, of course, why ranking is important. But what we've really signed ourselves up for is an extraordinary amount of testing. [00:29:45]Swyx: Yeah. [00:29:45]Steve: Because, Swyx, you were saying that, you know, GPT-4 tends to the front of the context window and maybe other LLMs to the back and maybe, maybe the LLM in the middle. [00:29:53]Swyx: Yeah.
[00:29:53]Steve: And so that means that, you know, if we're actually like, you know, verifying whether we, you know, some change we've made has improved things, we're going to have to test putting it at the beginning of the window and at the end of the window, you know, and maybe make the right decision based on the LLM that you've chosen. Which some of our competitors, that's a problem that they don't have, but we meet you, you know, where you are. Yeah. And we're, just to finish, we're writing tens of thousands. We're generating tests, you know, fill in the middle type tests and things. And then using our graph to basically sort of fine tune Cody's behavior there. [00:30:20]Swyx: Yeah. [00:30:21]Beyang: I also want to add, like, I have like an internal pet name for this, like kind of hybrid architecture that I'm trying to make catch on. Maybe I'll just say it here. Just saying it publicly kind of makes it more real. But like, I call the architecture that we've developed the Normsky architecture. [00:30:36]Swyx: Yeah. [00:30:36]Beyang: I mean, it's obviously a portmanteau of Norvig and Chomsky, but the acronym, it stands for non-agentic, rapid, multi-source code intelligence. So non-agentic because... Rolls right off the tongue. And Normsky. But it's non-agentic in the sense that like, we're not trying to like pitch you on kind of like agent hype, right? Like it's the things it does are really just developer tools developers have been using for decades now, like parsers and really good search indexes and things like that. Rapid because we place an emphasis on speed. We don't want to sit there waiting for kind of like multiple LLM requests to return to complete a simple user request. Multi-source because we're thinking broadly about what pieces of information and knowledge are useful context. So obviously starting with things that you can search in your code base, and then you add in the reference graph, which kind of like allows you to crawl outward from those initial results. But then even beyond that, you know, sources of information, like there's a lot of knowledge that's embedded in docs, in PRDs or product specs, in your production logging system, in your chat, in your Slack channel, right? Like there's so much context is embedded there. And when you're a human developer, and you're trying to like be productive in your code base, you're going to go to all these different systems to collect the context that you need to figure out what code you need to write. And I don't think the AI developer will be any different. It will need to pull context from all these different sources. So we're thinking broadly about how to integrate these into Codi. We hope through kind of like an open protocol that like others can extend and implement. And this is something else that should be accessible by December 14th in kind of like a preview stage. But that's really about like broadening this notion of the code graph beyond your Git repository to all the other sources where technical knowledge and valuable context can live. [00:32:21]Steve: Yeah, it becomes an artifact graph, right? It can link into your logs and your wikis and any data source, right? [00:32:27]Alessio: How do you guys think about the importance of, it's almost like data pre-processing in a way, which is bring it all together, tie it together, make it ready. Any thoughts on how to actually make that good? Some of the innovation you guys have made. [00:32:40]Steve: We talk a lot about the context fetching, right? 
I mean, there's a lot of ways you could answer this question. But, you know, we've spent a lot of time just in this podcast here talking about context fetching. But stuffing the context into the window is, you know, the bin packing problem, right? Because the window is not big enough, and you've got more context than you can fit. You've got a ranker maybe. But what is that context? Is it a function that was returned by an embedding or a graph call or something? Do you need the whole function? Or do you just need, you know, the top part of the function, this expression here, right? You know, so that art, the golf game of trying to, you know, get each piece of context down into its smallest state, possibly even summarized by another model, right, before it even goes to the LLM, becomes this is the game that we're in, yeah? And so, you know, recursive summarization and all the other techniques that you got to use to like stuff stuff into that context window become, you know, critically important. And you have to test them across every configuration of models that you could possibly need. [00:33:32]Beyang: I think data preprocessing is probably the like unsexy, way underappreciated secret to a lot of the cool stuff that people are shipping today. Whether you're doing like RAG or fine tuning or pre-training, like the preprocessing step matters so much because it's basically garbage in, garbage out, right? Like if you're feeding in garbage to the model, then it's going to output garbage. Concretely, you know, for code RAG, if you're not doing some sort of like preprocessing that takes advantage of a parser and is able to like extract the key components of a particular file of code, you know, separate the function signature from the body, from the doc string, what are you even doing? Like that's like table stakes. It opens up so much more possibilities with which you can kind of like tune your system to take advantage of the signals that come from those different parts of the code. Like we've had a tool, you know, since computers were invented that understands the structure of source code to a hundred percent precision. The compiler knows everything there is to know about the code in terms of like structure. Like why would you not want to use that in a system that's trying to generate code, answer questions about code? You shouldn't throw that out the window just because now we have really good, you know, data-driven models that can do other things. [00:34:44]Steve: Yeah. When I called it a data moat, you know, in my cheating post, a lot of people were confused, you know, because data moat sort of sounds like data lake because there's data and water and stuff. I don't know. And so they thought that we were sitting on this giant mountain of data that we had collected, but that's not what our data moat is. It's really a data pre-processing engine that can very quickly and scalably, like basically dissect your entire code base in a very small, fine-grained, you know, semantic unit and then serve it up. Yeah. And so it's really, it's not a data moat. It's a data pre-processing moat, I guess. [00:35:15]Beyang: Yeah. If anything, we're like hypersensitive to customer data privacy requirements. So it's not like we've taken a bunch of private data and like, you know, trained a generally available model. In fact, exactly the opposite. A lot of our customers are choosing Cody over Copilot and other competitors because we have an explicit guarantee that we don't do any of that. 
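As a toy illustration of the parser-aware preprocessing Beyang describes above — splitting code into signature, docstring, and body rather than naive fixed-size text chunks — here is a sketch using Python's standard `ast` module. Real systems do this across many languages with compiler-grade tooling; this only shows the shape of the idea, and the chunk schema is invented.

```python
# Minimal parser-aware chunking for code RAG (Python 3.9+ for ast.unparse).
import ast

def chunk_functions(source: str) -> list[dict]:
    """Split each function into signature / docstring / body chunks."""
    chunks = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = ", ".join(a.arg for a in node.args.args)  # ignores defaults/kwargs
            chunks.append({
                "symbol": node.name,
                "signature": f"def {node.name}({params})",  # high-signal, cheap to index
                "docstring": ast.get_docstring(node) or "",  # natural-language intent
                "body": "\n".join(ast.unparse(stmt) for stmt in node.body),  # full detail
            })
    return chunks

if __name__ == "__main__":
    sample = 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
    print(chunk_functions(sample))
```

Keeping signatures, docstrings, and bodies as separate units is what lets a retrieval layer weight them differently instead of treating a source file as one undifferentiated blob of text.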
And that we've done that from day one. Yeah. I think that's a very real concern in today's day and age, because like if your proprietary IP finds its way into the training set of any model, it's very easy both to like extract that knowledge from the model and also use it to, you know, build systems that kind of work on top of the institutional knowledge that you've built up. [00:35:52]Alessio: About a year ago, I wrote a post on LLMs for developers. And one of the points I had was maybe the depth of like the DSL. I spent most of my career writing Ruby and I love Ruby. It's so nice to use, but you know, it's not as performant, but it's really easy to read, right? And then you look at other languages, maybe they're faster, but like they're more verbose, you know? And when you think about efficiency of the context window, that actually matters. [00:36:15]Swyx: Yeah. [00:36:15]Alessio: But I haven't really seen a DSL for models, you know? I haven't seen like code being optimized to like be easier to put in a model context. And it seems like your pre-processing is kind of doing that. Do you see in the future, like the way we think about the DSL and APIs and kind of like service interfaces be more focused on being context friendly, where it's like maybe it's harder to read for the human, but like the human is never going to write it anyway. We were talking on the Hacks podcast. There are like some data science things like spin up the spandex, like humans are never going to write again because the models can just do very easily. Yeah, curious to hear your thoughts. [00:36:51]Steve: Well, so DSLs, they involve, you know, writing a grammar and a parser and they're like little languages, right? We do them that way because, you know, we need them to compile and humans need to be able to read them and so on. The LLMs don't need that level of structure. You can throw any pile of crap at them, you know, more or less unstructured and they'll deal with it. So I think that's why a DSL hasn't emerged for sort of like communicating with the LLM or packaging up the context or anything. Maybe it will at some point, right? We've got, you know, tagging of context and things like that that are sort of peeking into DSL territory, right? But your point on do users, you know, do people have to learn DSLs like regular expressions or, you know, pick your favorite, right? XPath. I think you're absolutely right that the LLMs are really, really good at that. And I think you're going to see a lot less of people having to slave away learning these things. They just have to know the broad capabilities and the LLM will take care of the rest. [00:37:42]Swyx: Yeah, I'd agree with that. [00:37:43]Beyang: I think basically like the value prop of a DSL is that it makes it easier to work with a lower level language, but at the expense of introducing an abstraction layer. And in many cases today, you know, without the benefit of AI code generation, like that's totally worth it, right? With the benefit of AI code generation, I mean, I don't think all DSLs will go away. I think there's still, you know, places where that trade-off is going to be worthwhile. But it's kind of like how much of source code do you think is going to be generated through natural language prompting in the future? Because in a way, like any programming language is just a DSL on top of assembly, right?
And so if people can do that, then yeah, like maybe for a large portion of the code [00:38:21]Swyx: that's written, [00:38:21]Beyang: people don't actually have to understand the DSL that is Ruby or Python or basically any other programming language that exists. [00:38:28]Steve: I mean, seriously, do you guys ever write SQL queries now without using a model of some sort? At least a draft. [00:38:34]Swyx: Yeah, right. [00:38:36]Steve: And so we have kind of like, you know, past that bridge, right? [00:38:39]Alessio: Yeah, I think like to me, the long-term thing is like, is there ever going to be, you don't actually see the code, you know? It's like, hey, the basic thing is like, hey, I need a function to sum two numbers and that's it. I don't need you to generate the code. [00:38:53]Steve: And the following question, do you need the engineer or the paycheck? [00:38:56]Swyx: I mean, right? [00:38:58]Alessio: That's kind of the agent's discussion in a way where like you cannot automate the agents, but like slowly you're getting more of the atomic units of the work kind of like done. I kind of think of it as like, you know, [00:39:09]Beyang: do you need a punch card operator to answer that for you? And so like, I think we're still going to have people in the role of a software engineer, but the portion of time they spend on these kinds of like low-level, tedious tasks versus the higher level, more creative tasks is going to shift. [00:39:23]Steve: No, I haven't used punch cards. [00:39:25]Swyx: Yeah, I've been talking about like, so we kind of made this podcast about the sort of rise of the AI engineer. And like the first step is the AI enhanced engineer. That is that software developer that is no longer doing these routine, boilerplate-y type tasks, because they're just enhanced by tools like yours. So you mentioned OpenCodeGraph. I mean, that is a kind of DSL maybe, and because we're releasing this as you go GA, you hope for other people to take advantage of that? [00:39:52]Beyang: Oh yeah, I would say so OpenCodeGraph is not a DSL. It's more of a protocol. It's basically like, hey, if you want to make your system, whether it's, you know, chat or logging or whatever accessible to an AI developer tool like Cody, here's kind of like the schema by which you can provide that context and offer hints. So I would, you know, comparisons like LSP obviously did this for kind of like standard code intelligence. It's kind of like a lingua franca for providing find references and go-to-definition. There's kind of like analogs to that. There might be also analogs to kind of the original OpenAI, kind of like plugins, API. There's all this like context out there that might be useful for an LM-based system to consume. And so at a high level, what we're trying to do is define a common language for context providers to provide context to other tools in the software development lifecycle. Yeah. Do you have any critiques of LSP, by the way, [00:40:42]Swyx: since like this is very much, very close to home? [00:40:45]Steve: One of the authors wrote a really good critique recently. Yeah. I don't think I saw that. Yeah, yeah. LSP could have been better. It just came out a couple of weeks ago. It was a good article. [00:40:54]Beyang: Yeah. I think LSP is great. Like for what it did for the developer ecosystem, it was absolutely fantastic. Like nowadays, like it's much easier now to get code navigation up and running in a bunch of editors by speaking this protocol.
I think maybe the interesting question is like looking at the different design decisions comparing LSP basically with Kythe. Because Kythe has more of a... How would you describe it? [00:41:18]Steve: A storage format. [00:41:20]Beyang: I think the critique of LSP from a Kythe point of view would be like with LSP, you don't actually have an actual symbolic model of the code. It's not like LSP models like, hey, this function calls this other function. LSP is all like range-based. Like, hey, your cursor's at line 32, column 1. [00:41:35]Swyx: Yeah. [00:41:35]Beyang: And that's the thing you feed into the language server. And then it's like, okay, here's the range that you should jump to if you click on that range. So it kind of is intentionally ignorant of the fact that there's a thing called a reference underneath your cursor, and that's linked to a symbol definition. [00:41:49]Steve: Well, actually, that's the worst example you could have used. You're right. But that's the one thing that it actually did bake in is following references. [00:41:56]Swyx: Sure. [00:41:56]Steve: But it's sort of hardwired. [00:41:58]Swyx: Yeah. [00:41:58]Steve: Whereas Kythe attempts to model [00:42:00]Beyang: like all these things explicitly. [00:42:02]Swyx: And so... [00:42:02]Steve: Well, so LSP is a protocol, right? And so Google's internal protocol is gRPC-based. And it's a different approach than LSP. It's basically you make a heavy query to the back end, and you get a lot of data back, and then you render the whole page, you know? So we've looked at LSP, and we think that it's a little long in the tooth, right? I mean, it's a great protocol, lots and lots of support for it. But we need to push into the domain of exposing the intelligence through the protocol. Yeah. [00:42:29]Beyang: And so I would say we've developed a protocol of our own called Skip, which is at a very high level trying to take some of the good ideas from LSP and from Kythe and merge that into a system that in the near term is useful for Sourcegraph, but I think in the long term, we hope will be useful for the ecosystem. Okay, so here's what LSP did well. LSP, by virtue of being like intentionally dumb, dumb in air quotes, because I'm not like ragging on it, allowed language servers developers to kind of like bypass the hard problem of like modeling language semantics precisely. So like if all you want to do is jump to definition, you don't have to come up with like a universally unique naming scheme for each symbol, which is actually quite challenging because you have to think about like, okay, what's the top scope of this name? Is it the source code repository? Is it the package? Does it depend on like what package server you're fetching this from? Like whether it's the public one or the one inside your... Anyways, like naming is hard, right? And by just going from kind of like a location to location based approach, you basically just like throw that out the window. All I care about is jumping definition, just make that work. And you can make that work without having to deal with like all the complex global naming things. The limitation of that approach is that it's harder to build on top of that to build like a true knowledge graph. Like if you actually want a system that says like, okay, here's the web of functions and here's how they reference each other. And I want to incorporate that like semantic model of how the code operates or how the code relates to each other at like a static level. 
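A small, invented illustration of the contrast being drawn here: a position-based (LSP-style) lookup keyed on file, line, and column versus an explicit symbol graph that models definitions and references directly. Neither structure below is the actual LSP or Kythe schema; they are just toy data structures to show the difference.

```python
# Position-based lookup: (path, line, col) in, one target range out.
# The "symbol" never exists as an object you can query.
definition_at = {
    ("app.py", 32, 1): ("util.py", 10, 0),
}

# Symbol-based graph: definitions and references are modeled explicitly,
# so symbol-level questions are a single lookup.
symbol_graph = {
    "util.parse_config": {
        "defined_at": ("util.py", 10, 0),
        "referenced_at": [("app.py", 32, 1), ("cli.py", 7, 4)],
    },
}

def lsp_style_jump(path: str, line: int, col: int):
    """Range in, range out; no notion of a named symbol."""
    return definition_at.get((path, line, col))

def graph_style_references(symbol: str):
    """Answer 'where is this defined and who uses it?' directly."""
    node = symbol_graph[symbol]
    return node["defined_at"], node["referenced_at"]
```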
You can't do that with LSP because you have to deal with line ranges. And like concretely the pain point that we found in using LSP for Sourcegraph is like in order to do like a find references [00:44:04]Swyx: and then jump definitions, [00:44:04]Beyang: it's like a multi-hop process because like you have to jump to the range and then you have to find the symbol at that range. And it just adds a lot of latency and complexity of these operations where as a human, you're like, well, this thing clearly references this other thing. Why can't you just jump me to that? And I think that's the thing that Kythe does well. But then I think the issue that Kythe has had with adoption is because it is a more sophisticated schema, I think. And so there's basically more things that you have to implement to get like a Kythe implementation up and running. I hope I'm not like, correct me if I'm wrong about any of this. [00:44:35]Steve: 100%, 100%. Kythe also has a problem, all these systems have the problem, even Skip, or at least the way that we implemented the indexers, that they have to integrate with your build system in order to build that knowledge graph, right? Because you have to basically compile the code in a special mode to generate artifacts instead of binaries. And I would say, by the way, earlier I was saying that XREFs were in LSP, but it's actually, I was thinking of LSP plus LSIF. [00:44:58]Swyx: Yeah. That's another. [00:45:01]Steve: Which is actually bad. We can say that it's bad, right? [00:45:04]Steve: It's like Skip or Kythe, it's supposed to be sort of a model serialization, you know, for the code graph, but it basically just does what LSP needs, the bare minimum. LSIF is basically if you took LSP [00:45:16]Beyang: and turned that into a serialization format. So like you build an index for language servers to kind of like quickly bootstrap from cold start. But it's a graph model [00:45:23]Steve: with all of the inconvenience of the API without an actual graph. And so, yeah. [00:45:29]Beyang: So like one of the things that we try to do with Skip is try to capture the best of both worlds. So like make it easy to write an indexer, make the schema simple, but also model some of the more symbolic characteristics of the code that would allow us to essentially construct this knowledge graph that we can then make useful for both the human developer through Sourcegraph and through the AI developer through Cody. [00:45:49]Steve: So anyway, just to finish off the graph comment, we've got a new graph, yeah, that's Skip-based. We call it BFG internally, right? It's a beautiful something graph. A big friendly graph. [00:46:00]Swyx: A big friendly graph. [00:46:01]Beyang: It's a blazing fast. [00:46:02]Steve: Blazing fast. [00:46:03]Swyx: Blazing fast graph. [00:46:04]Steve: And it is blazing fast, actually. It's really, really interesting. I should probably have to do a blog post about it to walk you through exactly how they're doing it. Oh, please. But it's a very AI-like iterative, you know, experimentation sort of approach. We're building a code graph based on all of our 10 years of knowledge about building code graphs, yeah? But we're building it quickly with zero configuration, and it doesn't have to integrate with your build. And through some magic tricks that we have. And so what just happens when you install the plugin, that it'll be there and indexing your code and providing that knowledge graph in the background without all that build system integration.
This is a bit of secret sauce that we haven't really like advertised it very much lately. But I am super excited about it because what they do is they say, all right, you know, let's tackle function parameters today. Cody's not doing a very good job of completing function call arguments or function parameters in the definition, right? Yeah, we generate those thousands of tests, and then we can actually reuse those tests for the AI context as well. So fortunately, things are kind of converging on, we have, you know, half a dozen really, really good context sources, and we mix them all together. So anyway, BFG, you're going to hear more about it probably in the holidays? [00:47:12]Beyang: I think it'll be online for December 14th. We'll probably mention it. BFG is probably not the public name we're going to go with. I think we might call it like Graph Context or something like that. [00:47:20]Steve: We're officially calling it BFG. [00:47:22]Swyx: You heard it here first. [00:47:24]Beyang: BFG is just kind of like the working name. And so the impetus for BFG was like, if you look at like current AI inline code completion tools and the errors that they make, a lot of the errors that they make, even in kind of like the easy, like single line case, are essentially like type errors, right? Like you're trying to complete a function call and it suggests a variable that you defined earlier, but that variable is the wrong type. [00:47:47]Swyx: And that's the sort of thing [00:47:47]Beyang: where it's like a first year, like freshman CS student would not make that error, right? So like, why does the AI make that error? And the reason is, I mean, the AI is just suggesting things that are plausible without the context of the types or any other like broader files in the code. And so the kind of intuition here is like, why don't we just do the basic thing that like any baseline intelligent human developer would do, which is like click jump to definition, click some fine references and pull in that like Graph Context into the context window and then have it generate the completion. So like that's sort of like the MVP of what BFG was. And turns out that works really well. Like you can eliminate a lot of type errors that AI coding tools make just by pulling in that context. Yeah, but the graph is definitely [00:48:32]Steve: our Chomsky side. [00:48:33]Swyx: Yeah, exactly. [00:48:34]Beyang: So like this like Chomsky-Norvig thing, I think pops up in a bunch of differ
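As a rough sketch of the completion-context idea described above — do what a developer would do and pull in the definitions of in-scope symbols before asking for a completion, so type-level mistakes become less likely — here is a hypothetical prompt-assembly helper. The index, symbol names, and prompt format are all made up for illustration and are not the actual BFG or Cody internals.

```python
# Hypothetical definition-aware prompt assembly for inline completion.

def lookup_definition(symbol: str) -> str:
    """Stand-in for a code-graph query; a real system resolves this precisely."""
    fake_index = {
        "UserId": "class UserId(int): ...",
        "fetch_user": "def fetch_user(user_id: UserId) -> 'User': ...",
    }
    return fake_index.get(symbol, "")

def build_completion_prompt(file_prefix: str, symbols_in_scope: list[str]) -> str:
    # Prepend the definitions the model would otherwise have to guess at.
    context = "\n".join(d for s in symbols_in_scope if (d := lookup_definition(s)))
    return (
        "# Relevant definitions (from the code graph):\n"
        f"{context}\n\n"
        "# Complete the following code:\n"
        f"{file_prefix}"
    )

print(build_completion_prompt("user = fetch_user(", ["fetch_user", "UserId"]))
```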

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0


Thanks to the over 11,000 people who joined us for the first AI Engineer Summit! A full recap is coming, but you can 1) catch up on the fun and videos on Twitter and YouTube, 2) help us reach 1000 people for the first comprehensive State of AI Engineering survey and 3) submit projects for the new AI Engineer Foundation.See our Community page for upcoming meetups in SF, Paris, NYC, and Singapore. This episode had good interest on Twitter.Last month, Imbue was crowned as AI's newest unicorn foundation model lab, raising a $200m Series B at a >$1 billion valuation. As “stealth” foundation model companies go, Imbue (f.k.a. Generally Intelligent) has stood as an enigmatic group given they have no publicly released models to try out. However, ever since their $20m Series A last year their goal has been to “develop generally capable AI agents with human-like intelligence in order to solve problems in the real world”.From RL to Reasoning LLMsAlong with their Series A, they announced Avalon, “A Benchmark for RL Generalization Using Procedurally Generated Worlds”. Avalon is built on top of the open source Godot game engine, and is ~100x faster than Minecraft to enable fast RL benchmarking and a clear reward with adjustable game difficulty.After a while, they realized that pure RL isn't a good path to teach reasoning and planning. The agents were able to learn mechanical things like opening complex doors, climbing, but couldn't go to higher level tasks. A pure RL world also doesn't include a language explanation of the agent reasoning, which made it hard to understand why it made certain decisions. That pushed the team more towards the “models for reasoning” path:“The second thing we learned is that pure reinforcement learning is not a good vehicle for planning and reasoning. So these agents were able to learn all sorts of crazy things: They could learn to climb like hand over hand in VR climbing, they could learn to open doors like very complicated, like multiple switches and a lever open the door, but they couldn't do any higher level things. And they couldn't do those lower level things consistently necessarily. And as a user, I do not want to interact with a pure reinforcement learning end to end RL agent. As a user, like I need much more control over what that agent is doing.”Inspired by Chelsea Finn's work on SayCan at Stanford, the team pivoted to have their agents do the reasoning in natural language instead. This development parallels the large leaps in reasoning that humans have developed as the scientific method:“We are better at reasoning now than we were 3000 years ago. An example of a reasoning strategy is noticing you're confused. Then when I notice I'm confused, I should ask:* What was the original claim that was made? * What evidence is there for this claim? * Does the evidence support the claim? * Is the claim correct? This is like a reasoning strategy that was developed in like the 1600s, you know, with like the advent of science. So that's an example of a reasoning strategy. There are tons of them. We employ all the time, lots of heuristics that help us be better at reasoning. And we can generate data that's much more specific to them.“The Full Stack Model LabOne year later, it would seem that the pivot to reasoning has had tremendous success, and Imbue has now reached a >$1B valuation, with participation from Astera Institute, NVIDIA, Cruise CEO Kyle Vogt, Notion co-founder Simon Last, and others. Imbue tackles their work with a “full stack” approach:* Models. 
Pretraining very large (>100B parameter) models, optimized to perform well on internal reasoning benchmarks, with a ~10,000 Nvidia H100 GPU cluster lets us iterate rapidly on everything from training data to architecture and reasoning mechanisms.* Tools and Agents. Building internal productivity tools from coding agents for fixing type checking and linting errors, to sophisticated systems like CARBS (for hyperparameter tuning and network architecture search).* Interface Invention. Solving agent trust and collaboration (not merely communication) with humans by creating better abstractions and interfaces — IDEs for users to program computers in natural language.* Theory. Publishing research about the theoretical underpinnings of self-supervised learning, as well as scaling laws for machine learning research.Kanjun believes we are still in the “bare metal phase” of agent development, and they want to take a holistic approach to building the “operating system for agents”. We loved diving deep into the Imbue approach toward solving the AI Holy Grail of reliable agents, and are excited to share our conversation with you today!Timestamps* [00:00:00] Introductions* [00:06:07] The origin story of Imbue* [00:09:39] Imbue's approach to training large foundation models optimized for reasoning* [00:12:18] Imbue's goals to build an "operating system" for reliable, inspectable AI agents* [00:15:37] Imbue's process of developing internal tools and interfaces to collaborate with AI agents* [00:17:27] Imbue's focus on improving reasoning capabilities in models, using code and other data* [00:19:50] The value of using both public benchmarks and internal metrics to evaluate progress* [00:21:43] Lessons learned from developing the Avalon research environment* [00:23:31] The limitations of pure reinforcement learning for general intelligence* [00:28:36] Imbue's vision for building better abstractions and interfaces for reliable agents* [00:31:36] Interface design for collaborating with, rather than just communicating with, AI agents* [00:37:40] The future potential of an agent-to-agent protocol* [00:39:29] Leveraging approaches like critiquing between models and chain of thought* [00:45:49] Kanjun's philosophy on enabling team members as creative agents at Imbue* [00:53:51] Kanjun's experience co-founding the communal co-living space The Archive* [01:00:22] Lightning RoundShow Notes* Imbue* Avalon* CARBS (hyperparameter optimizer)* Series B announcement* Kanjun/Imbue's Podcast* MIT Media Lab* Research mentioned:* Momentum Contrast* SimClr* Chelsea Finn - SayCan* Agent Protocol - part of the AI Engineer Foundation* Xerox PARC* Michael Nielsen* Jason Benn* Outset Capital* Scenius - Kevin Kelly* South Park Commons* The Archive* Thursday Nights in AITranscriptAlessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO at Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. [00:00:19]Swyx: Hey, and today in the studio we have Kanjun from Imbue. Welcome. So you and I have, I guess, crossed paths a number of times. You're formerly named Generally Intelligent and you've just announced your rename, rebrand in huge, humongous ways. So congrats on all of that. And we're here to dive in into deeper detail on Imbue. We like to introduce you on a high level basis, but then have you go into a little bit more of your personal side. 
So you graduated your BS at MIT and you also spent some time at the MIT Media Lab, one of the most famous, I guess, computer hacking labs in the world. Then you graduated MIT and you went straight into BizOps at Dropbox, where you're eventually chief of staff, which is a pretty interesting role we can dive into later. And then it seems like the founder bug hit you. You were basically a three times founder at Ember, Sorceress, and now at Generally Intelligent slash Imbue. What should people know about you on the personal side that's not on your LinkedIn? That's something you're very passionate about outside of work. [00:01:12]Kanjun: Yeah. I think if you ask any of my friends, they would tell you that I'm obsessed with agency, like human agency and human potential. [00:01:19]Swyx: That's work. Come on.Kanjun: It's not work. What are you talking about?Swyx: So what's an example of human agency that you try to promote? [00:01:27]Kanjun: With all of my friends, I have a lot of conversations with them that's kind of helping figure out what's blocking them. I guess I do this with a team kind of automatically too. And I think about it for myself often, like building systems. I have a lot of systems to help myself be more effective. At Dropbox, I used to give this onboarding talk called How to Be Effective, which people liked. I think like a thousand people heard this onboarding talk, and I think maybe Dropbox was more effective. I think I just really believe that as humans, we can be a lot more than we are. And it's what drives everything. I guess completely outside of work, I do dance. I do partner dance. [00:02:03]Swyx: Yeah. Lots of interest in that stuff, especially in the sort of group living houses in San Francisco, which I've been a little bit part of, and you've also run one of those. [00:02:12]Kanjun: That's right. Yeah. I started the archive with two friends, with Josh, my co-founder, and a couple of other folks in 2015. That's right. And GPT-3, our housemates built. [00:02:22]Swyx: Was that the, I guess, the precursor to Generally Intelligent, that you started doing more things with Josh? Is that how that relationship started? Yeah. [00:02:30]Kanjun: This is our third company together. Our first company, Josh poached me from Dropbox for Ember. And there we built a really interesting technology, laser raster projector, VR headset. And then we were like, VR is not the thing we're most passionate about. And actually it was kind of early days when we both realized we really do believe that in our lifetimes, like computers that are intelligent are going to be able to allow us to do much more than we can do today as people and be much more as people than we can be today. And at that time, we actually, after Ember, we were like, work on AI research or start an AI lab. A bunch of our housemates were joining OpenAI, and we actually decided to do something more pragmatic to apply AI to recruiting and to try to understand like, okay, if we are actually trying to deploy these systems in the real world, what's required? And that was Sorceress. That taught us so much about maybe an AI agent in a lot of ways, like what does it actually take to make a product that people can trust and rely on? I think we never really fully got there. And it's taught me a lot about what's required. And it's kind of like, I think informed some of our approach and some of the way that we think about how these systems will actually get used by people in the real world. 
[00:03:42]Swyx: Just to go one step deeper on that, you're building AI agents in 2016 before it was cool. You got some muscle and you raised $30 million. Something was working. What do you think you succeeded in doing and then what did you try to do that did not pan out? [00:03:56]Kanjun: Yeah. So the product worked quite well. So Sorceress was an AI system that basically looked for candidates that could be a good fit and then helped you reach out to them. And this was a little bit early. We didn't have language models to help you reach out. So we actually had a team of writers that like, you know, customized emails and we automated a lot of the customization. But the product was pretty magical. Like candidates would just be interested and land in your inbox and then you can talk to them. As a hiring manager, that's such a good experience. I think there were a lot of learnings, both on the product and market side. On the market side, recruiting is a market that is endogenously high churn, which means because people start hiring and then we hire the role for them and they stop hiring. So the more we succeed, the more they... [00:04:39]Swyx: It's like the whole dating business. [00:04:40]Kanjun: It's the dating business. Exactly. Exactly. And I think that's the same problem as the dating business. And I was really passionate about like, can we help people find work that is more exciting for them? A lot of people are not excited about their jobs and a lot of companies are doing exciting things and the matching could be a lot better. But the dating business phenomenon like put a damper on that, like it's actually a pretty good business. But as with any business with like relatively high churn, the bigger it gets, the more revenue we have, the slower growth becomes because if 30% of that revenue you lose year over year, then it becomes a worse business. So that was the dynamic we noticed quite early on after our Series A. I think the other really interesting thing about it is we realized what was required for people to trust that these candidates were like well vetted and had been selected for a reason. And it's what actually led us, you know, a lot of what we do at Imbue is working on interfaces to figure out how do we get to a situation where when you're building and using agents, these agents are trustworthy to the end user. That's actually one of the biggest issues with agents that, you know, go off and do longer range goals is that I have to trust, like, did they actually think through this situation? And that really informed a lot of our work today. [00:05:52]Alessio: Let's jump into GI now, Imbue. When did you decide recruiting was done for you and you were ready for the next challenge? And how did you pick the agent space? I feel like in 2021, it wasn't as mainstream. Yeah. [00:06:07]Kanjun: So the LinkedIn says that it started in 2021, but actually we started thinking very seriously about it in early 2020, late 2019, early 2020. So what we were seeing is that scale is starting to work and language models probably will actually get to a point where like with hacks, they're actually going to be quite powerful. And it was hard to see that at the time, actually, because GPT-3, the early versions of it, there are all sorts of issues. We're like, oh, that's not that useful, but we could kind of see like, okay, you keep improving it in all of these different ways and it'll get better. What Josh and I were really interested in is how can we get computers that help us do bigger things? 
Like, you know, there's this kind of future where I think a lot about, you know, if I were born in 1900 as a woman, like my life would not be that fun. I'd spend most of my time like carrying water and literally like getting wood to put in the stove to cook food and like cleaning and scrubbing the dishes and, you know, getting food every day because there's no refrigerator, like all of these things, very physical labor. And what's happened over the last 150 years since the industrial revolution is we've kind of gotten free energy, like energy is way more free than it was 150 years ago. And so as a result, we've built all these technologies like the stove and the dishwasher and the refrigerator, and we have electricity and we have infrastructure, running water, all of these things that have totally freed me up to do what I can do now. And I think the same thing is true for intellectual energy. We don't really see it today because we're so in it, but our computers have to be micromanaged. You know, part of why people are like, oh, you're stuck to your screen all day. Well, we're stuck to our screen all day because literally nothing happens unless I'm doing something in front of my screen. I don't, you know, I can't send my computer off to do a bunch of stuff for me. And there is a future where that's not the case, where, you know, I can actually go off and do stuff and trust that my computer will pay my bills and figure out my travel plans and do the detailed work that I am not that excited to do so that I can like be much more creative and able to do things that I as a human, I'm very excited about and collaborate with other people. And there are things that people are uniquely suited for. So that's kind of always been the thing that has been really exciting to me. Like Josh and I have known for a long time, I think that, you know, whatever AI is, it would happen in our lifetimes. And the personal computer kind of started giving us a bit of free intellectual energy. And this is like really the explosion of free intellectual energy. So in early 2020, we were thinking about this and what happened was self-supervised learning basically started working across everything. It worked in language; MoCo, Momentum Contrast, had come out earlier in 2019, and SimCLR came out in early 2020. And we're like, okay, for the first time, self-supervised learning is working really well across images and text, and we suspected that like, okay, actually it's the case that machines can learn things the way that humans do. And if that's true, if they can learn things in a fully self-supervised way, because like as people, we are not supervised. We like go Google things and try to figure things out. So if that's true, then like what the computer could be is much bigger than what it is today. And so we started exploring ideas around like, how do we actually go? We didn't think about the fact that we could actually just build a research lab. So we were like, okay, what kind of startup could we build to like leverage self-supervised learning? So that eventually becomes something that allows computers to become much more able to do bigger things for us. But that became Generally Intelligent, which started as a research lab. [00:09:39]Alessio: So your mission is you aim to rekindle the dream of the personal computer. So when did it go wrong and what are like your first products and user facing things that you're building to rekindle it? [00:09:53]Kanjun: Yeah. 
So what we do at Imbue is we train large foundation models optimized for reasoning. And the reason for that is because reasoning is actually, we believe the biggest blocker to agents or systems that can do these larger goals. If we think about something that writes an essay, like when we write an essay, we like write it. We put it and then we're done. We like write it and then we look at it and we're like, oh, I need to do more research on that area. I'm going to go do some research and figure it out and come back and, oh, actually it's not quite right. The structure of the outline. So I'm going to rearrange the outline, rewrite it. It's this very iterative process and it requires thinking through like, okay, what am I trying to do? Is the goal correct? Also like, has the goal changed as I've learned more? So as a tool, like when should I ask the user questions? I shouldn't ask them questions all the time, but I should ask them questions in higher risk situations. How certain am I about the like flight I'm about to book? There are all of these notions of like risk certainty, playing out scenarios, figuring out how to make a plan that makes sense, how to change the plan, what the goal should be. That are things that we lump under the bucket of reasoning and models today, they're not optimized for reasoning. It turns out that there's not actually that much explicit reasoning data on the internet as you would expect. And so we get a lot of mileage out of optimizing our models for reasoning in pre-training. And then on top of that, we build agents ourselves and we, I can get into, we really believe in serious use, like really seriously using the systems and trying to get to an agent that we can use every single day, tons of agents that we can use every single day. And then we experiment with interfaces that help us better interact with the agents. So those are some set of things that we do on the kind of model training and agent side. And then the initial agents that we build, a lot of them are trying to help us write code better because code is most of what we do every day. And then on the infrastructure and theory side, we actually do a fair amount of theory work to understand like, how do these systems learn? And then also like, what are the right abstractions for us to build good agents with, which we can get more into. And if you look at our website, we build a lot of tools internally. We have a like really nice automated hyperparameter optimizer. We have a lot of really nice infrastructure and it's all part of the belief of like, okay, let's try to make it so that the humans are doing the things humans are good at as much as possible. So out of our very small team, we get a lot of leverage. [00:12:18]Swyx: And so would you still categorize yourself as a research lab now, or are you now in startup mode? Is that a transition that is conscious at all? [00:12:26]Kanjun: That's a really interesting question. I think we've always intended to build, you know, to try to build the next version of the computer, enable the next version of the computer. The way I think about it is there's a right time to bring a technology to market. So Apple does this really well. Actually, iPhone was under development for 10 years, AirPods for five years. And Apple has a story where iPhone, the first multi-touch screen was created. They actually were like, oh wow, this is cool. Let's like productionize iPhone. 
They actually brought, they like did some work trying to productionize it and realized this is not good enough. And they put it back into research to try to figure out like, how do we make it better? What are the interface pieces that are needed? And then they brought it back into production. So I think of production and research as kind of like these two separate phases. And internally we have that concept as well, where like things need to be done in order to get to something that's usable. And then when it's usable, like eventually we figure out how to productize it. [00:13:20]Alessio: What's the culture like to make that happen, to have both like kind of like product oriented, research oriented. And as you think about building the team, I mean, you just raised 200 million. I'm sure you want to hire more people. What are like the right archetypes of people that work at Imbue? [00:13:35]Kanjun: I would say we have a very unique culture in a lot of ways. I think a lot about social process design. So how do you design social processes that enable people to be effective? I like to think about team members as creative agents, because most companies, they think of their people as assets and they're very proud of this. And I think about like, okay, what is an asset? It's something you own that provides you value that you can discard at any time. This is a very low bar for people. This is not what people are. And so we try to enable everyone to be a creative agent and to really unlock their superpowers. So a lot of the work I do, you know, I was mentioning earlier, I'm like obsessed with agency. A lot of the work I do with team members is try to figure out like, you know, what are you really good at? What really gives you energy and where can we put you such that, how can I help you unlock that and grow that? So much of our work, you know, in terms of team structure, like much of our work actually comes from people. Carbs, our hyperparameter optimizer came from Abe trying to automate his own research process doing hyperparameter optimization. And he actually pulled some ideas from plasma physics. He's a plasma physicist to make the local search work. A lot of our work on evaluations comes from a couple of members of our team who are like obsessed with evaluations. We do a lot of work trying to figure out like, how do you actually evaluate if the model is getting better? Is the model making better agents? Is the agent actually reliable? A lot of things kind of like, I think of people as making the like them shaped blob inside imbue and I think, you know, yeah, that's the kind of person that we're, we're hiring for. We're hiring product engineers and data engineers and research engineers and all these roles. We have projects, not teams. We have a project around data, data collection and data engineering. That's actually one of the key things that improve the model performance. We have a pre-training kind of project with some fine tuning as part of that. And then we have an agent's project that's like trying to build on top of our models as well as use other models in the outside world to try to make agents then we actually use as programmers every day. So all sorts of different, different projects. [00:15:37]Swyx: As a founder, you're now sort of a capital allocator among all of these different investments effectively at different projects. 
And I was interested in how you mentioned that you were optimizing for improving reasoning and specifically inside of your pre-training, which I assume is just a lot of data collection. [00:15:55]Kanjun: We are optimizing reasoning inside of our pre-trained models. And a lot of that is about data. And I can talk more about like what, you know, what exactly does it involve? But actually big, maybe 50% plus of the work is figuring out even if you do have models that reason well, like the models are still stochastic. The way you prompt them still makes, is kind of random, like makes them do random things. And so how do we get to something that is actually robust and reliable as a user? How can I, as a user, trust it? We have all sorts of cool things on the, like, you know, I was mentioning earlier when I talked to other people building agents, they have to do so much work, like to try to get to something that they can actually productize and it takes a long time and agents haven't been productized yet for, partly for this reason is that like the abstractions are very leaky. We can get like 80% of the way there, but like self-driving cars, like the remaining 20% is actually really difficult. We believe that, and we have internally, I think some things that like an interface, for example, that lets me really easily like see what the agent execution is, fork it, try out different things, modify the prompt, modify like the plan that it is making. This type of interface, it makes it so that I feel more like I'm collaborating with the agent as it's executing, as opposed to it's just like doing something as a black box. That's an example of a type of thing that's like beyond just the model pre-training, but on the model pre-training side, like reasoning is a thing that we optimize for. And a lot of that is about what data do we put in. [00:17:27]Swyx: It's interesting just because I always think like, you know, out of the levers that you have, the resources that you have, I think a lot of people think that running foundation model company or a research lab is going to be primarily compute. And I think the share of compute has gone down a lot over the past three years. It used to be the main story, like the main way you scale is you just throw more compute at it. And now it's like, Flops is not all you need. You need better data, you need better algorithms. And I wonder where that shift has gone. This is a very vague question, but is it like 30-30-30 now? Is it like maybe even higher? So one way I'll put this is people estimate that Llama2 maybe took about three to $4 million of compute, but probably 20 to $25 million worth of labeling data. And I'm like, okay, well that's a very different story than all these other foundation model labs raising hundreds of millions of dollars and spending it on GPUs. [00:18:20]Kanjun: Data is really expensive. We generate a lot of data. And so that does help. The generated data is close to actually good, as good as human labeled data. [00:18:34]Swyx: So generated data from other models? [00:18:36]Kanjun: From our own models. From your own models. Or other models, yeah. [00:18:39]Swyx: Do you feel like there's certain variations of this? There's the sort of the constitutional AI approach from Anthropic and basically models sampling training on data from other models. 
I feel like there's a little bit of like contamination in there, or to put it in a statistical form, you're resampling a distribution that you already have that you already know doesn't match human distributions. How do you feel about that basically, just philosophically? [00:19:04]Kanjun: So when we're optimizing models for reasoning, we are actually trying to like make a part of the distribution really spiky. So in a sense, like that's actually what we want. We want to, because the internet is a sample of the human distribution that's also skewed in all sorts of ways. That is not the data that we necessarily want these models to be trained on. And so when we're generating data, we're not really randomly generating data. We generate very specific things that are like reasoning traces and that help optimize reasoning. Code also is a big piece of improving reasoning. So generated code is not that much worse than like regular human written code. You might even say it can be better in a lot of ways. So yeah. So we are trying to already do that. [00:19:50]Alessio: What are some of the tools that you thought were not a good fit? So you built Avalon, which is your own simulated world. And when you first started, the metagame was like using games to simulate things using, you know, Minecraft and then OpenAI is like the gym thing and all these things. And I think in one of your other podcasts, you mentioned like Minecraft is like way too slow to actually do any serious work. Is that true? Yeah. I didn't say it. [00:20:17]Swyx: I don't know. [00:20:18]Alessio: That's above my pay grade. But Avalon is like a hundred times faster than Minecraft for simulation. When did you figure that out that you needed to just like build your own thing? Was it kind of like your engineering team was like, Hey, this is too slow. Was it more a long-term investment? [00:20:34]Kanjun: Yeah. At that time we built Avalon as a research environment to help us learn particular things. And one thing we were trying to learn is like, how do you get an agent that is able to do many different tasks? Like RL agents at that time and environments at that time. What we heard from other RL researchers was the like biggest thing keeping holding the field back is lack of benchmarks that let us explore things like planning and curiosity and things like that and have the agent actually perform better if the agent has curiosity. And so we were trying to figure out in a situation where, how can we have agents that are able to handle lots of different types of tasks without the reward being pretty handcrafted? That's a lot of what we had seen is that like these very handcrafted rewards. And so Avalon has like a single reward it's across all tasks. And it also allowed us to create a curriculum so we could make the level more or less difficult. And it taught us a lot, maybe two primary things. One is with no curriculum, RL algorithms don't work at all. So that's actually really interesting. [00:21:43]Swyx: For the non RL specialists, what is a curriculum in your terminology? [00:21:46]Kanjun: So a curriculum in this particular case is basically the environment Avalon lets us generate simpler environments and harder environments for a given tasks. What's interesting is that the simpler environments, what you'd expect is the agent succeeds more often. So it gets more reward. And so, you know, kind of my intuitive way of thinking about it is, okay, the reason why it learns much faster with a curriculum is it's just getting a lot more signal. 
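To make the curriculum idea concrete, here is a minimal sketch of a success-rate-driven curriculum in the spirit of what Kanjun describes: start on easy levels so the agent succeeds often and gets lots of reward signal, then ratchet the difficulty up as it improves. This is an illustration only, not Avalon's actual mechanism; the environment interface, the 0.7 success threshold, and the difficulty step size are all assumptions.

```python
# Minimal sketch of a success-rate-driven curriculum (illustrative only).
# Assumptions: the environment exposes a `difficulty` knob in [0, 1], and
# run_episode returns True on success. Threshold and step size are arbitrary.
from collections import deque
import random

def run_episode(difficulty: float) -> bool:
    # Stand-in for a real RL rollout: harder levels succeed less often.
    return random.random() > difficulty

def train_with_curriculum(episodes: int = 10_000) -> float:
    difficulty = 0.1                # start with easy levels
    recent = deque(maxlen=100)      # rolling window of recent outcomes
    for _ in range(episodes):
        recent.append(run_episode(difficulty))
        # Easy levels mean more successes, and therefore more reward signal,
        # early on; once the agent succeeds often enough, make levels harder.
        if len(recent) == recent.maxlen and sum(recent) / len(recent) > 0.7:
            difficulty = min(1.0, difficulty + 0.05)
            recent.clear()
    return difficulty

if __name__ == "__main__":
    print("final difficulty reached:", train_with_curriculum())
```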
And that's actually an interesting general intuition to have about training these things as like, what kind of signal are they getting? And like, how can you help it get a lot more signal? The second thing we learned is that reinforcement learning is not a good vehicle, like pure reinforcement learning is not a good vehicle for planning and reasoning. So these agents were not able to, they were able to learn all sorts of crazy things. They could learn to climb like hand over hand in VR climbing, they could learn to open doors like very complicated, like multiple switches and a lever open the door, but they couldn't do any higher level things. And they couldn't do those lower level things consistently necessarily. And as a user, I do not want to interact with a pure reinforcement learning end to end RL agent. As a user, like I need much more control over what that agent is doing. And so that actually started to get us on the track of thinking about, okay, how do we do the reasoning part in language? And we were pretty inspired by our friend Chelsea Finn at Stanford, who was, I think, working on SayCan at the time, where it's basically an experiment where they have robots kind of trying to do different tasks and actually do the reasoning for the robot in natural language. And it worked quite well. And that led us to start experimenting very seriously with reasoning. [00:23:31]Alessio: How important is the language part for the agent versus for you to inspect the agent? You know, like is it the interface to kind of the human in the loop really important or? [00:23:43]Kanjun: Yeah, I personally think of it as it's much more important for us, the human user. So I think you probably could get end to end agents that work and are fairly general at some point in the future. But I think you don't want that. Like we actually want agents that we can like perturb while they're trying to figure out what to do. Because, you know, even a very simple example, internally we have like a type error fixing agent and we have like a test generation agent. Test generation agent goes off rails all the time. I want to know, like, why did it generate this particular test? [00:24:19]Swyx: What was it thinking? [00:24:20]Kanjun: Did it consider, you know, the fact that this is calling out to this other function? And the formatter agent, if it ever comes up with anything weird, I want to be able to debug like what happened with RL end to end stuff. Like we couldn't do that. Yeah. [00:24:36]Swyx: It sounds like you have a bunch of agents operating internally within the company. What's your most, I guess, successful agent and what's your least successful one? [00:24:44]Kanjun: The agents don't work. All of them? I think the only successful agents are the ones that do really small things. So very specific, small things like fix the color of this button on the website or like change the color of this button. [00:24:57]Swyx: Which is what sweep.dev is doing now. Exactly. [00:25:00]Kanjun: Perfect. Okay. [00:25:02]Swyx: Well, we should just use sweep.dev. Well, I mean, okay. I don't know how often you have to fix the color of a button, right? Because all of them raise money on the idea that they can go further. And my fear when encountering something like that is that there's some kind of unknown asymptote ceiling that's going to prevent them, that they're going to run head on into that you've already run into. [00:25:21]Kanjun: We've definitely run into such a ceiling. But what is the ceiling? [00:25:24]Swyx: Is there a name for it? 
Like what? [00:25:26]Kanjun: I mean, for us, we think of it as reasoning plus these tools. So reasoning plus abstractions, basically. I think actually you can get really far with current models and that's why it's so compelling. Like we can pile debugging tools on top of these current models, have them critique each other and critique themselves and do all of these, like spend more compute at inference time, context hacks, retrieval augmented generation, et cetera, et cetera, et cetera. Like the pile of hacks actually does get us really far. And a way to think about it is like the underlying language model is kind of like a noisy channel. Actually I don't want to use this analogy. It's actually a really bad analogy, but you're kind of like trying to get more signal out of the channel. We don't like to think about it that way. That's what the default approach is, like trying to get more signal out of this noisy channel. But the issue with agents is as a user, I want it to be mostly reliable. It's kind of like self-driving in that way. Like it's not as bad as self-driving, like in self-driving, you know, you're like hurtling at 70 miles an hour. It's like the hardest agent problem. But one thing we learned from Sorceress and one thing we learned by using these things internally is we actually have a pretty high bar for these agents to work. You know, it's actually really annoying if they only work 50% of the time and we can make interfaces to make it slightly less annoying. But yeah, there's a ceiling that we've encountered so far and we need to make the models better. We also need to make the kind of like interface to the user better. And also a lot of the like critiquing. I hope what we can do is help people who are building agents actually like be able to deploy them. I think, you know, that's the gap that we see a lot of today is everyone who's trying to build agents to get to the point where it's robust enough to be deployable. It just, it's like an unknown amount of time. Okay. [00:27:12]Swyx: So this goes back into what Imbue is going to offer as a product or a platform. How are you going to actually help people deploy those agents? Yeah. [00:27:21]Kanjun: So our current hypothesis, I don't know if this is actually going to end up being the case. We've built a lot of tools for ourselves internally around like debugging, around abstractions or techniques after the model generation happens. Like after the language model generates the text and like interfaces for the user and the underlying model itself, like models talking to each other, maybe some set of those things kind of like an operating system. Some set of those things will be helpful for other people. And we'll figure out what set of those things is helpful for us to make our agents. Like what we want to do is get to a point where we can like start making an agent, deploy it, it's reliable, like very quickly. And there's a similar analog to software engineering, like in the early days, in the sixties and seventies, like to program a computer, like you have to go all the way down to the registers and write things and eventually we had assembly. That was like an improvement. But then we wrote programming languages with these higher levels of abstraction and that allowed a lot more people to do this and much faster. And the software created is much less expensive. And I think it's basically a similar route here where we're like in the like bare metal phase of agent building. 
And we will eventually get to something with much nicer abstractions. [00:28:36]Alessio: We had this conversation with George Hotz and we were like, there's not a lot of reasoning data out there. And can the models really understand? And his take was like, look, with enough compute, you're not that complicated as a human. Like the model can figure out eventually why certain decisions are made. What's been your experience? Like as you think about reasoning data, like do you have to do a lot of like manual work or like is there a way to prompt models to extract the reasoning from actions that they see? [00:29:03]Kanjun: So we don't think of it as, oh, throw enough data at it and then it will figure out what the plan should be. I think we're much more explicit. You know, a way to think about it is as humans, we've learned a lot of reasoning strategies over time. We are better at reasoning now than we were 3000 years ago. An example of a reasoning strategy is noticing you're confused. Then when I notice I'm confused, I should ask like, huh, what was the original claim that was made? What evidence is there for this claim? Does the evidence support the claim? Is the claim correct? This is like a reasoning strategy that was developed in like the 1600s, you know, with like the advent of science. So that's an example of a reasoning strategy. There are tons of them that we employ all the time, lots of heuristics that help us be better at reasoning. And we didn't always have them. And because they're invented, like we can generate data that's much more specific to them. So I think internally, yeah, we have a lot of thoughts on what reasoning is and we generate a lot more specific data. We're not just like, oh, it'll figure out reasoning from this black box or like it'll figure out reasoning from the data that exists. Yeah. [00:30:04]Alessio: I mean, the scientific method is like a good example. If you think about hallucination, right, people are thinking, how do we use these models to do net new, like scientific research? And if you go back in time and the model is like, well, the earth revolves around the sun and people are like, man, this model is crap. It's like, what are you talking about? Like the sun revolves around the earth. It's like, how do you see the future? Like if the models are actually good enough, but we don't believe them, it's like, how do we make the two live together? So you're like, you use Imbue as a scientist to do a lot of your research and Imbue tells you, hey, I think this is like a serious path you should go down. And you're like, no, that sounds impossible. Like how is that trust going to be built? And like, what are some of the tools that maybe are going to be there to inspect it? [00:30:51]Kanjun: Really there are two answers to this. One element of it is as a person, like I need to basically get information out of the model such that I can try to understand what's going on with the model. Then the second question is like, okay, how do you do that? And that's kind of some of our debugging tools, they're not necessarily just for debugging. They're also for like interfacing with and interacting with the model. So like if I go back in this reasoning trace and like change a bunch of things, what's going to happen? Like, what does it conclude instead? So that kind of helps me understand like, what are its assumptions? And, you know, we think of these things as tools. And so it's really about like, as a user, how do I use this tool effectively? 
I need to be willing to be convinced as well. It's like, how do I use this tool effectively? And what can it help me with? [00:31:36]Swyx: And what can it tell me? There's a lot of mention of code in your process. And I was hoping to dive in even deeper. I think we might run the risk of giving people the impression that you view code or you use code just as like a tool within Imbue just for coding assistance. But I think you actually train code models. And I think there's a lot of informal understanding about how adding code to language models improves their reasoning capabilities. I wonder if there's any research or findings that you have to share that talks about the intersection of code and reasoning. Hmm. Yeah. [00:32:08]Kanjun: So the way I think about it intuitively is like code is the most explicit example of reasoning data on the internet. [00:32:15]Swyx: Yeah. [00:32:15]Kanjun: And it's not only structured, it's actually very explicit, which is nice. You know, it says this variable means this, and then it uses this variable. And then the function does this. As people, when we talk in language, it takes a lot more to extract that explicit structure out of our language. And so that's one thing that's really nice about code is I see it as almost like a curriculum for reasoning. I think we use code in all sorts of ways. The coding agents are really helpful for us to understand what are the limitations of the agents. The code is really helpful for the reasoning itself. But also code is a way for models to act. So by generating code, it can act on my computer. And, you know, when we talk about rekindling the dream of the personal computer, kind of where I see computers going is, you know, like computers will eventually become these much more malleable things where I, as a user today, have to know how to write software code, like in order to make my computer do exactly what I want it to do. But in the future, if the computer is able to generate its own code, then I can actually interface with it in natural language. And so one way we think about agents is kind of like a natural language programming language. It's a way to program my computer in natural language that's much more intuitive to me as a user. And these interfaces that we're building are essentially IDEs for users to program our computers in natural language. Maybe I should say what we're doing that way. Maybe it's clearer. [00:33:47]Swyx: I don't know. [00:33:47]Alessio: That's a good pitch. What do you think about the different approaches people have, kind of like text first, browser first, like MultiOn? What do you think the best interface will be? Or like, what is your, you know, thinking today? [00:33:59]Kanjun: In a lot of ways, like chat as an interface, I think Linus, Linus Lee, you had him on this podcast. I really like how he put it. Chat as an interface is skeuomorphic. So in the early days, when we made word processors on our computers, they had notepad lines because that's what we understood these like objects to be. Chat, like texting someone is something we understand. So texting our AI is something that we understand. But today's word documents don't have notepad lines. And similarly, the way we want to interact with agents, like chat is a very primitive way of interacting with agents. What we want is to be able to inspect their state and to be able to modify them and fork them and all of these other things. And we internally think about, what are the right representations for that? 
Like architecturally, like what are the right representations? What kind of abstractions do we need to build? And how do we build abstractions that are not leaky? Because if the abstractions are leaky, which they are today, like, you know, this stochastic generation of text is like a leaky abstraction. I cannot depend on it. And that means it's actually really hard to build on top of. But our experience and belief is actually by building better abstractions and better tooling, we can actually make these things non-leaky. And now you can build like whole things on top of them. So these other interfaces, because of where we are, we don't think that much about them. [00:35:17]Swyx: Yeah. [00:35:17]Alessio: I mean, you mentioned, this is kind of like the Xerox PARC moment for AI. And we had a lot of stuff come out of PARC, like the what-you-see-is-what-you-get editors and like MVC and all this stuff. But yeah, but then we didn't have the iPhone at PARC. We didn't have all these like higher things. What do you think it's reasonable to expect in like this era of AI, you know, call it like five years or so? Like what are like the things we'll build today and what are things that maybe we'll see in kind of like the second wave of products? [00:35:46]Kanjun: That's interesting. I think the waves will be much faster than before. Like what we're seeing right now is basically like a continuous wave. Let me zoom a little bit earlier. So people like the Xerox PARC analogy I give, but I think there are many different analogies. Like one is the like analog to digital computer is kind of an example, like another analogy to where we are today. The analog computer Vannevar Bush built in the 1930s, I think, is like a system of pulleys and it can only calculate one function. Like it can calculate like an integral. And that was so magical at the time because you actually did need to calculate this integral a bunch, but it had a bunch of issues, like in analog, errors compound. And so there was actually a set of breakthroughs necessary in order to get to the digital computer, like Turing's decidability, Shannon's insight that relay circuits can be mapped to Boolean operators, and a set of other like theoretical breakthroughs, which essentially were abstractions. They were like creating abstractions for these very lossy, very analog circuits, and digital had this nice property of like being error correcting. And so when I talk about like less leaky abstractions, that's what I mean. That's what I'm kind of pointing a little bit to. It's not going to look exactly the same way. And then the Xerox PARC piece, a lot of that is about like, how do we get to computers that as a person, I can actually use well. And the interface actually helps it unlock so much more power. So the sets of things we're working on, like the sets of abstractions and the interfaces, like hopefully that like help us unlock a lot more power in these systems. Like hopefully that'll come not too far in the future. I could see a next version, maybe a little bit farther out. It's like an agent protocol. So a way for different agents to talk to each other and call each other. Kind of like HTTP. [00:37:40]Swyx: Do you know it exists already? [00:37:41]Kanjun: Yeah, there is a nonprofit that's working on one. I think it's a bit early, but it's interesting to think about right now. 
Part of why I think it's early is because the issue with agents, it's not quite like the internet where you could like make a website and the website would appear. The issue with agents is that they don't work. And so it may be a bit early to figure out what the protocol is before we really understand how these agents get constructed. But, you know, I think that's, I think it's a really interesting question. [00:38:09]Swyx: While we're talking on this agent to agent thing, there's been a bit of research recently on some of these approaches. I tend to just call them extremely complicated chain of thoughting, but any perspectives on kind of meta-GPT, I think it's the name of the paper. I don't know if you care about at the level of individual papers coming out, but I did read that recently and TLDR, it beat GPT-4 and human eval by role-playing software agent development agency, instead of having sort of single shot or single role, you have multiple roles and how having all of them criticize each other as agents communicating with other agents. [00:38:45]Kanjun: Yeah, I think this is an example of an interesting abstraction of like, okay, can I just plop in this like multi-role critiquing and see how it improves my agent? And can I just plop in chain of thought, tree of thought, plop in these other things and see how they improve my agent? One issue with this kind of prompting is that it's still not very reliable. It's like, there's one lens, which is like, okay, if you do enough of these techniques, you'll get to high reliability. And I think actually that's a pretty reasonable lens. We take that lens often. And then there's another lens that's like, okay, but it's starting to get really messy what's in the prompt and like, how do we deal with that messiness? And so maybe you need like cleaner ways of thinking about and constructing these systems. And we also take that lens. So yeah, I think both are necessary. Yeah. [00:39:29]Swyx: Side question, because I feel like this also brought up another question I had for you. I noticed that you work a lot with your own benchmarks, your own evaluations of what is valuable. I would say I would contrast your approach with OpenAI as OpenAI tends to just lean on, hey, we played StarCraft or hey, we ran it on the SAT or the, you know, the AP bio test and that did results. Basically, is benchmark culture ruining AI? [00:39:55]Swyx: Or is that actually a good thing? Because everyone knows what an SAT is and that's fine. [00:40:04]Kanjun: I think it's important to use both public and internal benchmarks. Part of why we build our own benchmarks is that there are not very many good benchmarks for agents, actually. And to evaluate these things, you actually need to think about it in a slightly different way. But we also do use a lot of public benchmarks for like, is the reasoning capability in this particular way improving? So yeah, it's good to use both. [00:40:26]Swyx: So for example, the Voyager paper coming out of NVIDIA played Minecraft and set their own benchmarks on getting the Diamond X or whatever and exploring as much of the territory as possible. And I don't know how that's received. That's obviously fun and novel for the rest of the engineer, the people who are new to the scene. But for people like yourselves, you build Avalon just because you already found deficiencies with using Minecraft. Is that valuable as an approach? Oh, yeah. I love Voyager. [00:40:57]Kanjun: I mean, Jim, I think is awesome. 
And I really like the Voyager paper and I think it has a lot of really interesting ideas, which is like the agent can create tools for itself and then use those tools. [00:41:06]Swyx: He had the idea of the curriculum as well, which is something that we talked about earlier. Exactly. [00:41:09]Kanjun: And that's like a lot of what we do. We built Avalon mostly because we couldn't use Minecraft very well to like learn the things we wanted. And so it's like not that much work to build our own. [00:41:19]Swyx: It took us, I don't know. [00:41:22]Kanjun: We had like eight engineers at the time, took about eight weeks. So six weeks. [00:41:27]Swyx: And OpenAI built their own as well, right? Yeah, exactly. [00:41:30]Kanjun: It's just nice to have control over our environment, our own sandbox, to really try to inspect our own research questions. But if you're doing something like experimenting with agents and trying to get them to do things, Minecraft is a really interesting environment. And so Voyager has a lot of really interesting ideas in it. [00:41:47]Swyx: Yeah. Cool. One more element that we had on this list, which is context and memory. I think that's kind of like the foundational, quote unquote, RAM of our era. I think Andrej Karpathy has already made this comparison. So there's nothing new here. And that's just the amount of working knowledge that we can fit into one of these agents. And it's not a lot, right? Especially if you need to get them to do long running tasks. If they need to self-correct from errors that they observe while operating in their environment. Do you see this as a problem? Do you think we're going to just trend to infinite context and that'll go away? Or how do you think we're going to deal with it? [00:42:22]Kanjun: I think when you talked about what's going to happen in the first wave and then in the second wave, I think what we'll see is we'll get like relatively simplistic agents pretty soon. And they will get more and more complex. And there's like a future wave in which they are able to do these like really difficult, really long running tasks. And the blocker to that future, one of the blockers is memory. And that was true of computers too. You know, I think when von Neumann made the von Neumann architecture, he was like, the biggest blocker will be like, we need this amount of memory, which is like, I don't remember exactly like 32 kilobytes or something to store programs. And that will allow us to write software. He didn't say it this way because he didn't have these terms, but that only really like happened in the seventies with the microchip revolution. It may be the case that we're waiting for some research breakthroughs or some other breakthroughs in order for us to have like really good long running memory. And then in the meantime, agents will be able to do all sorts of things that are a little bit smaller than that. I do think with the pace of the field, we'll probably come up with all sorts of interesting things like, you know, RAG is already very helpful. [00:43:26]Swyx: Good enough, you think? [00:43:27]Kanjun: Maybe good enough for some things. [00:43:29]Swyx: How is it not good enough? I don't know. [00:43:31]Kanjun: I just think about a situation where you want something that's like an AI scientist. As a scientist, I have learned so much about my field and a lot of that data is maybe hard to fine tune on, or maybe hard to like put into pre-training. 
Like a lot of that data, I don't have a lot of like repeats of the data that I'm seeing. You know, like if I'm a scientist, I've like accumulated so many little data points. And ideally I'd want to store those somehow, or like use those to fine tune myself as a model somehow, or like have better memory somehow. I don't think RAG is enough for that kind of thing. But RAG is certainly enough for like user preferences and things like that. Like what should I do in this situation? What should I do in that situation? That's a lot of tasks. We don't have to be a scientist right away. Awesome. [00:44:21]Swyx: I have a hard question, if you don't mind me being bold. Yeah. I think the most comparable lab to Imbue is Adept. You know, a research lab with like some amount of product situation on the horizon, but not just yet, right? Why should people work for Imbue over Adept? And we can cut this if it's too like... Yeah. [00:44:40]Kanjun: The way I think about it is I believe in our approach. The type of thing that we're doing is we're trying to like build something that enables other people to build agents and build something that really can be maybe something like an operating system for agents. I know that that's what we're doing. I don't really know what everyone else is doing. You know, I can kind of like talk to people and have some sense of what they're doing. And I think it's a mistake to focus too much on what other people are doing, because extremely focused execution on the right thing is what matters. To the question of like, why us? I think like strong focus on reasoning, which we believe is the biggest blocker, on inspectability, which we believe is really important for user experience and also for the power and capability of these systems. Building non-leaky, good abstractions, which we believe is solving the core issue of agents, which is around reliability and being able to make them deployable. And then really seriously trying to use these things ourselves, like every single day, and getting to something that we can actually ship to other people that becomes something that is a platform. Like, it feels like it could be Mac or Windows. I love the dogfooding approach. [00:45:49]Swyx: That's extremely important. And you will not be surprised how many agent companies I talk to that don't use their own agent. Oh no, that's not good. That's a big surprise. [00:45:59]Kanjun: Yeah, I think if we didn't use our own agents, then we would have all of these beliefs about how good they are. Wait, did you have any other hard questions you wanted to ask? [00:46:08]Swyx: Yeah, mine was just the only other follow-up that you had based on the answer you just gave was, do you see yourself releasing models or do you see yourself, what are the artifacts that you want to produce that lead up to the general operating system that you want to have people use, right? And so a lot of people just as a byproduct of their work, just to say like, hey, I'm still shipping, is like, here's a model along the way. Adept took, I don't know, three years, but they released Persimmon recently, right? Like, do you think that kind of approach is something on your horizon? Or do you think there's something else that you can release that can show people, here's kind of the idea, not the end products, but here's the byproducts of what we're doing? [00:46:51]Kanjun: Yeah, I don't really believe in releasing things to show people like, oh, here's what we're doing that much. 
I think as a philosophy, we believe in releasing things that will be helpful to other people. [00:47:02]Swyx: Yeah. [00:47:02]Kanjun: And so I think we may release models or we may release tools that we think will help agent builders. Ideally, we would be able to do something like that, but I'm not sure exactly what they look like yet. [00:47:14]Swyx: I think more companies should get into the releasing evals and benchmarks game. Yeah. [00:47:20]Kanjun: Something that we have been talking to agent builders about is co-building evals. So we build a lot of our own evals and every agent builder tells me, basically evals are their biggest issue. And so, yeah, we're exploring right now. And if you are building agents, please reach out to me because I would love to, like, figure out how we can be helpful based on what we've seen. Cool. [00:47:40]Swyx: That's a good call to action. I know a bunch of people that I can send your way. Cool. Great. [00:47:43]Kanjun: Awesome. [00:47:44]Swyx: Yeah. We can zoom out to other interests now. [00:47:46]Alessio: We got a lot of stuff. So we have Sherif from Lexicon, the podcast. He had a lot of interesting questions on his website. You similarly have a lot of them. Yeah. [00:47:55]Swyx: I need to do this. I'm very jealous of people with personal websites right there. Like, here's the high level questions of goals of humanity that I want to set people on. And I don't have that. [00:48:04]Alessio: It's never too late, Sean. [00:48:05]Swyx: Yeah. [00:48:05]Alessio: It's never too late. [00:48:06]Kanjun: Exactly. [00:48:07]Alessio: There were a few that stuck out as related to your work that maybe you're kind of learning [00:48:12]Swyx: more about it. [00:48:12]Alessio: So one is why are curiosity and goal orientation often at odds? And from a human perspective, I get it. It's like, you know, would you want to like go explore things or kind of like focus on your career? How do you think about that from like an agent perspective? Where it's like, should you just stick to the task and try and solve it as in the guardrails as possible? Or like, should you look for alternative solutions? [00:48:34]Swyx: Yeah. [00:48:34]Kanjun: I think one thing that's really interesting about agents actually is that they can be forked. Like, you know, we can take an agent that's executed to a certain place and said, okay, here, like fork this and do a bunch of different things. I try a bunch of different things. Some of those agents can be goal oriented and some of them can be like more curiosity driven. You can prompt them in slightly different ways. And something I'm really curious about, like what would happen if in the future, you know, we were able to actually go down both paths. As a person, why I have this question on my website is I really find that like I really can only take one mode at a time and I don't understand why. And like, is it inherent in like the kind of context that needs to be held? That's why I think from an agent perspective, like forking it is really interesting. Like I can't fork myself to do both, but I maybe could fork an agent to like add a certain point in a task. [00:49:26]Swyx: Yeah. Explore both. Yeah. [00:49:28]Alessio: How has the thinking changed for you as the funding of the company changed? That's one thing that I think a lot of people in the space think is like, oh, should I raise venture capital? Like, how should I get money? 
How do you feel your options to be curious versus like goal oriented has changed as you raise more money and kind of like the company has grown? [00:49:50]Kanjun: Oh, that's really funny. Actually, things have not changed that much. So we raised our Series A $20 million in late 2021. And our entire philosophy at that time was, and still kind of is, is like, how do we figure out the stepping stones, like collect stepping stones that eventually let us build agents, kind of these new computers that help us do bigger things. And there was a lot of curiosity in that. And there was a lot of goal orientation in that. Like the curiosity led us to build CARBS, for example, this hyperparameter optimizer. Great name, by the way. [00:50:28]Swyx: Thank you. [00:50:29]Kanjun: Is there a story behind that name? [00:50:30]Swyx: Yeah. [00:50:31]Kanjun: Abe loves CARBS. It's also cost aware. So as soon as he came up with cost aware, he was like, I need to figure out how to make this work. But the cost awareness of it was really important. So that curiosity led us to this really cool hyperparameter optimizer. That's actually a big part of how we do our research. It lets us experiment on smaller models. And for those experiment results to carry to larger ones. [00:50:56]Swyx: Which you also published a scaling laws, which is great. I think the scaling laws paper from OpenAI was like the biggest. And from Google, I think, was the greatest public service to machine learning that any research lab can do. Yeah, totally. [00:51:10]Kanjun: What was nice about CARBS is it gave us scaling laws for all sorts of hyperparameters. So yeah, that's cool. It basically hasn't changed very much. So there's some curiosity. And then there's some goal oriented parts. Like Avalon, it was like a six to eight week sprint for all of us. And we got this thing out. And then now different projects do like more curiosity or more goal orientation at different times. Cool. [00:51:36]Swyx: Another one of your questions that we highlighted was, how can we enable artificial agents to permanently learn new abstractions and processes? I think this is might be called online learning. [00:51:45]Kanjun: Yeah. So I struggle with this because, you know, that scientist example I gave. As a scientist, I've like permanently learned a lot of new things. And I've updated and created new abstractions and learned them pretty reliably. And you were talking about like, okay, we have this RAM that we can store learnings in. But how well does online learning actually work? And the answer right now seems to be like, as models get bigger, they fine tune faster. So they're more sample efficient as they get bigger. [00

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Want to help define the AI Engineer stack? Have opinions on the top tools, communities and builders? We're collaborating with friends at Amplify to launch the first State of AI Engineering survey! Please fill it out (and tell your friends)!If AI is so important, why is its software so bad?This was the motivating question for Chris Lattner as he reconnected with his product counterpart on Tensorflow, Tim Davis, and started working on a modular solution to the problem of sprawling, monolithic, fragmented platforms in AI development. They announced a $30m seed in 2022 and, following their successful double launch of Modular/Mojo

The Lunar Society
George Hotz vs Eliezer Yudkowsky AI Safety Debate

The Lunar Society

Play Episode Listen Later Aug 17, 2023 88:22


George Hotz and Eliezer Yudkowsky hashed out their positions on AI safety.It was a really fun debate. No promises but there might be a round 2 where we better hone in on the cruxes that we began to identify here.Watch the livestreamed YouTube version (high quality video will be up next week).Catch the Twitter stream.Listen on Apple Podcasts, Spotify, or any other podcast platform. Check back here in about 24 hours for the full transcript. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dwarkeshpatel.com

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Invites are going out for AI Engineer Summit! In the meantime, we have just announced our first Actually Open AI event with Brev.dev and Langchain, Aug 26 in our SF HQ (we'll record talks for those remote). See you soon (and join the Discord)!Special thanks to @nearcyan for helping us arrange this with the Eleuther team.This post was on the HN frontpage for 15 hours.As startups and even VCs hoard GPUs to attract talent, the one thing more valuable than GPUs is knowing how to use them (aka, make GPUs go brrrr).There is an incredible amount of tacit knowledge in the NLP community around training, and until Eleuther.ai came along you pretty much had to work at Google or Meta to gain that knowledge. This makes it hard for non-insiders to even do simple estimations around costing out projects - it is well known how to trade $ for GPU hours, but trading “$ for size of model” or “$ for quality of model” is less known and more valuable and full of opaque “it depends”. This is why rules of thumb for training are incredibly useful, because they cut through the noise and give you the simple 20% of knowledge that determines 80% of the outcome derived from hard earned experience.Today's guest, Quentin Anthony from EleutherAI, is one of the top researchers in high-performance deep learning. He's one of the co-authors of Transformers Math 101, which was one of the clearest articulations of training rules of thumb. We can think of no better way to dive into training math than to have Quentin run us through a masterclass on model weights, optimizer states, gradients, activations, and how they all impact memory requirements.The core equation you will need to know is the following: `C = τT = 6PD`, where C is the compute requirements to train a model, P is the number of parameters, and D is the size of the training dataset in tokens. This is also equal to τ, the throughput of your machine measured in FLOPs (Actual FLOPs/GPU * # of GPUs), multiplied by T, the amount of time spent training the model.Taking Chinchilla scaling at face value, you can simplify this equation to be `C = 120(P^2)`.These laws are only true when 1000 GPUs for 1 hour costs the same as 1 GPU for 1000 hours, so it's not always that easy to make these assumptions especially when it comes to communication overhead. There's a lot more math to dive into here between training and inference, which you can listen to in the episode or read in the articles. The other interesting concept we covered is distributed training and strategies such as ZeRO and 3D parallelism. As these models have scaled, it's become impossible to fit everything in a single GPU for training and inference. We leave these advanced concepts to the end, but there's a lot of innovation happening around sharding of params, gradients, and optimizer states that you must know is happening in modern LLM training. If you have questions, you can join the Eleuther AI Discord or follow Quentin on Twitter. 
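To put the rule of thumb above into practice, here is a small back-of-the-envelope calculator based on `C = τT = 6PD` and the Chinchilla simplification `C = 120(P^2)`. The GPT-3-scale inputs (175B parameters, 300B tokens) and the assumed 150 TFLOP/s of achieved throughput per GPU across 1,024 GPUs are example numbers for illustration, not figures from the episode.

```python
# Back-of-the-envelope training cost from C = 6 * P * D (illustrative sketch).
# Assumed example inputs: a GPT-3-scale model (175B params, 300B tokens) and
# ~150 teraFLOP/s of achieved throughput per GPU on 1,024 GPUs.

def training_flops(params: float, tokens: float) -> float:
    """Total training compute C = 6 * P * D, in FLOPs."""
    return 6 * params * tokens

def training_days(params: float, tokens: float,
                  flops_per_gpu: float, num_gpus: int) -> float:
    """Wall-clock days from C = tau * T, where tau = achieved FLOPs/GPU * #GPUs."""
    tau = flops_per_gpu * num_gpus
    seconds = training_flops(params, tokens) / tau
    return seconds / 86_400

def chinchilla_flops(params: float) -> float:
    """Chinchilla-optimal D = 20 * P collapses C = 6PD into C = 120 * P**2."""
    return 120 * params ** 2

if __name__ == "__main__":
    P, D = 175e9, 300e9  # example model size (params) and dataset size (tokens)
    print(f"C = {training_flops(P, D):.2e} FLOPs")
    print(f"~{training_days(P, D, flops_per_gpu=150e12, num_gpus=1024):.0f} days "
          f"on 1,024 GPUs at an assumed 150 TFLOP/s each")
    print(f"Chinchilla-optimal compute for P={P:.2e}: {chinchilla_flops(P):.2e} FLOPs")
```

Running this gives roughly 3.15e23 FLOPs and about 24 days under these assumed throughput numbers; the point is the shape of the estimate, not the exact figures.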
Show Notes
* Transformers Math 101 Article
* Eleuther.ai
* GPT-NeoX 20B
* BLOOM
* Turing NLG
* Mosaic
* Oak Ridge & Frontier Supercomputer
* Summit Supercomputer
* Lawrence Livermore Lab
* RWKV
* Flash Attention
* Stas Bekman
Timestamps
* [00:00:00] Quentin's background and work at Eleuther.ai
* [00:03:14] Motivation behind writing the Transformers Math 101 article
* [00:05:58] Key equation for calculating compute requirements (tau x T = 6 x P x D)
* [00:10:00] Difference between theoretical and actual FLOPs
* [00:12:42] Applying the equation to estimate compute for GPT-3 training
* [00:14:08] Expecting 115+ teraflops/sec per A100 GPU as a baseline
* [00:15:10] Tradeoffs between Nvidia and AMD GPUs for training
* [00:18:50] Model precision (FP32, FP16, BF16 etc.) and impact on memory
* [00:22:00] Benefits of model quantization even with unlimited memory
* [00:23:44] KV cache memory overhead during inference
* [00:26:08] How optimizer memory usage is calculated
* [00:32:03] Components of total training memory (model, optimizer, gradients, activations)
* [00:33:47] Activation recomputation to reduce memory overhead
* [00:38:25] Sharded optimizers like ZeRO to distribute across GPUs
* [00:40:23] Communication operations like scatter and gather in ZeRO
* [00:41:33] Advanced 3D parallelism techniques (data, tensor, pipeline)
* [00:43:55] Combining 3D parallelism and sharded optimizers
* [00:45:43] Challenges with heterogeneous clusters for distribution
* [00:47:58] Lightning Round
Transcription
Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20]Swyx: Hey, today we have a very special guest, Quentin Anthony from Eleuther.ai. The context for this episode is that we've been looking to cover Transformers math for a long time. And then one day in April, there's this blog post that comes out that literally is called Transformers Math 101 from Eleuther. And this is one of the most authoritative posts that I've ever seen. And I think basically on this podcast, we're trying to give people an intuition around what are the rules of thumb that are important in thinking about AI and reasoning by AI. And I don't think there's anyone more credible than the people at Eleuther or the people training actual large language models, especially on limited resources. So welcome, Quentin. [00:00:59]Quentin: Thank you. A little bit about myself is that I'm a PhD student at Ohio State University, starting my fifth year now, almost done. I started with Eleuther during the GPT-NeoX 20B model. So they were getting started training that, they were having some problems scaling it. As we'll talk about, I'm sure today a lot, is that communication costs and synchronization and how do you scale up a model to hundreds of GPUs and make sure that things progress quickly is really difficult. That was really similar to my PhD work. So I jumped in and helped them on the 20B, getting that running smoothly. And then ever since then, just as new systems challenges arise, and as they move to high performance computing systems and distributed systems, I just sort of kept finding myself falling into projects and helping out there. So I've been at Eleuther for a little bit now, head engineer there now, and then finishing up my PhD and then, well, who knows where I'll go next. [00:01:48]Alessio: Awesome. What was the inspiration behind writing the article? Was it taking some of those learnings?
Obviously Eleuther is one of the most open research places out there. Is it just part of the DNA there or any fun stories there? [00:02:00]Quentin: For the motivation for writing, you very frequently see in like the DL training space, like these Twitter posts by like, for example, like Stas Bekman at Hugging Face, you'll see like a Twitter post that's like, oh, we just found this magic number and everything is like 20% faster. He's super excited, but doesn't really understand what's going on. And the same thing for us, we very frequently find that a lot of people understand the theory or maybe the fundamentals of why like AI training or inference works, but no one knows like the nitty gritty details of like, how do you get inference to actually run correctly on your machine split across two GPUs or something like that. So we sort of had all of these notes that we had accumulated and we're sort of sharing among engineers within Eleuther and we thought, well, this would really help a lot of other people. It's not really maybe appropriate for like a paper, but for something like a blog post or technical report, this would actually maybe squeeze a lot of performance out of people's hardware they're already running on. So I guess there are a lot of projects in Eleuther that we're sort of trying to share notes with people in a way that typical institutions don't. They sort of live within that institution and then you go to a different institution and they do something very similar, but without the lessons of the previous. And it's because everyone's trying to do their own special sauce with their own stack. Whereas Eleuther, we don't really have that constraint and we can just share everything to everybody. [00:03:14]Swyx: Yeah, this is a level of openness that basically very few people actually embrace. One, it's an extra effort to write things down, of course, but two, it is secret sauce and so that not many people do it. And therefore, oftentimes the only way to learn this stuff is to actually work in one of the large model labs. And so you guys are doing a lot. The only other instance where I can think of where people actually open sourced their process was Facebook's OPT. What else is similar, like sort of trade knowledge, but not formal research knowledge? [00:03:45]Quentin: I would say Bloom. So the Hugging Face Bloom project in big science and all of that, that was very open. I'd say it's the same caliber, if not more detailed than OPT. Other than that, I think there was like a doc from Microsoft on like their Turing NLG. Their paper is pretty relaxed in that it did talk about some of those challenges. Other than like OPT and Bloom and us, I can't think of any. It's a new thing. [00:04:10]Swyx: It matters that you are going for the sort of good enough rules of thumb, because I think a lot of people try to go for precision and being overly precise actually is not helpful. Right. Yes. [00:04:20]Quentin: You'll see some like statements in the blog posts that are just like, we think this is about 1.2 in our experience. And, you know, we don't go any further into detail and it would take maybe an extra month for us to chase down every single little piece of memory. But instead, like getting good enough is still helpful to people. [00:04:36]Alessio: Let's jump into it. The first part of the article, and we'll put this in the show notes so people will be following along with the post. So we don't need to read every single equation and every footnote for it. [00:04:46]Swyx: Okay. 
[00:04:46]Alessio: But the core equation here is that not the cost of compute, but the compute required to train a transformer model is roughly equal to tau times T, where tau is the hardware setup throughput that you have. So number of GPUs times the actual flops per GPU. And then T is the time spent. I think people can visualize that pretty easily. It's basically like how many GPUs do you have and how much do you let them run for? And the things that come to it that people have read before in the Chinchilla paper in a way, and the OpenAI scaling law is that you can then equal this to 6PD, where P is the number of parameters in the model and D is the size of the dataset in tokens. So talk a little bit about how people should think about the two. I think a lot of times the focus is on tokens parameter ratio in the training dataset and people don't think as much about the actual flops per GPU, which you're going to mention later in the blog post too, in terms of how much you can get out. So how should people think about this when they're building a model and where should they go to this equation as they're starting to think about training their own transformer-based [00:05:58]Swyx: model? [00:05:58]Quentin: You touched a little bit on the fact that people usually start with the dataset. So you have some dataset that you want to train a model on. And then from there, from the 6PD, you should see, okay, I should have about six tokens per parameter. So that determines my model size thereabouts for Chinchilla Optimal. So since then we've seen that you need more, something like 20 or more than that, to get a good quality model. But the next question that should be on your mind in terms of a systems perspective is how long is it going to take for this model to train and what kind of budget should I expect? So let's say I want some cloud instance for some amount of time and each of them will have some price attached to it. So that's where the throughput comes in. So now that you have this model, this number of parameters, you should map that to a transformer architecture and you should benchmark what throughput you get on your software stack for that type of model. So now you have your flops per second on a single GPU. And then given whatever parallelism scheme, which I'm sure we'll get into, like data parallelism or tensor parallelism or whatever else, how is that flops number going to scale to whatever number of GPUs? And then from there, you're going to get a time. And if you have a time, you have a cost. Those are like the business answers that you'll be able to get using this formula. That's why we sort of split it into the T and the throughput terms so that you can solve for one of them, which is usually get throughput, need time, and from time you get cost. In a nutshell, that's the answer. [00:07:19]Alessio: One thing that I noticed, you mentioned some of these laws are only true when a thousand GPUs for one hour cost the same as one GPU for a thousand hours, given that we have a shortage of the biggest GPUs out there. Any thoughts there on how people should prioritize this? [00:07:36]Quentin: Yeah, so I would say you should find what the minimum number of GPUs is to just fit your model first. The memory bottleneck is your biggest problem if you have a sizable model. If it's a small model, nobody cares. But most models that people care about will need to be split across multiple GPUs.
So find the minimum number of GPUs to just fit your one instance of your model and then calculate how long that's going to take. If it's a reasonable amount of time, then you're done. If it takes too long, then you need to start worrying about having multiple instances of that model. I always feel like you should go with the minimum number of GPUs because the more number of GPUs that you have, the more likely it is for things to break. So I would say just find out what time is reasonable for you and then fit the number of GPUs to that and no more. Because people get greedy and they say, if I have twice the GPUs, I can get this done in half the time. And then you end up taking three times the time because everything is breaking every day. And that's when I am up at midnight trying to fix your model that's broken. [00:08:34]Swyx: We had a previous guest which has invested a lot in their framework for training these things. Would there not be an equivalent open source framework you guys would have made that would help with scaling up GPUs linearly like that? Or is this an oversimplification? [00:08:50]Quentin: Okay, yeah. So maybe I should step back. Both Mosaic and us have our own sort of software stack recipe that scales well, theoretically. But I'll get to that in a minute. Mosaic is all based off optimizer sharding. So it's based off ZeRO. So you basically perfectly split your model optimizer and your parameters and your gradients across all of the different GPUs. So your aggregate memory is number of parameters divided by number of GPUs. Same thing for optimizer and so on. Whereas we at Eleuther use a Megatron deep speed based library. And for that, it's a bit more complex. So the efficiency can be a little higher, but it's more prone to failure at the same [00:09:30]Swyx: time. [00:09:30]Quentin: So you kind of have to tune it. In both cases, getting back to like the practical case, you should be able to get linear speed up by adding more GPUs. The problem is that there are hardware failures. You tend to have problems with like maybe loss will overflow if you have too many GPUs or maybe one GPU will hang. You might have software issues. You might have synchronization issues. And that's why I'm saying practically that you should take the minimum number of GPUs that you have because those are the easier cases to debug. That make sense? [00:10:00]Swyx: Yeah. [00:10:00]Quentin: Any more detail on any specific point? [00:10:02]Swyx: Not particularly, just because we haven't actually had to debug those things. But I imagine basically there's a lot of return towards encoding these knowledge into software and not repeating it again. So it makes a ton of sense. I think Alessio had more questions before we move too far into high level, more questions on just the equation itself. I think we want to spend time on essentially, this is the central equation of figuring out compute requirements. Yeah. [00:10:25]Alessio: Another thing in it is that the computer is like the forward pass and like the backwards pass and forward is 2PD, backward is 4PD. Why it's to the ratio between the two? Can you explain that? Why is it two and four? [00:10:39]Quentin: Yeah. [00:10:40]Alessio: Why is it twice the amount? [00:10:42]Quentin: Oh, okay. Intuitively for forward pass, you're just moving, you're propagating forward the inputs through the layer. And then in the backward pass, you're doing something a little more complex than that. You're doing back propagation. 
And I don't think I can explain it intuitively enough to go into more detail on the exact [00:10:58]Swyx: numbers. Yeah. [00:10:58]Quentin: That's okay. [00:10:59]Swyx: I feel like you want to get out a whiteboard and start drawing like, you know. [00:11:02]Quentin: That's what I would normally do. [00:11:03]Swyx: Tangents and gradients. It's actually surprisingly low to do the back propagation. Honestly, that's one of the fundamental things I love about the math of deep learning so far that as I've explored it, which is, it's surprisingly efficient as compared to other, I guess, numerical methods you might be exposed to and, you know, college calculus. Yeah. [00:11:22]Alessio: And I think the other thing is that things sound simple, you know, when people go on Twitter and say, Oh, 20 is like the optimal ratio. And it's like, then it's like, well, why is that the number? And the answer is usually much, much harder, like what we're seeing right now. So I think it's a, it's a good reminder that the numbers are simple, like all the best and most popular, like math equations are like, so elegant. Obviously the proof behind that is, it's not that easy. That's always a good reminder. [00:11:52]Swyx: I want to put this equation to the test a little bit. We can do this from either GPT-3's perspective or GPT-NeoX, whatever you're more comfortable with. You have this distinction of actual flops versus theoretical flops. And a lot of times when people report the flops it took to train a model, like we just saw one in Lama 2 where the estimate is something that the amount of flops and that's, that's what we go with. So GPT-3 took a 3.14 times 10 to the power 23 flops. That is the theoretical flops. I want to get to a point where I can sort of work out if a number passes the smell test. And I wonder how to do that because I should be able to plug in this equation, right? I know that GPT-3 was trained on 300 billion tokens. I know the parameter size of 175. Is it, is it just like a 6 times 175 times 300? Like I haven't done the math, but what are the nuances here that you might want to call out? [00:12:42]Quentin: Theoretical flops is usually given from, you have a given set of hardware and this is what you expect your hardware to get. The problem is that in practice, full utilization, that's the key word, right? Because in practice, there are a lot of cases where like you're spending time waiting on data movement from like the GPU to CPU. Or for example, you might be waiting to synchronize across the different GPUs. So there's a lot of idle time basically that you're going to be spending during training. [00:13:05]Swyx: Smell tests. [00:13:06]Quentin: I don't know if I have a smell test myself, to be honest, like maybe I'll look at like what sort of flops, what you would expect on like an A100. There's sort of just an expected flops for a given GPU that everyone sort of knows what you should expect. So like for an A100, that number is somewhere between 100 and 180. T flops is what you would expect to see on an A100. For a V100, like an older GPU, it's something more like 40 to 30. So people sort of know, given the kernels that we're running for a deep learning, what sort of flops you expect. And then you sort of compare that to the theory, to the theoretical flops that people are reporting and see if that matches your expectations. [00:13:47]Swyx: Yeah. 
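To make that smell test concrete, here is a short sketch that checks the reported GPT-3 figure against C ~= 6PD and converts the compute into GPU-hours at a few sustained (actual, not theoretical) per-A100 throughputs in the 100 to 180 teraFLOP/s range mentioned above (the script itself is an illustrative sketch, not from the article):

```python
# Smell-testing a reported training-compute number with C ~= 6 * P * D.

P = 175e9              # GPT-3 parameter count
D = 300e9              # GPT-3 training tokens
reported_flops = 3.14e23

estimate = 6 * P * D
print(f"estimated C ~= {estimate:.2e} FLOPs")                 # ~3.15e23
print(f"reported / estimated = {reported_flops / estimate:.3f}")

# Turning compute into GPU-hours at a few sustained per-GPU throughputs.
for tflops in (100, 140, 180):
    gpu_hours = estimate / (tflops * 1e12) / 3600
    print(f"at {tflops} TFLOP/s per GPU: ~{gpu_hours:,.0f} GPU-hours")
```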
[00:13:47]Alessio: And in the article you mentioned for the A100, like if you're seeing below 115 teraflops a second, there's something wrong with your model or hardware. How did you get to 115? Is it just, you know, production observability and like you've seen over months and months and months that like that's the baseline or how do you come up with the numbers like that? Yeah. [00:14:08]Quentin: For a number like that, we basically, we compared a lot of different frameworks. So like I mentioned before, Mosaic has their own framework and we have our own framework. They all have their own flop counters too, right? And we saw across a bunch of different hardware configurations that if you tune things correctly, you should be getting above 115 in pretty much all cases. So like there are some cases where things are tuned poorly or your system is a little weird, but we've never been able to get a new system and not been able to get above [00:14:35]Swyx: 115. [00:14:35]Quentin: If something is below 115, you have something really wrong in your software. But that's really all it is, is just comparing across software stacks and hardware systems. [00:14:44]Alessio: What about different GPUs? We had George Hotz on the podcast and he talked about AMD cards and how in theory their flops should be much better than some Nvidia cards, but the reality is like the CUDA runtime makes up for it. How should people think about improving that? You know, like do you see, okay, the A100 is like 115 teraflops. I'd rather just stick with this than try and figure out all the kinks of like a better AMD card or any thoughts there? [00:15:10]Swyx: Right. [00:15:10]Quentin: Well, that's sort of touching on developer time, right? And which ends up being more expensive because at the end of the day, the AMD and Rockham software stack has a long way to go. I would say most things run there, not particularly efficiently, but you're going to have weird bugs that no one has encountered before. One of the big pluses of going with the Nvidia and PyTorch stack is that there are thousands of GitHub issues with everyone facing the same problem as you and resolving them quickly and in an open source way is probably the biggest benefit of going with the Nvidia software stack right now. AMD has about the same hardware, software, not so much. And they haven't quite got the momentum in the open source realm, for example, to get close. Like something, for example, like Flash Attention, it's spread to more Nvidia GPU types than it has like to AMD at all. And waiting on those latest and greatest features to reach AMD is something that's prohibitive to a lot of people, but it's getting there. I'm running a lot of experiments on AMD right now because it's sort of reached the government lab supercomputers now. And so a lot of experiments are going there and it will catch up, I'd say within a few [00:16:14]Swyx: years. [00:16:14]Quentin: Awesome. [00:16:15]Swyx: Maybe just talk about what's available from the government labs and I heard the original, the origin of Eluther started with a grant for TPUs. Is that right? [00:16:24]Quentin: Yes, that was a little before me, but there was a lot of just like getting a grabbing a Google Cloud or TPU pod or something like that is a lot of the original TPU work on Mesh TensorFlow, which is like now like an ancient distributed deep learning library. [00:16:36]Quentin: Eluther got a grant, an insight grant with Oak Ridge last year, and we got quite a bit of Summit Compute. 
So Summit is a V100 based supercomputer. It's got some weirdness to it. So there's six V100 GPUs per node. And we did a lot of experiments there. It's a challenging system to scale to because your interconnect across nodes is kind of slow in comparison to within a node, which I think we'll get to later. But now Oak Ridge has moved to AMD. So the next grant that we're trying to work towards is on Frontier, which has four AMD GPUs per node and again has a slower interconnect across nodes. So we get all of those new challenges again to try and overlap things. But that's just like you have Oak Ridge, you have Lawrence Livermore. There's a lot of government supercomputers that you can apply for compute towards like open researchers too. It's sort of a new thing. I think we're one of the first like us and like Lion, for example, is another organization that's getting compute from government providers and such. They're all moving to AMD as well. And we look forward to exploring that with them. [00:17:42]Swyx: Yeah. [00:17:43]Alessio: The computing is definitely, it used to be easy to find the GPU. Now, not as much. So you got to find them anywhere. [00:17:49]Swyx: Yes. [00:17:49]Alessio: Let's talk about memory requirements a little bit. So you touched on this a little bit before and just before this, we had a trade out on the pockets from FlashAttention and memory speed was one of our main focuses, but this time we're being bound by actually memory size, like the VRAM itself, when it comes to model weights and parameters and optimizer states and all that fun stuff. Let's go through this and Sean, we can, we can take turns. There's a lot to cover here, but maybe we can start from model weights. So one topic we covered a lot in the past is precision and quantization. That's one of the obviously main driver of memory. You mentioned most of, in the article, most transformers are mixed precision, like FP16 plus FP32 or BF16 FP32, and they can be cast down. And you mentioned up to like INT8 without a lot of performance hit. So let's start there and maybe run people through some of the maths and like the byte per parameter ratio and different precision. [00:18:50]Swyx: Sure. [00:18:51]Quentin: So when I started deep learning, it was all FP32. You have 32 bits, four bytes per parameter. Things were pretty simple. You didn't have to do any loss scaling at all. But the problem was that you didn't get a whole lot of flops once NVIDIA moved to V100s and introduced Tensor cores. So Tensor cores do all of their computation at FP16 precision. So you're kind of throwing all of those away if you're doing things in FP32. So once the hardware moved to V100, the software moved to like mixed precision and APEX and AMP and such. And one counterintuitive part of mixed precision is that you actually require more memory when you're trained because you need an FP16 copy of the weights and an FP32 copy of the weights. The FP16 copy is where you're doing like your actual computation on the Tensor cores. So you get maybe it's not uncommon to get double the throughput that you would see before in FP32. And then you at each step update that FP32 copy with the FP16 update. So both need to be stored in memory. The problem with that is that FP16 is very precise but doesn't have a whole lot of range, [00:19:55]Swyx: dynamic range. [00:19:55]Quentin: So you have a really big mantissa if you're thinking in terms of like floating point representations, not a whole lot of exponent. 
So BF16 puts more of the bits from the mantissa back to the exponent. So you have a much higher range and a lower precision. And that gets rid of all of this instability problem and loss scaling and such that anyone familiar with debugging knows how unstable it can be, especially for large scale training. And BF16 does away with a lot of that, but it's only supported on A100s. So you see the back and forth between hardware and software. So every time NVIDIA introduces some new Tensor cores or BF16 support or something like that, the software adapts to support it and then training adapts. And then now you mentioned like Ind8 and such. Now we're seeing that you have some model that's been trained in FP16, FP32, whatever else. And then now you want to, with minimal loss and accuracy, quantize that model into a smaller representation like Ind8 and now like Ind4 and things like that and see what you can get away with. And then since deep learning is such like a stochastic problem that a lot of those last bits of precision don't really matter is what we're finding. And I expect that to continue. [00:21:06]Alessio: And so just to put some numbers to it, when you have a FP32, you need four bytes per parameter at inference time to load it in memory. If you have a eight bits model quantized down, you need one byte per parameter. So for example, in an H100, which is 80 gigabyte of memory, you could fit a 70 billion parameters in eight, you cannot fit a FP32 because you will need like 280 gigabytes of memory. So how much does that play into it? Like you mentioned it was all FP32 when you first started. Is it just like a development complexity thing, like going down to FP16 and then Ind8? Or if they could get a GPU with like a terabyte of VRAM, will people just load this memory as like FP32 weights or would they still want to quantize them to make them more efficient? Right. [00:22:00]Quentin: I would say even if you had infinite VRAM, you would still want a quantized model, just a bigger model that's quantized is what I would say. And that's because like I was mentioning there at the end, how like deep learning is very stochastic and a lot, you could have all the precision in the world, but ultimately it's meaningless when you still depend so much like on what the input is. And you depend so much on little variations and maybe a few more samples of training data would matter more. A lot of that precision in a nutshell doesn't really matter in deep learning. All that matters is the big picture. What is that neuron actually saying? And not the tiny details of what it might be thinking. Oh, I also wanted to mention that even if you have an A100, the actual model size is quite a bit smaller that you could load than what you mentioned. That's because of the KV cache. So the KV cache intuitively during inference, it only matters during inference and think intuitively if you're writing a paragraph, you want to remember every single previous word that you've written before you write the next word. So like what is autoregressive language modeling? It's filling in the next word, the next token. So if I say like the dog went to the, and I need to write the next word, I would say park or something. Before I write the next word, my memory is wiped and I have to read the whole thing again. That is life without a KV cache. And a KV cache says, remember everything that I've generated before, as well as all the context before what I've generated. 
But the memory overhead for a KV cache commonly is either comparable or larger than the model in some cases, if you have a really long context. And I think the exact equation is something like, oh, it's like two times the number of layers, times the number of heads, times the dimension of each head. And then there's two of those. You have one for K, one for V. But that was just a quick aside. Yeah. [00:23:44]Alessio: I know this is Transformers math, but do you think one of the interesting things about RNNs too, it's like moving away from this, like KV cache, the scales with the sequence length and having like a fixed sequence pass. I know those are some of the things that people are working on. [00:24:00]Swyx: Yeah. [00:24:00]Quentin: So there's a paper that I was involved with called RWKV that I would recommend people read. It is answering this exact question. So how do you get Transformers quality without this quadratic attention overhead that Transformers requires? So it is interesting. I don't know if I can really dive too deep into the technical details there. I'd recommend people read the paper. But yeah. [00:24:23]Swyx: Yeah. [00:24:23]Alessio: It's interesting to see if attention is all you need, or maybe attention is all we need, but we need better ways to make it infer in a good way. [00:24:33]Swyx: We've actually done an unreleased episode with one of the RWKV core members and they call it soft attention or light attention. I forget what they call it, but yeah, just ways to approximate it such that it's linear and not quadratic. That's great. Yeah. [00:24:47]Quentin: I didn't know that you were involved. [00:24:48]Swyx: That's great. How did you get involved? Is it just because like everyone just hangs out in Discord and talks about the future of Transformers? Oh yeah. [00:24:55]Quentin: I mean, the RWKV people specifically are in Eleuther all the time. Like they're very close collaboration with us. And my contribution was we have all of these experiments done by all of these people on RNNs and how they relate to Transformers and how do we turn that into a paper and disseminate that digestibly so that people don't have to read through like a Discord log from a year ago to understand what's going on. [00:25:16]Swyx: Oh my God. [00:25:16]Quentin: Just read this paper. So that took some work, but I wasn't a core contributor. So that's why I don't want to go into like the technical details. But yeah, that's how I did. [00:25:24]Swyx: We'll try to get that RWKV episode out. It seems like there's increasing mentions of it and they are doing pretty important work as far as scaling these models are concerned. Okay. So we discussed inference type quantization and memory requirements. And then you also had a section on training with a lot of stuff I think mentioned. I think we probably want to spend the most of our time on optimizer states and the Atom optimizer. Yeah. What are your takes on it and what should people keep in mind when they deal with these optimizers? Okay. [00:25:57]Quentin: I would say the Atom optimizer is good at what it does. It's sort of a broad question. So let me think. You have the copy of the weights and then you have your momentum and your variance that [00:26:08]Swyx: you store. [00:26:08]Quentin: And like, okay, maybe an intuitive explanation for momentum is that like, let's say you have a canyon and you're trying to get to the bottom. And if you're just doing basic SGD, then every step is going to be an equal size. 
Whereas if you're using something like Atom with the momentum term, then your steps should be progressively larger because you can see, oh, the general trend is we're heading downwards very quickly. But stepping back from that, since you have all of these extra terms in Atom, you require a lot more memory to store it. Like three times as much memory as SGD. And if you have all of this memory being spent on your optimizer states, then how do you distribute it across GPUs? Because you'll find that what ends up being your bottleneck more than just raw compute, raw flops on a given GPU is your parallelism. And that falls back onto how much model you can fit on a single GPU before you need to split it up across a bunch of GPUs. And then you end up spending time, more time with them talking to each other than actually making progress. So that's why all of this time in the blog post is spent on how do you distribute your model? What are all those different distributed strategies look like? Which ones are more efficient? And given that a lot of your memory is being spent optimizers, how do you distribute that optimizer specifically? Because a lot of people, when they talk about parallelism, they talk about model parallelism, the parameters themselves. In actuality, when you're training, a good portion of your memory is actually spent on optimizer states. So what specific part of that would you like to go into? Would you like to go into like zero or sharded optimizers? [00:27:36]Swyx: I think the sharded optimizer stuff is really interesting, but I think we're kind of leaving that towards the end, right? Because that's the maybe more advanced distributed sections. Here, I think we're just going for rough intuition for people who've maybe are familiar with the ideas of these optimizers, but haven't actually had to implement them yet. They read your code, but they don't really understand the intuition behind the code. I see. [00:28:00]Alessio: And Quentin, when you say in the blog post, it says, Adam is magic. How much of it is like actual magic, even to like people like you that are pretty close to the metal, so to speak? Are some of these things just come as gospel? It's like, I know this works, like I'm not touching it. I'm just leveraging it. How much of it are you actually thinking about improving on in your day-to-day work? I see. [00:28:22]Quentin: So I'm a systems guy. I'm an engineer. And a lot of these things come to me as magic. Adam comes to me as magic. I see it from the gods. I say, this is how a deep learning model is trained. And this is how the next step is calculated. And then I say, okay, how do I make that fast? I would say I do look at ways to improve upon it using things like second order optimizers. So there's a lot of research on there because they're hard to distribute. But the core contribution for me always comes down to someone else has done like some deep learning optimization and I need to make it run fast. So I can't really speak to the motivation of why Adam came about other than like simple, intuitive things like I mentioned with like the momentum. But what matters to me is that Adam takes more memory than SGD, specifically three times. And all of that memory needs to go somewhere and it needs to be split efficiently. [00:29:14]Swyx: Yeah. [00:29:14]Alessio: So when you add them all up, you got 12 bytes per parameter with vanilla Adam. [00:29:20]Swyx: Yeah. [00:29:20]Alessio: And then you still get the model parameters and memory too. 
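A minimal sketch of that 12-bytes-per-parameter accounting, plus the low-precision weights and gradients that also sit in memory for the whole run (the function and the 20B example are illustrative assumptions, not from the episode):

```python
# Static per-parameter memory for mixed-precision training with vanilla Adam:
#   weights:    2 bytes (FP16/BF16 working copy used on the tensor cores)
#   gradients:  2 bytes (same low precision as the weights)
#   optimizer: 12 bytes (FP32 master weights + momentum + variance, 4 bytes each)

def static_training_gib(n_params: float,
                        weight_bytes: int = 2,
                        grad_bytes: int = 2,
                        optim_bytes: int = 12) -> float:
    total_bytes = n_params * (weight_bytes + grad_bytes + optim_bytes)
    return total_bytes / 2**30

# Example: a 20B-parameter model needs roughly 300 GiB before any activations,
# which is why it has to be sharded across many GPUs rather than fit on one.
print(f"{static_training_gib(20e9):.0f} GiB")   # ~298 GiB
```

Activation memory comes on top of this and is dynamic, which is covered a little further down.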
So as you mentioned, you need to keep a copy of both for like a FP32, FP16 mixed, a copy of both quantization levels. So there's precision levels. So it's six bytes per parameter. Right. [00:29:36]Quentin: Taking a step back again, is that like, okay, most people think of your model getting big. So you need to split with model parallelism purely, something like tensor parallelism. But we can see that the model only takes like two bytes per parameter if we're doing FP16. Whereas the optimizer itself requires four bytes per parameter for the model states, four bytes for momentum, four bytes for variance. So what matters more is how do you split your optimizer efficiently and how do you store it efficiently? And something like bits and bytes, where the optimizer, you got like eight bit Adam, where those optimizer states is only one byte per parameter instead of four or something like that. That is going to give you a much better return on your model training and on your memory overhead required than if you were to, for example, quantize your pure like FP16 model weights down to int8 or something. So for training specifically, your optimizer memory matters a lot. The most in most cases. [00:30:31]Swyx: Well, yeah. [00:30:31]Alessio: And before we dive into zero, just to wrap up the items that you're going to shard later. So you have the parameters, you have the optimizer states, and then you have the gradients. Just maybe touch a little bit on that. And then we can talk about how to efficiently load them in GPUs. [00:30:48]Quentin: So the parameters are the FP32 copies of the parameters. We include them in the optimizer discussion. Some people don't, but just for clarity, it's 12 bytes per param for the optimizer states and four of them are for that FP32 copy of the weights. Four of them are for the momentum. I already went into why it's important to store momentum, but that's also per parameter. You need to store where that parameter is going and where it's been going in the past. You also need to know, okay, we know where it's going, but there's going to be bumps on this canyon that we're going down. So we need to store its variance. How often are those bumps? Should we be focusing more on the momentum? Or is this parameter just kind of jumping around everywhere? Those are all important answers that we need the optimizer to store, and it's per parameter. So that's where all three of those terms come from. And we also include some competing bits and bytes, for example, and SGD to show that depending on your optimizer, you may store all or none of these and in different representations. [00:31:50]Alessio: I'm looking at the total training memory. You essentially have model memory, optimizer memory, gradient memory, and activation memory. I think that's one of the last discussed things. So maybe just give people a little bit of a view. [00:32:03]Swyx: Yeah, this is completely new to me. [00:32:05]Alessio: Active, you know, recomputation, checkpointing, and all of that. [00:32:08]Swyx: Right. [00:32:09]Quentin: So, okay. So to summarize before activation checkpointing, which will be complicated, you have your model params, like I mentioned before, they used to be FP32. Now they're probably BF16, maybe FP16 if it's an older GPU. Then you have your optimizer. That's where a lot of the memory is going. And it's your high precision, usually FP32, copy of the weights. So that's four bytes per param.
And then you have, optionally, a couple more terms like we just discussed, like momentum or variance or whatever else, depending on what your optimizer is. Then you have your gradients. So your gradients is what is the gradient update that we get after running the forward pass on the model. And that's going to be whatever your low precision copy of the weights is. So like two bytes per param, if you're using FP16 or BF16. And all of those are sort of set in stone. And that overhead is not going to go away for the duration of training. Your gradients might get cleared after you back propagate them, but your optimizer states and your model states aren't going away. That memory overhead will be there. Activation recomputation and activation memory is dynamic. So some people will come and have this problem where the model loads fine for training. But then when you actually run your first iteration, or you run some future iteration or something like that, you run out of memory, seemingly at random. And it's because of these activations that you're computing on the fly. Good summary, or do you want to get into activation recomputation now, or do you want me to touch on anything else? [00:33:35]Alessio: Yeah, I was going to say, when is the recomputation happening? How does it decide between recomputing versus storing? And talk a bit more about that, maybe. [00:33:47]Quentin: Yeah, okay. So there's a lot of different ways to do this, but I would say there are a few main ones. First is a very simple scheme. You recompute everything. Every single activation that you calculate is just going to be either used or thrown away until the end. So in that case, you care very much about memory. You care very little about compute. Maybe this would be a case where you have to distribute across a lot of different GPUs, for example. And your communication speed is really low. Then that might be a good case for you to just recompute everything. It happens rarely, but it happens. Next up would be something like selective recomputation. So in selective recomputation, which Megatron has a good paper on, and I believe the figure that we have in our blog post is from, in that case, you sort of do a weighted decision for each activation. So for really big activation tensors, you decide, is this going to be more expensive to save in terms of memory or to recompute in terms of compute? So that's sort of the smart scheme that Megatron implements. And there's a lot of different heuristics they use. It's probably not worth mentioning off this super long equation on a pod, but you should go and read that paper if you're interested on selective recomputation. And then a really stupid scheme that most people go with, including NeoX, would be something like, instead of doing all of these heuristics, you just say, if my tensor is bigger than X, I throw it away. And you set X to some static number, and that's it. And that is good enough for a lot of cases. [00:35:18]Swyx: Why is it good enough? [00:35:20]Quentin: You don't want to store more than, you know, X-sized tensor. And some fall above that, some fall below it. And you're not trying to squeeze. You care more about getting something close enough to what the actual heuristic should be without actually computing the heuristic because you don't want to spend the time writing that heuristic code. [00:35:37]Swyx: Cool. I think that does take us on a grand tour of the memory math. Is there any sort of high-level takeaway before we go into the distributed stuff? 
Zero and all that. Perhaps more detail than most people have ever encountered. And so I'll repeat the equation that Alessio mentioned again, which is total training memory now has all these components that you've mapped out for the first time as far as we're concerned. Model memory, optimizer memory, activation memory, gradient memory. We covered quite a few algorithms as to the choices you can make there. Anything else that you want to mention about just memory math? I don't think so. [00:36:11]Quentin: I think that about covers it. I will say that it's a very different scheme for training and inference. It's common for people to say, oh, BF16 is the best. Done. Whereas a more correct take is that during training, precision matters a bit more. So BF16 will be around longer for training than it will for inference, in which case your model is sort of already baked. And it definitely doesn't need some of those last bits of precision so you can get away much easier with going to int8 for inference rather than training. So everything that you learn for training has to be relearned for inference and vice versa. [00:36:44]Swyx: There's a third category. You're talking about training versus inference. This third category is emerging with regards to fine-tuning and perhaps parameter-efficient methods of fine-tuning. The naive way to implement fine-tuning is just to do more training. But I don't know if you've developed any intuitions over fine-tuning that's worth inserting here. Any intuitions? If you were to write fine-tuning math, what would go in there? That might be an interesting diff to training math. [00:37:10]Quentin: I think there's a lot of questions that are unanswered for fine-tuning. For example, we know scaling laws for training. And some people have done scaling laws for fine-tuning. But how does a model that's already been trained on one domain transfer to another in terms of fine-tuning size? How many tokens per parameter should you have for your fine-tuning dataset? Maybe I'm ignorant, but I feel like a lot of those sort of practical questions on how a model can transfer and how a model can learn or grok some new ability that wasn't in its original training dataset is something that I would definitely put inside a fine-tuning blog post. [00:37:45]Swyx: Something related to perplexity and, I guess, diversity of the tokens that you get. [00:37:49]Quentin: Yeah, sort of dataset transfer is something that I would be curious in. Learning rate transfer is another one. So your model has some decayed learning rate over the course of training. How does that change for fine-tuning? Things like that. [00:38:00]Swyx: All right, cool. Thanks for indulging that stuff. Sure. Yeah. [00:38:03]Alessio: I think after all of this, you can quickly do the math and see that training needs to be distributed to actually work because we just don't have hardware that can easily run this. So let's talk a bit about that. So zero is one of the first things that you mentioned here, which is focused on sharded optimizers. Maybe run people through that and how to think about it. [00:38:25]Swyx: Sure. [00:38:25]Quentin: So zero is centered around two communication operations. And the first is scatter. And people should be looking at the zero figure that I think we have. [00:38:35]Swyx: Yeah. [00:38:36]Quentin: So there's a figure in the paper with parameters, gradients, and optimizer states that people should be looking at when I'm talking about this. Every GPU is going to get its own equal portion of the slice. 
And if we're doing... There are different stages of zero, but let's just start off with assuming that it's an equal slice of the optimizer states, gradients, and parameters. That would be zero three, stage three in that case. And we do that with a scatter. And the scatter takes, say, one over end GPUs, plus this offset of that slice goes to that GPU. Now all of the GPUs have an equal slice that's in its rank order. And then during each training step, that GPU is going to wait for all of the other slices to communicate so that we now have a whole pie on that GPU, that single GPU. Once we have that whole pie, we do the forward pass on it. And then we distribute that forward pass to all of the others using a gather. So it's a scatter, reduced scatter specifically, and then a gather back to all the others. And you do that each step. So the point of it is that you're sharding these states across GPUs. And with the different stages, you'll see in that figure that the optimizer state is taking the most proportion, which is because of what I mentioned before. We're including the FP32 copy and we're doing atom. So we need those four bytes per param for momentum and for variance. And then zero stage one, which is the most common one, is just optimizer. Zero stage two is optimizer plus gradients. And zero stage three is optimizer gradients and model parameters. But it all comes back to this splitting up and then gathering together back and forth over and over. So you get a lot of communication overhead from zero. But the plus part of that is that you can overlap a lot of that movement with computation. [00:40:23]Alessio: How do you get the optimal number of GPUs to do this on? Is there a way to shard too much as well and put too much overhead? [00:40:31]Quentin: It depends more on what your interconnect is. Taking a step back, there is synchronization that's required, a lot of it, across all of these GPUs. And those tend to be cumulative. So if you go to too many GPUs on an interconnect that's too slow, then you're going to end up spending more time synchronizing. And that magic number where you spend more time synchronizing is going to be different depending on what your fabric is and what your GPU memory is specifically. Just how small of a slice is each GPU getting? I can't, for example, for Summit, that number comes out to be about 20 billion parameters. Now you have 20 billion parameters, and then your magic number of GPUs for that is going to be something like 100 to 200 scale. Beyond that, you're just going to end up spending more time communicating. And the actual flops dipping below some predetermined number by you is going to be whatever your sweet spot ends up being. [00:41:24]Alessio: And then, so this one was like hard for me to go through, so I'm excited to have you run through it, which is a 3D parallelism. [00:41:33]Swyx: It's fancy, it's cutting edge. [00:41:35]Alessio: Yeah, let's talk a bit more about that and some of the work. [00:41:38]Quentin: Okay, 3D parallelism. So what is each dimension? First is the really basic one. That's data parallelism. And data parallelism is you have a copy of the model. Let's say for simplicity, one copy fits on one GPU perfectly. Data parallelism is that now you have two GPUs, so you have one copy on GPU one, one copy on GPU two. Both of them do the forward and backward pass and then synchronize and average the gradients. And then that's a step. Data parallelism for 3D parallelism is actually zero. 
So it's, you're sharding the optimizer states across all of your different GPUs. Next up is tensor parallelism. Tensor parallelism is you split your model. Like say, if you have two GPUs, you split your model down the middle and each GPU on its tensor specifically is going to do its forward or backward operation on its tensor. And then only when necessary, it'll synchronize that tensor operation with the other GPU. It's a bit more complex than something like pipeline parallelism, which is the third dimension. In pipeline parallelism, let's say you have four layers in your model. And you have four GPUs. You put one layer on each GPU and then GPU one does the forward pass and then sends the output of its activations to GPU two. It does the forward pass, sends activations to three, and you're just moving down a line. That is a naive scheme in that all of the other GPUs are doing nothing while a single GPU is doing its forward or backward pass. So the reason it's called pipeline parallelism is because you're splitting your mini batch into micro batches. So GPU one will do the forward pass on micro batch one and then send to GPU two. And then while GPU two is running on that first micro batch, GPU one is working on the next micro batch. And so you're sort of pipelining the movement and computation of each micro batch. The problem with that is that you need a really big batch size in order to split it up into both mini batches and micro batches. So combining all three of those together, you get a 3D mesh of where each parameter and optimizer state and so on maps to each GPU. And that's 3D parallelism. So let's start diving into details on what have that made sense, what should I jump into more on? [00:43:55]Alessio: I think the main question is, do you need all of the GPUs to be the same to do this? Or can you have mismatching GPUs as well? [00:44:03]Quentin: Okay, two things matter. If there's a difference in VRAM for the two different kinds of GPUs, then you're going to be bottlenecked by whichever GPU has the lower amount of VRAM because it's going to run out of memory. And then you can't like whatever's left on the larger GPUs is going to be empty. As far as I'm aware, there's no like GPU single GPU aware memory overhead scheme that would account for that. The second problem is that let's say all of your GPUs have the same amount of VRAM, but half of them are really slow. And the problem with that is that those synchronizations that I mentioned earlier are going to kill you. So you're going to move as quickly as your slowest GPU in that case. So in both cases, you end up regressing to your slowest or smallest GPU. So you might as well have the same GPUs for all of them. Otherwise, you're wasting the nicer ones. And that also goes to your CPUs and your interconnect. So going back to the 20 billion parameter model that Eleuther was training, that was on a cluster that was sort of Frankenstein made during COVID when there was all of that shortage of network switches and such like that. So every node had a different network switch. And so you ended up moving at the speed of the slowest switch and getting everything tuned properly so that it's not worse than the slowest switch was challenging and is like a real world problem that sometimes comes up. [00:45:28]Alessio: Is this work widely accepted? Like I hadn't learned about this before studying for this episode. Is this something that people are still trying and researching? Or is everybody just aware of this and running this in production? 
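As a toy illustration of how those dimensions combine, the sketch below factors a fixed pool of GPUs into data, tensor, and pipeline dimensions and estimates per-GPU memory when the optimizer states are ZeRO-sharded across the data-parallel ranks (all sizes, the stage choice, and the function name are illustrative assumptions, not from the episode):

```python
# Toy per-GPU memory estimate under 3D parallelism with ZeRO stage 1.
# world_size = dp * tp * pp; each GPU holds 1/(tp*pp) of the parameters,
# and the optimizer states are additionally sharded across the dp ranks.

def per_gpu_gib(n_params: float, dp: int, tp: int, pp: int,
                weight_bytes: int = 2, grad_bytes: int = 2,
                optim_bytes: int = 12) -> float:
    params_per_gpu = n_params / (tp * pp)        # tensor + pipeline split
    weights = params_per_gpu * weight_bytes      # low-precision working copy
    grads   = params_per_gpu * grad_bytes        # not sharded at stage 1
    optim   = params_per_gpu * optim_bytes / dp  # ZeRO-1 shards optimizer states
    return (weights + grads + optim) / 2**30

# Example: 20B parameters on 96 GPUs arranged as dp=8, tp=4, pp=3.
print(f"{per_gpu_gib(20e9, dp=8, tp=4, pp=3):.1f} GiB per GPU before activations")
```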
[00:45:43]Quentin: What is this specifically? [00:45:44]Alessio: Like the sharded optimizers plus the 3D parallelism, bringing the two things together and having this kind of mesh strategy. [00:45:51]Quentin: I would say that a lot of major GPT-based models use this scheme. A lot of them now are sort of going with just a pure zero scheme. So just a pure sharded. You just shard everything. And then since that's so easy, everyone gets an equal slice. There's no such thing as a pipeline stage. There's no such thing as what tensor should go on which GPU. Instead, we shard everything equally and treat everything equally. It's a much easier problem to debug, to checkpoint, to run training on than it is with this 3D parallel scheme. I say 3D parallel gives you the most control and also the most ways to go wrong. And depending on whether you have more engineers or whether you have more GPUs, that should decide which of these you go with. [00:46:35]Swyx: It's also not too hard, right? You've basically outlined the five or six different numbers that you need to keep in your head. And it doesn't feel impossible that if you need to achieve that level of control, you've given everybody the main levers to do it with. And that's wonderful. Definitely. [00:46:51]Quentin: The problem that comes up is like, say, like, okay, GPT-4 came out. Now we have VLLMs. [00:46:57]Swyx: Whoa, what are VLLMs? Oh, okay. Virtual LLMs, like the Mixture of Experts things? No, like visual. [00:47:03]Quentin: So now you have like multimodal models and such. How do you distribute that? Do you distribute it in a pipeline stage? And do you just shard it? Do you split the tensor and make a tensor parallel? It's sort of hard to change your model and add new features and such when you have this 3D parallel scheme. That's when I say hard. I mean, it's hard to sort of adapt and modify it to new features. [00:47:26]Alessio: I know we're at the hour mark, and I think we put our listeners through a very intense class today. So this was great, Quentin. And we're going to definitely link the article so that people can read it and follow along. Any other research that you're working on in this space that you want to shout out? I know one of our usual, I mean, wrong question is, what's the most interesting unsolved question in AI? So curious to hear if you think it's still on the training inference, math optimization, or are there more areas that people should pay attention to? [00:47:58]Quentin: I think in my area of research, there are two things that I think people should really care about. And the first is multimodal parallelism and RLHF. You were seeing more and more reinforcement learning and coming into the training loop. And so how do you split that some model or some GPUs are working on inference and some GPUs are working on training? And like I mentioned before, you have to relearn everything and they have very unique challenges. How do you split up a KV cache during training, for example? Those are challenges that are not well studied, I don't think. And then multimodal, you have like maybe a vision transformer and a text transformer. How do you split those up? Do you split them up equally? Do you put them on separate GPUs or do you just shard everything? And just maybe one GPU will have some vision, some text parameters. And then the second case I would say is that communication is very often a bottleneck. So we talk about 3D parallelism, but a lot of those like, for example, tensor parallelism, you can't go across nodes with.
You'll just get killed in communication. So what I'm getting to is how should you compress your communication before it happens? So on the fly compression, you have some buffer that needs to be communicated. You compress it with a GPU kernel, then you send it across the network and then you decompress it, something like that. Making people spend less money on communication fabrics and more on GPUs as intended is sort of a thing that people need to explore. I think those are my two. [00:49:26]Alessio: Sean, you went over the other half of the lightning round before we wrap it up. [00:49:30]Swyx: That's a good brain dump. Cool. Yeah, I have so many more questions on the multimodal stuff, but that should be for another time. Acceleration, what has already happened in AI that you thought would take much longer? [00:49:42]Quentin: I would say flash attention. Guys, just talk to Tree. And flash attention is just sort of a really great set of kernels that I thought would take a while to get to us. [00:49:51]Alessio: Well, Quentin, thank you very much, man. This was super informative and I think hopefully helps demystify a little bit the blog post. I think people open it and it's like a lot of math on it. And I think you walking them through it was super helpful. So thank you so much for coming on. [00:50:07]Swyx: Of course. [00:50:08]Quentin: And I'm happy to answer any questions that people have offline if they have them. I do read my email. [00:50:13]Swyx: Email and Discord. Of course, yeah. [00:50:15]Quentin: Discord I'm even faster on. [00:50:16]Alessio: Thank you, everyone. [00:50:18]Swyx: Thanks, Quentin. [00:50:19] Get full access to Latent Space at www.latent.space/subscribe

The Nonlinear Library
LW - Summary of and Thoughts on the Hotz/Yudkowsky Debate by Zvi

The Nonlinear Library

Play Episode Listen Later Aug 16, 2023 13:56


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Summary of and Thoughts on the Hotz/Yudkowsky Debate, published by Zvi on August 16, 2023 on LessWrong. George Hotz and Eliezer Yudkowsky debated on YouTube for 90 minutes, with some small assists from moderator Dwarkesh Patel. It seemed worthwhile to post my notes on this on their own. I thought this went quite well for the first half or so, then things went increasingly off the rails in the second half, as Hotz got into questions where he didn't have a chance to reflect and prepare, especially around cooperation and the prisoner's dilemma. First, some general notes, then specific notes I took while watching. Hotz was allowed to drive discussion. In debate terms, he was the con side, raising challenges, while Yudkowsky was the pro side defending a fixed position. These discussions often end up doing what this one did, which is meandering around a series of 10-20 metaphors and anchors and talking points, mostly repeating the same motions with variations, in ways that are worth doing once but not very productive thereafter. Yudkowsky has a standard set of responses and explanations, which he is mostly good at knowing when to pull out, but after a while one has heard them all. The key to a good conversation or debate with Yudkowsky is to allow the conversation to advance beyond those points or go in a new direction entirely. Mostly, once Yudkowsky had given a version of his standard response and given his particular refutation attempt on Hotz's variation of the question, Hotz would then pivot to another topic. This included a few times when Yudkowsky's response was not fully convincing and there was room for Hotz to go deeper, and I wish he would have in those cases. In other cases, and more often than not, the refutation or defense seemed robust. This standard set of responses meant that Hotz knew a lot of the things he wanted to respond to, and he prepared mostly good responses and points on a bunch of the standard references. Which was good, but I would have preferred to sidestep those points entirely. What would Tyler Cowen be asking in a CWT? Another pattern was Hotz asserting that things would be difficult for future ASIs (artificial superintelligences) because they are difficult for humans, or the task had a higher affinity for human-style thought in some form, often with a flat-out assertion that a task would prove difficult or slow. Hotz seemed to be operating under the theory that if he could break Yudkowsky's long chain of events at any point, that would show we were safe. Yudkowsky explicitly contested this on foom, and somewhat in other places as well. This seems important, as what Hotz was treating as load-bearing usually very much wasn't. Yudkowsky mentioned a few times that he was not going to rely on a given argument or pathway because although it was true it would strain credulity. This is a tricky balance; on the whole we likely need more of this. Later on, Yudkowsky strongly defended that ASIs would cooperate with each other and not with us, and the idea of a deliberate left turn. This clearly strained a lot of credulity with Hotz and I think with many others, and I do not think these assertions are necessary either. Hotz closes with a vision of ASIs running amok, physically fighting each other over resources, impossible to align even to each other. 
He then asserts that this will go fine for him and he is fine with this outcome despite not saying he inherently values the ASIs or what they would create. I do not understand this at all. Such a scenario would escalate far quicker than Hotz realizes. But even if it did not, this very clearly leads to a long term future with no humans, and nothing humans obviously value. Is 'this will take long enough that they won't kill literal me' supposed to make that acceptable? Here is my summary of important stat...

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
LLMs Everywhere: Running 70B models in browsers and iPhones using MLC — with Tianqi Chen of CMU / OctoML

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Aug 10, 2023 52:10


We have just announced our first set of speakers at AI Engineer Summit! Sign up for the livestream or email sponsors@ai.engineer if you'd like to support.We are facing a massive GPU crunch. As both startups and VC's hoard Nvidia GPUs like countries count nuclear stockpiles, tweets about GPU shortages have become increasingly common. But what if we could run LLMs with AMD cards, or without a GPU at all? There's just one weird trick: compilation. And there's one person uniquely qualified to do it.We had the pleasure to sit down with Tianqi Chen, who's an Assistant Professor at CMU, where he both teaches the MLC course and runs the MLC group. You might also know him as the creator of XGBoost, Apache TVM, and MXNet, as well as the co-founder of OctoML. The MLC (short for Machine Learning Compilation) group has released a lot of interesting projects:* MLC Chat: an iPhone app that lets you run models like RedPajama-3B and Vicuna-7B on-device. It gets up to 30 tok/s!* Web LLM: Run models like LLaMA-70B in your browser (!!) to offer local inference in your product.* MLC LLM: a framework that allows any language models to be deployed natively on different hardware and software stacks.The MLC group has just announced new support for AMD cards; we previously talked about the shortcomings of ROCm, but using MLC you can get performance very close to the NVIDIA's counterparts. This is great news for founders and builders, as AMD cards are more readily available. Here are their latest results on AMD's 7900s vs some of top NVIDIA consumer cards.If you just can't get a GPU at all, MLC LLM also supports ARM and x86 CPU architectures as targets by leveraging LLVM. While speed performance isn't comparable, it allows for non-time-sensitive inference to be run on commodity hardware.We also enjoyed getting a peek into TQ's process, which involves a lot of sketching:With all the other work going on in this space with projects like ggml and Ollama, we're excited to see GPUs becoming less and less of an issue to get models in the hands of more people, and innovative software solutions to hardware problems!Show Notes* TQ's Projects:* XGBoost* Apache TVM* MXNet* MLC* OctoML* CMU Catalyst* ONNX* GGML* Mojo* WebLLM* RWKV* HiPPO* Tri Dao's Episode* George Hotz EpisodePeople:* Carlos Guestrin* Albert GuTimestamps* [00:00:00] Intros* [00:03:41] The creation of XGBoost and its surprising popularity* [00:06:01] Comparing tree-based models vs deep learning* [00:10:33] Overview of TVM and how it works with ONNX* [00:17:18] MLC deep dive* [00:28:10] Using int4 quantization for inference of language models* [00:30:32] Comparison of MLC to other model optimization projects* [00:35:02] Running large language models in the browser with WebLLM* [00:37:47] Integrating browser models into applications* [00:41:15] OctoAI and self-optimizing compute* [00:45:45] Lightning RoundTranscriptAlessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20]Swyx: Okay, and we are here with Tianqi Chen, or TQ as people call him, who is assistant professor in ML computer science at CMU, Carnegie Mellon University, also helping to run Catalyst Group, also chief technologist of OctoML. You wear many hats. Are those, you know, your primary identities these days? Of course, of course. [00:00:42]Tianqi: I'm also, you know, very enthusiastic open source. 
So I'm also a VP and PRC member of the Apache TVM project and so on. But yeah, these are the things I've been up to so far. [00:00:53]Swyx: Yeah. So you did Apache TVM, XGBoost, and MXNet, and we can cover any of those in any amount of detail. But maybe what's one thing about you that people might not learn from your official bio or LinkedIn, you know, on the personal side? [00:01:08]Tianqi: Let me say, yeah, so normally when I do, I really love coding, even though like I'm trying to run all those things. So one thing that I keep a habit on is I try to do sketchbooks. I have a book, like real sketchbooks to draw down the design diagrams and the sketchbooks I keep sketching over the years, and now I have like three or four of them. And it's kind of a usually a fun experience of thinking the design through and also seeing how open source project evolves and also looking back at the sketches that we had in the past to say, you know, all these ideas really turn into code nowadays. [00:01:43]Alessio: How many sketchbooks did you get through to build all this stuff? I mean, if one person alone built one of those projects, he'll be a very accomplished engineer. Like you built like three of these. What's that process like for you? Like it's the sketchbook, like the start, and then you think about the code or like. [00:01:59]Swyx: Yeah. [00:02:00]Tianqi: So, so usually I start sketching on high level architectures and also in a project that works for over years, we also start to think about, you know, new directions, like of course generative AI language model comes in, how it's going to evolve. So normally I would say it takes like one book a year, roughly at that rate. It's usually fun to, I find it's much easier to sketch things out and then gives a more like a high level architectural guide for some of the future items. Yeah. [00:02:28]Swyx: Have you ever published this sketchbooks? Cause I think people would be very interested on, at least on a historical basis. Like this is the time where XGBoost was born, you know? Yeah, not really. [00:02:37]Tianqi: I started sketching like after XGBoost. So that's a kind of missing piece, but a lot of design details in TVM are actually part of the books that I try to keep a record of. [00:02:48]Swyx: Yeah, we'll try to publish them and publish something in the journals. Maybe you can grab a little snapshot for visual aid. Sounds good. [00:02:57]Alessio: Yeah. And yeah, talking about XGBoost, so a lot of people in the audience might know it's a gradient boosting library, probably the most popular out there. And it became super popular because many people started using them in like a machine learning competitions. And I think there's like a whole Wikipedia page of like all state-of-the-art models. They use XGBoost and like, it's a really long list. When you were working on it, so we just had Tri Dao, who's the creator of FlashAttention on the podcast. And I asked him this question, it's like, when you were building FlashAttention, did you know that like almost any transform race model will use it? And so I asked the same question to you when you were coming up with XGBoost, like, could you predict it would be so popular or like, what was the creation process? And when you published it, what did you expect? We have no idea. [00:03:41]Tianqi: Like, actually, the original reason that we built that library is that at that time, deep learning just came out. Like that was the time where AlexNet just came out. 
And one of the ambitious missions that myself and my advisor, Carlos Guestrin, had then is we want to think about, you know, trying to test the hypothesis. Can we find alternatives to deep learning models? Because then, you know, there are other alternatives like, you know, support vector machines, linear models, and of course, tree-based models. And our question was, if you build those models and feed them with big enough data, because usually like one of the key characteristics of deep learning is that it's taking a lot [00:04:22]Swyx: of data, right? [00:04:23]Tianqi: So will we be able to get the same amount of performance? That's a hypothesis we're setting out to test. Of course, if you look at now, right, that's a wrong hypothesis, but as a byproduct, what we find out is that, you know, most of the gradient boosting libraries out there were not efficient enough for us to test that hypothesis. So I happen to have quite a bit of experience in the past of building gradient boosting trees and their variants. So effectively, XGBoost was kind of like a byproduct of that hypothesis testing. At that time, I'm also competing a bit in data science challenges, like I worked on KDDCup and then Kaggle kind of became bigger, right? So I kind of think maybe it's becoming useful to others. One of my friends convinced me to try to do a Python binding of it. That turned out to be like a very good decision, right. Originally when I built it, we felt like maybe a command line interface is okay. And now we have a Python binding, we have R bindings. And then we realized, you know, it started getting interesting. People started contributing different perspectives, like visualization and so on. So we started to push a bit more on to building distributed support to make sure it works on any platform and so on. And even at that time point, when I talked to Carlos, my advisor, later, he said he never anticipated that we'll get to that level of success. And actually, why I pushed for gradient boosting trees, interestingly, at that time, he also disagreed. He thinks that maybe we should go for kernel machines then. And it turns out, you know, actually, we are both wrong in some sense, and Deep Neural Network was the king of the hill. But at least the gradient boosting direction got into something fruitful. [00:06:01]Swyx: Interesting. [00:06:02]Alessio: I'm always curious when it comes to these improvements, like, what's the design process in terms of like coming up with it? And how much of it is collaborative with like other people that you're working with versus like trying to be, you know, obviously, in academia, it's like very paper-driven kind of research driven. [00:06:19]Tianqi: I would say the XGBoost improvement at that time point was more on like, you know, I'm trying to figure out, right. But it's combining lessons. Before that, I did work on some of the other libraries on matrix factorization. That was like my first open source experience. Nobody knew about it, because you'll find, likely, if you go and try to search for the package SVDFeature, you'll find some SVN repo somewhere. But it's actually being used for some of the recommender system packages. So I'm trying to apply some of the previous lessons there and trying to combine them. The later projects like MXNet and then TVM are much, much more collaborative in a sense that... But, of course, XGBoost has become bigger, right? So when we started that project myself, and then we have, it's really amazing to see people come in. 
Michael, who was a lawyer, and now he works on the AI space as well, on contributing visualizations. Now we have people from our community contributing different things. So XGBoost even today, right, it's a community of committers driving the project. So it's definitely something collaborative and moving forward on getting some of the things continuously improved for our community. [00:07:37]Alessio: Let's talk a bit about TVM too, because we got a lot of things to run through in this episode. [00:07:42]Swyx: I would say that at some point, I'd love to talk about this comparison between XGBoost or tree-based type AI or machine learning compared to deep learning, because I think there is a lot of interest around, I guess, merging the two disciplines, right? And we can talk more about that. I don't know where to insert that, by the way, so we can come back to it later. Yeah. [00:08:04]Tianqi: Actually, what I said, when we test the hypothesis, the hypothesis is kind of, I would say it's partially wrong, because the hypothesis we want to test now is, can you run tree-based models on image classification tasks, where deep learning is certainly a no-brainer right [00:08:17]Swyx: now today, right? [00:08:18]Tianqi: But if you try to run it on tabular data, still, you'll find that most people opt for tree-based models. And there's a reason for that, in the sense that when you are looking at tree-based models, the decision boundaries are naturally rules that you're looking at, right? And they also have nice properties, like being able to be agnostic to scale of input and be able to automatically compose features together. And I know there are attempts on building neural network models that work for tabular data, and I also sometimes follow them. I do feel like it's good to have a bit of diversity in the modeling space. Actually, when we're building TVM, we build cost models for the programs, and actually we are using XGBoost for that as well. I still think tree-based models are going to be quite relevant, because first of all, it's really easy to get it to work out of the box. And also, you will be able to get a bit of interpretability and control monotonicity [00:09:18]Swyx: and so on. [00:09:19]Tianqi: So yes, it's still going to be relevant. I also sometimes keep coming back to think about, are there possible improvements that we can build on top of these models? And definitely, I feel like it's a space that can have some potential in the future. [00:09:34]Swyx: Are there any current projects that you would call out as promising in terms of merging the two directions? [00:09:41]Tianqi: I think there are projects that try to bring a transformer-type model for tabular data. I don't remember specifics of them, but I think even nowadays, if you look at what people are using, tree-based models are still one of their toolkits. So I think maybe eventually it's not even a replacement, it will be just an ensemble of models that you can call. Perfect. [00:10:07]Alessio: Next up, about three years after XGBoost, you built this thing called TVM, which is now a very popular compiler framework for models. Let's talk about, so this came out about at the same time as ONNX. So I think it would be great if you could maybe give a little bit of an overview of how the two things work together. Because it's kind of like the model, then goes to ONNX, then goes to TVM. But I think a lot of people don't understand the nuances. Can we get a bit of a backstory on that? 
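
As a small aside on the tree-based model properties mentioned above (working out of the box on tabular data, plus control over monotonicity), here is a hedged sketch using the xgboost Python package on synthetic data; the dataset and constraint choices are purely illustrative.

# Sketch: XGBoost on synthetic tabular data with a monotonicity constraint.
# Assumes the xgboost and scikit-learn packages; data and constraints are illustrative.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
# Target rises with feature 0, falls with feature 1, ignores feature 2.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=5000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    # +1 = non-decreasing in that feature, -1 = non-increasing, 0 = unconstrained.
    monotone_constraints=(1, -1, 0),
)
model.fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))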
[00:10:33]Tianqi: So actually, that's kind of an ancient history. Before XGBoost, I worked on deep learning for two years or three years. I got a master's before I started my PhD. And during my master's, my thesis focused on applying convolutional restricted Boltzmann machines for ImageNet classification. That is the thing I'm working on. And that was before the AlexNet moment. So effectively, I had to handcraft NVIDIA CUDA kernels on, I think, a GTX 2070 card. It took me about six months to get one model working. And eventually, that model is not so good, and we should have picked a better model. But that was like an ancient history that really got me into this deep learning field. And of course, eventually, we find it didn't work out. So in my master's, I ended up working on recommender systems, which got me a paper, and I applied and got a PhD. But I always wanted to come back to work on the deep learning field. So after XGBoost, I think I started to work with some folks on this particular project, MXNet. At that time, frameworks like Caffe, Theano, PyTorch hadn't yet come out. And we're really working hard to optimize for performance on GPUs. At that time, I found it's really hard, even for NVIDIA GPUs. It took me six months. And then it's amazing to see, on different hardware, how hard it is to go and optimize code for the platforms that are interesting. So that gets me thinking, can we build something more generic and automatic? So that I don't need an entire team of so many people to go and build those frameworks. So that's the motivation of starting working on TVM. There was really too much machine learning engineering needed to support deep learning models on the platforms that we're interested in. I think TVM started a bit earlier than ONNX, but once it got announced, I think it was in a similar time period. So overall, how it works is that with TVM, you will be able to take a subset of machine learning programs that are represented in what we call a computational graph. Nowadays, we can also represent loop-level programs ingested from your machine learning models. Usually, you have model formats like ONNX, or in PyTorch, they have the FX tracer that allows you to trace the FX graph. And then it goes through TVM. We also realized that, well, yes, it needs to be more customizable, so it will be able to perform some of the compilation optimizations like fusing operators together, doing smart memory planning, and more importantly, generating low-level code. So that works for NVIDIA and also is portable to other GPU backends, even non-GPU backends [00:13:36]Swyx: out there. [00:13:37]Tianqi: So that's a project that actually has been my primary focus over the past few years. And it's great to see how it started from where I think we were the very early initiators of machine learning compilation. I remember there was a visit one day, one of the students asked me, are you still working on deep learning frameworks? I told them that I'm working on ML compilation. And they said, okay, compilation, that sounds very ancient. It sounds like a very old field. And why are you working on this? And now it's starting to get more traction, like if you say Torch Compile and other things. I'm really glad to see this field starting to pick up. And also we have to continue innovating here. [00:14:17]Alessio: I think the other thing that I noticed is, it's kind of like a big jump in terms of area of focus to go from XGBoost to TVM, it's kind of like a different part of the stack. 
Why did you decide to do that? And I think the other thing about compiling to different GPUs and eventually CPUs too, did you already see some of the strain that models could have just being focused on one runtime, only being on CUDA and that, and how much of that went into it? [00:14:50]Tianqi: I think it's less about trying to get impact, more about wanting to have fun. I like to hack code, I had great fun hacking CUDA code. Of course, being able to generate CUDA code is cool, right? But now, after being able to generate CUDA code, okay, by the way, you can do it on other platforms, isn't that amazing? So it's more of that attitude to get me started on this. And also, I think when we look at different researchers, myself is more like a problem solver type. So I like to look at a problem and say, okay, what kind of tools we need to solve that problem? So regardless, it could be building better models. For example, while we built XGBoost, we built certain regularizations into it so that it's more robust. It also means building system optimizations, writing low-level code, maybe trying to write assembly and build compilers and so on. So as long as they solve the problem, definitely go and try to do them together. And I also see it's a common trend right now. Like if you want to be able to solve machine learning problems, it's no longer at the algorithm layer, right? You kind of need to solve it from both the algorithm, data, and systems angle. And this entire field of machine learning systems, I think it's kind of emerging. And there's now a conference around it. And it's really good to see a lot more people are starting to look into this. [00:16:10]Swyx: Yeah. Are you talking about ICML or something else? [00:16:13]Tianqi: So machine learning and systems, right? So not only machine learning, but machine learning and systems. So there's a conference called MLsys. It's definitely a smaller community than ICML, but I think it's also an emerging and growing community where people are talking about what are the implications of building systems for machine learning, right? And how do you go and optimize things around that and co-design models and systems together? [00:16:37]Swyx: Yeah. And you were area chair for ICML and NeurIPS as well. So you've just had a lot of conference and community organization experience. Is that also an important part of your work? Well, it's kind of expected for academics. [00:16:48]Tianqi: If I hold an academic job, I need to do services for the community. Okay, great. [00:16:53]Swyx: Your most recent venture in MLsys is going to the phone with MLC LLM. You announced this in April. I have it on my phone. It's great. I'm running Llama 2, Vicuna. I don't know what other models that you offer. But maybe just kind of describe your journey into MLC. And I don't know how this coincides with your work at CMU. Is that some kind of outgrowth? [00:17:18]Tianqi: I think it's more like a focused effort that we want in the area of machine learning compilation. So it's kind of related to what we built in TVM. So when we built TVM, that was five years ago, right? And a lot of things happened. We built the end-to-end machine learning compiler that works, the first one that works. But then we captured a lot of lessons there. So then we are building a second iteration called TVM Unity. That allows us to be able to allow ML engineers to quickly capture new models and on-demand build optimizations for them. And MLC LLM is kind of like an MLC. 
It's more like a vertically driven organization where we go and build tutorials and go and build projects like LLM solutions. So that to really show like, okay, you can take machine learning compilation technology and apply it and bring something fun forward. Yeah. So yes, it runs on phones, which is really cool. But the goal here is not only making it run on phones, right? The goal is making it deploy universally. So we do run on Apple M2 Macs, the 70 billion models. Actually, on a single batch inference, more recently on CUDA, we get, I think, the best performance you can get out there already on the 4-bit inference. Actually, as I alluded earlier before the podcast, we just had a result on AMD. And on a single batch, actually, we can get the latest AMD GPU. This is a consumer card. It can get to about 80% of the 4090, NVIDIA's best consumer card out there. So it's not yet on par, but thinking about the diversity and what you can enable and the previous things you can get on that card, it's really amazing what you can do with this kind of technology. [00:19:10]Swyx: So one thing I'm a little bit confused by is that most of these models are in PyTorch, but you're running this inside TVM. I don't know. Was there any fundamental change that you needed to do, or was this basically the fundamental design of TVM? [00:19:25]Tianqi: So the idea is that, of course, it comes back to program representation, right? So effectively, TVM has this program representation called TVM script that contains more like computational graph and operational representation. So yes, initially, we do need to take a bit of effort of bringing those models onto the program representation that TVM supports. Usually, there are a mix of ways, depending on the kind of model you're looking at. For example, for vision models and stable diffusion models, usually we can just do tracing that takes a PyTorch model onto TVM. That part is still being robustified so that we can bring more models in. On language model tasks, actually what we do is we directly build some of the model constructors and try to directly map from Hugging Face models. The goal is if you have a Hugging Face configuration, we will be able to bring that in and apply optimization on them. So one fun thing about model compilation is that your optimization doesn't happen only at the source language level, right? For example, if you're writing PyTorch code, you just go and try to use a better fused operator at a source code level. Torch compile might help you do a bit of things in there. In most of the model compilations, it not only happens at the beginning stage, but we also apply generic transformations in between, also through a Python API. So you can tweak some of that. So that part of optimization helps a lot of uplifting in getting both performance and also portability on the environment. And another thing that we do have is what we call universal deployment. So if you get the ML program into this TVM script format, where there are functions that take in tensors and output tensors, we will be able to have a way to compile it. So you will be able to load the function in any of the language runtimes that TVM supports. So you could load it in JavaScript, and that's a JavaScript function that takes in tensors and outputs tensors. The same if you're loading it in Python, of course, and C++ and Java. So the goal there is really to bring the ML model to the language that people care about and be able to run it on a platform they like. 
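
For readers who want to see what this compile-then-deploy flow looks like in code, here is a rough sketch using TVM's classic Relay Python API; MLC LLM itself builds on the newer TVM Unity stack, so this is only an approximation of the pipeline shape (import a graph, optimize it, generate code for a target, call it from the host language), and the model file, input name, and shape are placeholders.

# Rough sketch of a compile-then-deploy flow with TVM's classic Relay API.
# MLC LLM uses the newer TVM Unity stack, but the pipeline shape is similar:
# import a graph, optimize it, generate code for a target, then call it from
# the host language. The model file, input name, and shape are placeholders.
import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("resnet18.onnx")                  # placeholder model file
mod, params = relay.frontend.from_onnx(onnx_model, {"input": (1, 3, 224, 224)})

target = "llvm"  # could also be "cuda", "metal", "vulkan", ... depending on the deployment target
with tvm.transform.PassContext(opt_level=3):             # operator fusion, memory planning, codegen
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(target, 0)
runtime = graph_executor.GraphModule(lib["default"](dev))
runtime.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))
runtime.run()
print(runtime.get_output(0).shape)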
[00:21:37]Swyx: It strikes me that I've talked to a lot of compiler people, but you don't have a traditional compiler background. You're inventing your own discipline called machine learning compilation, or MLC. Do you think that this will be a bigger field going forward? [00:21:52]Tianqi: First of all, I do work with people working on compilation as well. So we're also taking inspirations from a lot of early innovations in the field. Like for example, TVM initially, we take a lot of inspirations from Halide, which is just an image processing compiler. And of course, since then, we have evolved quite a bit to focus on the machine learning related compilations. If you look at some of our conference publications, you'll find that machine learning compilation is already kind of a subfield. So if you look at papers in both machine learning venues, the MLC conferences, of course, and also system venues, every year there will be papers around machine learning compilation. And in the compiler conference called CGO, there's a C4ML workshop that is also kind of trying to focus on this area. So definitely it's already starting to gain traction and becoming a field. I wouldn't claim that I invented this field, but definitely I helped to work with a lot of folks there. And I try to bring a perspective, of course, trying to learn a lot from the compiler optimizations as well as trying to bring in knowledge in machine learning and systems together. [00:23:07]Alessio: So we had George Hotz on the podcast a few episodes ago, and he had a lot to say about AMD and their software. So when you think about TVM, are you still restricted in a way by the performance of the underlying kernel, so to speak? So if your target is like a CUDA runtime, you still get better performance, no matter like TVM kind of helps you get there, but then that level you don't take care of, right? [00:23:34]Swyx: There are two parts in here, right? [00:23:35]Tianqi: So first of all, there is the lower level runtime, like the CUDA runtime. And then actually for NVIDIA, a lot of the moat came from their libraries, like CUTLASS, cuDNN, right? Those library optimizations. And also for specialized workloads, actually you can specialize them. Because a lot of cases you'll find that if you go and do benchmarks, it's very interesting. Like two years ago, if you try to benchmark ResNet, for example, usually the NVIDIA library [00:24:04]Swyx: gives you the best performance. [00:24:06]Tianqi: It's really hard to beat them. But as soon as you start to change the model to something, maybe a bit of a variation of ResNet, not for the traditional ImageNet detections, but for latent detection and so on, there will be some room for optimization because people sometimes overfit to benchmarks. There are people who go and optimize things, right? So people overfit the benchmarks. So that's the largest barrier, like being able to get low-level kernel libraries, right? In that sense, the goal of TVM is actually we try to have a generic layer to both, of course, leverage libraries when available, but also be able to automatically generate [00:24:45]Swyx: libraries when possible. [00:24:46]Tianqi: So in that sense, we are not restricted by the libraries that they have to offer. That's why we will be able to run Apple M2 or WebGPU where there's no library available because we are kind of like automatically generating libraries. That makes it easier to support less well-supported hardware, right? For example, WebGPU is one example. 
From a runtime perspective, AMD, I think before their Vulkan driver was not very well supported. Recently, they are getting good. But even before that, we'll be able to support AMD through this GPU graphics backend called Vulkan, which is not as performant, but it gives you a decent portability across those [00:25:29]Swyx: hardware. [00:25:29]Alessio: And I know we got other MLC stuff to talk about, like WebLLM, but I want to wrap up on the optimization that you're doing. So there's kind of four core things, right? Kernel fusion, which we talked a bit about in the FlashAttention episode and the tinygrad one, memory planning, and loop optimization. I think those are like pretty, you know, self-explanatory. I think the one that people have the most questions, can you quickly explain [00:25:53]Swyx: those? [00:25:54]Tianqi: So there are kind of a different things, right? Kernel fusion means that, you know, if you have an operator like convolutions, or in the case of a transformer an MLP, you have other operators that follow that, right? You don't want to launch two GPU kernels. You want to be able to put them together in a smart way, right? And as for memory planning, it's more about, you know, hey, if you run like Python code, every time when you generate a new array, you are effectively allocating a new piece of memory, right? Of course, PyTorch and other frameworks try to optimize for you. So there is a smart memory allocator behind the scene. But actually, in a lot of cases, it's much better to statically allocate and plan everything ahead of time. And that's where like a compiler can come in. We need to, first of all, actually for language models, it's much harder because of dynamic shapes. So you need to be able to do what we call symbolic shape tracing. So we have like a symbolic variable that tells you like the shape of the first tensor is n by 12. And the shape of the third tensor is also n by 12. Or maybe it's n times 2 by 12. Although you don't know what n is, right? But you will be able to know that relation and be able to use that to reason about like fusion and other decisions. So besides this, I think loop transformation is quite important. And it's actually non-traditional. Originally, if you simply write code and you want to get performance, it's very hard. For example, you know, if you write a matrix multiplier, the simplest thing you can do is you do, for i, j, k: C[i][j] += A[i][k] * B[k][j]. But that code is 100 times slower than the best available code that you can get. So we do a lot of transformation, like being able to take the original code, trying to put things into shared memory, and making use of tensor cores, making use of memory copies, and all this. Actually, all these things, we also realize that, you know, we cannot do all of them. So we also make the ML compilation framework as a Python package, so that people will be able to continuously improve that part of engineering in a more transparent way. So we find that's very useful, actually, for us to be able to get good performance very quickly on some of the new models. Like when Llama 2 came out, we were able to go and look at the whole thing, here's the bottleneck, and we can go and optimize those. [00:28:10]Alessio: And then the fourth one being weight quantization. So everybody wants to know about that. And just to give people an idea of the memory saving, if you're doing FP32, it's like four bytes per parameter. Int8 is like one byte per parameter. 
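
To put rough numbers on that, here is a quick back-of-the-envelope calculation of weight storage at the different precisions being discussed (weights only; the KV cache and activations add to this):

# Back-of-the-envelope weight memory for the precisions discussed above.
# Weights only; KV cache and activation memory come on top of this.
def weight_gib(num_params: float, bits_per_param: int) -> float:
    return num_params * bits_per_param / 8 / 2**30

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>5}: 7B -> {weight_gib(7e9, bits):5.1f} GiB, 70B -> {weight_gib(70e9, bits):6.1f} GiB")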
So you can really shrink down the memory footprint. What are some of the trade-offs there? How do you figure out what the right target is? And what are the precision trade-offs, too? [00:28:37]Tianqi: Right now, a lot of people mostly use int4 now for language models. So that really shrinks things down a lot. And more recently, actually, we started to think that, at least in MLC, we don't want to have a strong opinion on what kind of quantization we want to bring, because there are so many researchers in the field. So what we can do is we can allow developers to customize the quantization they want, but we still bring the optimized code for them. So we are working on this item called bring your own quantization. In fact, hopefully MLC will be able to support more quantization formats. And definitely, I think it's an open field that's being explored. Can you bring more sparsities? Can you quantize activations as much as possible, and so on? And it's going to be something that's going to be relevant for quite a while. [00:29:27]Swyx: You mentioned something I wanted to double back on, which is most people use int4 for language models. This is actually not obvious to me. Are you talking about the GGML type people, or even the researchers who are training the models also using int4? [00:29:40]Tianqi: Sorry, so I'm mainly talking about inference, not training, right? So when you're doing training, of course, int4 is harder, right? Maybe you could do some form of mixed precision for inference. I think int4 is kind of like, in a lot of cases, you will be able to get away with int4. And actually, that does bring a lot of savings in terms of the memory overhead, and so on. [00:30:09]Alessio: Yeah, that's great. Let's talk a bit about maybe the GGML, then there's Mojo. How should people think about MLC? How do all these things play together? I think GGML is focused on model level re-implementation and improvements. Mojo is a language, a superset of Python. You're more at the compiler level. Do you all work together? Do people choose between them? [00:30:32]Tianqi: So I think in this case, I think it's great to say the ecosystem becomes so rich with so many different ways. So in our case, GGML is more like you're implementing something from scratch in C, right? So that gives you the ability to go and customize each of a particular hardware backend. But then you will need to write your own CUDA kernels, and you write them optimally for AMD, and so on. So the kind of engineering effort is a bit more broadened in that sense. Mojo, I have not looked at specific details yet. I think it's good to start to say, it's a language, right? I believe there will also be machine learning compilation technologies behind it. So it's good to say, interesting place in there. In the case of MLC, our case is that we do not want to have an opinion on how, where, which language people want to develop, deploy, and so on. And we also realize that actually there are two phases. We want to be able to develop and optimize your model. By optimization, I mean, really bring in the best CUDA kernels and do some of the machine learning engineering in there. And then there's a phase where you want to deploy it as a part of the app. So if you look at the space, you'll find that GGML is more like, I'm going to develop and optimize in the C language, right? And then most of the low-level languages they have. And Mojo is that you want to develop and optimize in Mojo, right? And you deploy in Mojo. 
In fact, that's the philosophy they want to push for. In the MLC case, we find that actually if you want to develop models, the machine learning community likes Python. Python is a language that you should focus on. So in the case of MLC, we really want to be able to enable, not only be able to just define your model in Python, that's very common, right? But also do ML optimization, like engineering optimization, CUDA kernel optimization, memory planning, all those things in Python that makes it customizable and so on. But when you do deployment, we realize that people want a bit of a universal flavor. If you are a web developer, you want JavaScript, right? If you're maybe an embedded system person, maybe you would prefer C++ or C or Rust. And people sometimes do like Python in a lot of cases. So in the case of MLC, we really want to have this vision of, you optimize, build a generic optimization in Python, then you deploy that universally onto the environments that people like. [00:32:54]Swyx: That's a great perspective and comparison, I guess. One thing I wanted to make sure that we cover is that I think you are one of these emerging set of academics that also very much focus on your artifacts of delivery. Of course. Something we talked about for three years, that he was very focused on his GitHub. And obviously you treated XGBoost like a product, you know? And then now you're publishing an iPhone app. Okay. Yeah. Yeah. What is your thinking about academics getting involved in shipping products? [00:33:24]Tianqi: I think there are different ways of making impact, right? Definitely, you know, there are academics that are writing papers and building insights for people so that people can build products on top of them. In my case, I think the particular field I'm working on, machine learning systems, I feel like really we need to be able to get it to the hands of people so that really we see the problem, right? And we show that we can solve a problem. And it's a different way of making impact. And there are academics that are doing similar things. Like, you know, if you look at some of the people from Berkeley, right? Every few years, they will come up with big open source projects. Certainly, I think it's just a healthy ecosystem to have different ways of making impacts. And I feel like really being able to do open source and work with the open source community is really rewarding because we have a real problem to work on when we build our research. Actually, that research comes together and people will be able to make use of it. And we also start to see interesting research challenges that we wouldn't otherwise see, right, if you're just trying to do a prototype and so on. So I feel like it's something that is one interesting way of making impact, making contributions. [00:34:40]Swyx: Yeah, you definitely have a lot of impact there. And having experience publishing Mac stuff before, the Apple App Store is no joke. It is the hardest compilation, human compilation effort. So one thing that we definitely wanted to cover is running in the browser. You have a 70 billion parameter model running in the browser. That's right. Can you just talk about how? Yeah, of course. [00:35:02]Tianqi: So I think that there are a few elements that need to come in, right? First of all, you know, we do need a MacBook, the latest one, like M2 Max, because you need the memory to be big enough to cover that. So for a 70 billion model, it takes you about, I think, 50 gigabytes of RAM. 
So the M2 Max, the upper version, will be able to run it, right? And it also leverages machine learning compilation. Again, what we are doing is the same, whether it's running on iPhone, on server cloud GPUs, on AMDs, or on MacBook, we all go through that same MLC pipeline. Of course, in certain cases, maybe we'll do a bit of customization iteration for either ones. And then it runs on the browser runtime, this package called WebLLM. So that will effectively... So what we do is we will take that original model and compile to what we call WebGPU. And then WebLLM will be able to pick it up. And WebGPU is this latest GPU technology that major browsers are shipping right now. So you can get it in Chrome already. It allows you to be able to access your native GPUs from a browser. And then effectively, that language model is just invoking the WebGPU kernels through there. So actually, when Llama 2 came out, initially, we asked the question about, can you run 70 billion on a MacBook? That was the question we're asking. So first, we actually... Jin Lu, who is the engineer pushing this, he got 70 billion on a MacBook. We had a CLI version. So in MLC, you will be able to... That runs through a Metal accelerator. So effectively, you use the Metal programming language to get the GPU acceleration. So we find, okay, it works for the MacBook. Then we asked, we had a WebGPU backend. Why not try it there? So we just tried it out. And it's really amazing to see everything up and running. And actually, it runs smoothly in that case. So I do think there are some kind of interesting use cases already in this, because everybody has a browser. You don't need to install anything. I think it doesn't make sense yet to really run a 70 billion model in a browser, because you kind of need to be able to download the weights and so on. But I think we're getting there. Effectively, the most powerful models you will be able to run on a consumer device. It's kind of really amazing. And also, in a lot of cases, there might be use cases. For example, if I'm going to build a chatbot that I talk to and it answers questions, maybe some of the components, like the voice to text, could run on the client side. And so there are a lot of possibilities of being able to have something hybrid that contains the edge component or something that runs on a server. [00:37:47]Alessio: Do these browser models have a way for applications to hook into them? So if I'm using, say, you can use OpenAI or you can use the local model. Of course. [00:37:56]Tianqi: Right now, actually, we are building... So there's an NPM package called WebLLM, right? So that you will be able to, if you want to embed it onto your web app, you will be able to directly depend on WebLLM and you will be able to use it. We also have a REST API that's OpenAI compatible. So that REST API, I think, right now, it's actually running on a native backend, so that with a CUDA server it's faster to run on the native backend. But also we have a WebGPU version of it that you can go and run. So yeah, we do want to be able to have easier integrations with existing applications. And the OpenAI API is certainly one way to do that. Yeah, this is great. [00:38:37]Swyx: I actually did not know there's an NPM package that makes it very, very easy to try out and use. I want to actually... One thing I'm unclear about is the chronology. Because as far as I know, Chrome shipped WebGPU the same time that you shipped WebLLM. Okay, yeah. So did you have some kind of secret chat with Chrome? 
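
Since the REST API mentioned above is described as OpenAI-compatible, calling it from an application looks roughly like the sketch below; the host, port, and model name are placeholders for whatever the local server actually exposes.

# Sketch of calling an OpenAI-compatible chat completion endpoint served locally.
# The URL, port, and model name are placeholders for whatever the local server exposes.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",    # placeholder endpoint
    json={
        "model": "local-llm",                        # placeholder model id
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])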
[00:38:57]Tianqi: The good news is that Chrome is doing a very good job of trying to have early releases. So although the official shipment of the Chrome WebGPU is the same time as WebLLM, actually, you will be able to try out WebGPU technology in Chrome. There is an unstable version called Canary. I think as early as two years ago, there was a WebGPU version. Of course, it's getting better. So we had a TVM-based WebGPU backend two years ago. Of course, at that time, there were no language models. It was running on less interesting, well, still quite interesting models. And then this year, we really started to see it getting matured and performance keeping up. So we have a more serious push of bringing the language model compatible runtime onto the WebGPU. [00:39:45]Swyx: I think you'd agree that the hardest part is the model download. Have there been conversations about a one-time model download and sharing between all the apps that might use this API? That is a great point. [00:39:58]Tianqi: I think it's already supported in some sense. When we download the model, WebLLM will cache it onto a special Chrome cache. So if a different web app uses the same WebLLM JavaScript package, you don't need to redownload the model again. So there is already something there. But of course, you have to download the model once at least to be able to use it. [00:40:19]Swyx: Okay. One more thing just in general before we're about to zoom out to OctoAI. Just the last question is, you're not the only project working on, I guess, local models. That's right. Alternative models. There's gpt4all, there's Ollama that just recently came out, and there's a bunch of these. What would be your advice to them on what's a valuable problem to work on? And what are just thin wrappers around ggml? Like, what are the interesting problems in this space, basically? [00:40:45]Tianqi: I think making APIs better is certainly something useful, right? In general, one thing that we do try to push very hard on is this idea of easier universal deployment. So we are also looking forward to actually have more integration with MLC. That's why we're trying to build APIs like WebLLM and other things. So we're also looking forward to collaborate with all those ecosystems and work on support to bring in models more universally and be able to also keep up the best performance when possible in a more push-button way. [00:41:15]Alessio: So as we mentioned in the beginning, you're also the co-founder of OctoML. Recently, OctoML released OctoAI, which is a compute service, basically focused on optimizing model runtimes and acceleration and compilation. What has been the evolution there? So Octo started as kind of like a traditional MLOps tool, where people were building their own models and you help them on that side. And then it seems like now most of the market is shifting to starting from pre-trained generative models. Yeah, what has been that experience for you and how have you seen the market evolve? And how did you decide to release OctoAI? [00:41:52]Tianqi: One thing that we found out is that on one hand, it's really easy to go and get something up and running, right? But once you start to consider that there are so many possible availability and scalability issues and even integration issues, things become kind of interesting and complicated. So we really want to make sure to help people to get that part easy, right? 
And now a lot of things, if we look at the customers we talk to and the market, certainly generative AI is something that is very interesting. So that is something that we really hope to help elevate. And also building on top of technology we build to enable things like portability across hardware. And you will be able to not worry about the specific details, right? Just focus on getting the model out. We'll try to work on infrastructure and other things that helps on the other end. [00:42:45]Alessio: And when it comes to getting optimization on the runtime, I see, when we run our early adopters community, most enterprises' issue is how to actually run these models. Do you see that as one of the big bottlenecks now? I think a few years ago it was like, well, we don't have a lot of machine learning talent. We cannot develop our own models. Versus now it's like, there's these great models you can use, but I don't know how to run them efficiently. [00:43:12]Tianqi: That depends on how you define running, right? On one hand, it's easy to download, like with MLC, you download it, you run it on a laptop, but then there's also different decisions, right? What if you are trying to serve a larger user request? What if that request changes? What if the availability of hardware changes? Right now it's really hard to get the latest NVIDIA hardware, unfortunately, because everybody's trying to work on the things using the hardware that's out there. So I think when the definition of run changes, there are a lot more questions around things. And also in a lot of cases, it's not only about running models, it's also about being able to solve problems around them. How do you manage your model locations and how do you make sure that you get your model close to your execution environment more efficiently? So definitely a lot of engineering challenges out there. That we hope to elevate, yeah. And also, if you think about our future, definitely I feel like right now the technology, given the technology and the kind of hardware availability we have today, we will need to make use of all the possible hardware available out there. That will include a mechanism for cutting down costs, bringing something to the edge and cloud in a more natural way. So I feel like still this is a very early stage of where we are, but it's already good to see a lot of interesting progress. [00:44:35]Alessio: Yeah, that's awesome. I would love, I don't know how much we're going to go in depth into it, but what does it take to actually abstract all of this from the end user? You know, like they don't need to know what GPUs you run, what cloud you're running them on. You take all of that away. What was that like as an engineering challenge? [00:44:51]Tianqi: So I think that there are engineering challenges there. In fact, first of all, you will need to be able to support all the kinds of hardware backends you have, right? On one hand, if you look at the NVIDIA libraries, you'll find, very surprisingly, not too surprisingly, most of the latest libraries work well on the latest GPU. But there are other GPUs out there in the cloud as well. So certainly being able to have know-hows and being able to do model optimization is one thing, right? Also infrastructures on being able to scale things up, locate models. And in a lot of cases, we do find that on typical models, it also requires kind of vertical iterations. So it's not about, you know, building a silver bullet and that silver bullet is going to solve all the problems. 
It's more about, you know, we're building a product, we'll work with the users and we find out there are interesting opportunities at a certain point. And then our engineers will go and solve that, and it will automatically be reflected in the service. [00:45:45]Swyx: Awesome. [00:45:46]Alessio: We can jump into the lightning round, unless, I don't know, Sean, if you have more questions or TQ, if you have more stuff you wanted to talk about that we didn't get a chance to [00:45:54]Swyx: touch on. [00:45:54]Alessio: Yeah, we have talked a lot. [00:45:55]Swyx: So, yeah. We always would like to ask, you know, do you have a commentary on other parts of AI and ML that is interesting to you? [00:46:03]Tianqi: So right now, I think one thing that we are really pushing hard for is this question about how far can we bring open source, right? I'm kind of like a hacker and I really like to put things together. So I think it's unclear in the future of what the future of AI looks like. On one hand, it could be possible that, you know, you just have a few big players, you just try to talk to those bigger language models and that can do everything, right? On the other hand, one of the things that we in academia are really excited about and pushing for, that's one reason why I'm pushing for MLC, is that can we build something where you have different models? You have personal models that know the best movie you like, but you also have bigger models that maybe know more, and you get those models to interact with each other, right? And be able to have a wide ecosystem of AI agents that helps each person while still being able to do things like personalization. Some of them can run locally, some of them, of course, running on a cloud, and how do they interact with each other? So I think that is a very exciting time where the future is yet undecided, but I feel like there is something we can do to shape that future as well. [00:47:18]Swyx: One more thing, which is something I'm also pursuing, which is, and this kind of goes back into predictions, but also back in your history, do you have any idea, or are you looking out for anything post-transformers as far as architecture is concerned? [00:47:32]Tianqi: I think, you know, in a lot of these cases, you can find there are already promising models for long contexts, right? There are state space models, where like, you know, some of our colleagues, like Albert, who worked on the HiPPO models, right? And then there is an open source version called RWKV. It's like a recurrent model that allows you to summarize things. Actually, we are bringing RWKV to MLC as well, so maybe you will be able to see one of the models. [00:48:00]Swyx: We actually recorded an episode with one of the RWKV core members. It's unclear because there's no academic backing. It's just open source people. Oh, I see. So you like the merging of recurrent networks and transformers? [00:48:13]Tianqi: I do love to see this model space continue growing, right? And I feel like in a lot of cases, it's just that the attention mechanism is getting changed in some sense. So I feel like definitely there are still a lot of things to be explored here. And that is also one reason why we want to keep pushing machine learning compilation, because one of the things we are trying to push in was productivity. So that for machine learning engineering, so that as soon as some of the models came out, we will be able to, you know, empower them onto those environments that's out there. 
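
For readers unfamiliar with state space models, the core idea is a linear recurrence that summarizes the whole history into a fixed-size state; the toy sketch below shows only that generic recurrence, not the HiPPO/S4 parameterization or RWKV's specific formulation.

# Toy discrete-time linear state space model: the input sequence is summarized
# into a fixed-size state h, updated step by step. Generic recurrence only; not
# the HiPPO/S4 parameterization and not RWKV's specific formulation.
import numpy as np

d_state, d_in, d_out, T = 8, 1, 1, 50
rng = np.random.default_rng(0)
A = rng.normal(scale=0.3, size=(d_state, d_state))  # state transition
B = rng.normal(size=(d_state, d_in))                 # input projection
C = rng.normal(size=(d_out, d_state))                # readout

h = np.zeros((d_state, 1))
u = rng.normal(size=(T, d_in, 1))                    # input sequence
ys = []
for t in range(T):
    h = A @ h + B @ u[t]       # h_t = A h_{t-1} + B u_t
    ys.append(C @ h)           # y_t = C h_t
print(np.array(ys).shape)      # (T, d_out, 1): one output per step from a fixed-size state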
[00:48:43]Swyx: Yeah, it's a really good mission. Okay. Very excited to see that RWKV and state space model stuff. I'm hearing increasing chatter about that stuff. Okay. Lightning round, as always, fun. I'll take the first one. Acceleration. What has already happened in AI that you thought would take much longer? [00:48:59]Tianqi: The emergence of conversational chatbot ability is something that kind of surprised me before it came out. This is one piece that I feel originally I thought would take much longer, but yeah, [00:49:11]Swyx: it happens. And it's funny because the original Eliza chatbot was something that goes all the way back in time. Right. And then it just suddenly came back again. Yeah. [00:49:21]Tianqi: It's always interesting to think about, but with a kind of a different technology [00:49:25]Swyx: in some sense. [00:49:25]Alessio: What about the most interesting unsolved question in AI? [00:49:31]Swyx: That's a hard one, right? [00:49:32]Tianqi: So I can tell you what kinds of things I'm excited about. So I think that I have always been excited about this idea of continuous learning and lifelong learning in some sense. So how AI continues to evolve with the knowledge that has been there. It seems that we're getting much closer with all those recent technologies. So being able to develop systems, support, and be able to think about how AI continues to evolve is something that I'm really excited about. [00:50:01]Swyx: So specifically, just to double click on this, are you talking about continuous training? That's like a training. [00:50:06]Tianqi: I feel like, you know, training, adaptation, it's all similar things, right? You want to think about the entire life cycle, right? The life cycle of collecting data, training, fine tuning, and maybe having your local context that gets continuously curated and fed into models. So I think all these things are interesting and relevant here. [00:50:29]Swyx: Yeah. I think this is something that people are really asking, you know, right now we have moved a lot into the sort of pre-training phase and off the shelf, you know, the model downloads and stuff like that, which seems very counterintuitive compared to the continuous training paradigm that people want. So I guess the last question would be for takeaways. What's basically one message that you want every listener, every person to remember today? [00:50:54]Tianqi: I think it's getting more obvious now, but one of the things that I always want to mention in my talks is that, you know, when you're thinking about AI applications, originally people think about algorithms a lot more, right? Algorithms and models, they are still very important. But usually when you build AI applications, it takes, you know, both the algorithm side, the system optimizations, and the data curation, right? So it takes a connection of so many facets to be able to bring together an AI system, and being able to look at it from that holistic perspective is really useful when we start to build modern applications. I think it's going to continue to be more important in the future. [00:51:35]Swyx: Yeah. Thank you for showing the way on this. And honestly, just making things possible that I thought would take a lot longer. So thanks for everything you've done. [00:51:46]Tianqi: Thank you for having me. [00:51:47]Swyx: Yeah. [00:51:47]Alessio: Thanks for coming on, TQ. [00:51:49]Swyx: Have a good one. [00:51:49] Get full access to Latent Space at www.latent.space/subscribe

Machine Learning Street Talk
Can We Develop Truly Beneficial AI? George Hotz and Connor Leahy

Machine Learning Street Talk

Play Episode Listen Later Aug 4, 2023 89:59


Patreon: https://www.patreon.com/mlst Discord: https://discord.gg/ESrGqhf5CB George Hotz and Connor Leahy discuss the crucial challenge of developing beneficial AI that is aligned with human values. Hotz believes truly aligned AI is impossible, while Leahy argues it's a solvable technical challenge. Hotz contends that AI will inevitably pursue power, but distributing AI widely would prevent any single AI from dominating. He advocates open-sourcing AI developments to democratize access. Leahy counters that alignment is necessary to ensure AIs respect human values. Without solving alignment, general AI could ignore or harm humans. They discuss whether AI's tendency to seek power stems from optimization pressure or human-instilled goals. Leahy argues goal-seeking behavior naturally emerges while Hotz believes it reflects human values. Though agreeing on AI's potential dangers, they differ on solutions. Hotz favors accelerating AI progress and distributing capabilities while Leahy wants safeguards put in place. While acknowledging risks like AI-enabled weapons, they debate whether broad access or restrictions better manage threats. Leahy suggests limiting dangerous knowledge, but Hotz insists openness checks government overreach. They concur that coordination and balance of power are key to navigating the AI revolution. Both eagerly anticipate seeing whose ideas prevail as AI progresses. Transcript and notes: https://docs.google.com/document/d/1smkmBY7YqcrhejdbqJOoZHq-59LZVwu-DNdM57IgFcU/edit?usp=sharing TOC: [00:00:00] Introduction to George Hotz and Connor Leahy [00:03:10] George Hotz's Opening Statement: Intelligence and Power [00:08:50] Connor Leahy's Opening Statement: Technical Problem of Alignment and Coordination [00:15:18] George Hotz's Response: Nature of Cooperation and Individual Sovereignty [00:17:32] Discussion on individual sovereignty and defense [00:18:45] Debate on living conditions in America versus Somalia [00:21:57] Talk on the nature of freedom and the aesthetics of life [00:24:02] Discussion on the implications of coordination and conflict in politics [00:33:41] Views on the speed of AI development / hard takeoff [00:35:17] Discussion on potential dangers of AI [00:36:44] Discussion on the effectiveness of current AI [00:40:59] Exploration of potential risks in technology [00:45:01] Discussion on memetic mutation risk [00:52:36] AI alignment and exploitability [00:53:13] Superintelligent AIs and the assumption of good intentions [00:54:52] Humanity's inconsistency and AI alignment [00:57:57] Stability of the world and the impact of superintelligent AIs [01:02:30] Personal utopia and the limitations of AI alignment [01:05:10] Proposed regulation on limiting the total number of flops [01:06:20] Having access to a powerful AI system [01:18:00] Power dynamics and coordination issues with AI [01:25:44] Humans vs AI in Optimization [01:27:05] The Impact of AI's Power Seeking Behavior [01:29:32] A Debate on the Future of AI

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
FlashAttention 2: making Transformers 800% faster w/o approximation - with Tri Dao of Together AI

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Jul 26, 2023 54:31


FlashAttention was first published by Tri Dao in May 2022 and it had a deep impact on the large language model space. Most open models you've heard of (RedPajama, MPT, LLaMA, Falcon, etc) all leverage it for faster inference. Tri came on the podcast to chat about FlashAttention, the newly released FlashAttention-2, the research process at Hazy Research, and more. This is the first episode of our "Papers Explained" series, which will cover some of the foundational research in this space. Our Discord also hosts a weekly Paper Club, which you can sign up for here.

How does FlashAttention work?

The paper is titled "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". There are a couple keywords to call out:

* "Memory Efficient": standard attention memory usage is quadratic with sequence length (i.e. O(N^2)). FlashAttention is linear, O(N).
* "Exact": the opposite of "exact" in this case is "sparse", as in "sparse networks" (see our episode with Jonathan Frankle for more). This means that you're not giving up any precision.
* The "IO" in "IO-Awareness" stands for "Input/Output" and hints at a write/read related bottleneck.

Before we dive in, look at this simple GPU architecture diagram. The GPU has access to three memory stores at runtime:

* SRAM: this is on-chip memory co-located with the actual execution core. It's limited in size (~20MB on an A100 card) but extremely fast (19TB/s total bandwidth)
* HBM: this is off-chip but on-card memory, meaning it's in the GPU but not co-located with the core itself. An A100 has 40GB of HBM, but only a 1.5TB/s bandwidth.
* DRAM: this is your traditional CPU RAM. You can have TBs of this, but you can only get ~12.8GB/s bandwidth, which is way too slow.

Now that you know what HBM is, look at how the standard Attention algorithm is implemented. As you can see, all 3 steps include a "write X to HBM" step and a "read from HBM" step. The core idea behind FlashAttention boils down to this: instead of storing each intermediate result, why don't we use kernel fusion and run every operation in a single kernel in order to avoid memory read/write overhead? (We also talked about kernel fusion in our episode with George Hotz and how PyTorch / tinygrad take different approaches here)

The result is much faster, but much harder to read. As you can see, FlashAttention is a very meaningful speed improvement on traditional Attention, and it's easy to understand why it's becoming the standard for most models. This should be enough of a primer before you dive into our episode!
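To make the memory-traffic point concrete, here is a minimal PyTorch sketch of the standard (pre-FlashAttention) computation. The shapes and names are illustrative, not taken from any particular library; the point is that the two N x N intermediates are fully materialized, which is exactly the HBM round-tripping a fused kernel avoids.

```python
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scale = q.shape[-1] ** -0.5
    s = (q @ k.transpose(-2, -1)) * scale   # scores: (batch, heads, N, N), written out in full
    p = torch.softmax(s, dim=-1)            # probabilities: another full N x N intermediate
    return p @ v                            # output: (batch, heads, N, head_dim)

q = k = v = torch.randn(1, 8, 1024, 64)
out = naive_attention(q, k, v)              # at large N, the N x N tensors dominate memory
```

FlashAttention computes the same exact result, but tiles the work so those intermediates never have to leave fast on-chip memory.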
We talked about FlashAttention-2, how the Hazy Research group works, and some of the research being done on Transformer alternatives.

Show Notes:
* FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (arXiv)
* FlashAttention-2
* Together AI
* From Deep Learning to Long Learning
* The Hardware Lottery by Sara Hooker
* Hazy Research
* Is Attention All You Need?
* Nvidia CUTLASS 3
* SRAM scaling slows
* Transformer alternatives:
* S4
* Hyena
* Recurrent Neural Networks (RNNs)

Timestamps:
* Tri's background [00:00:00]
* FlashAttention's deep dive [00:02:18]
* How the Hazy Research group collaborates across theory, systems, and applications [00:17:21]
* Evaluating models beyond raw performance [00:25:00]
* FlashAttention-2 [00:27:00]
* CUDA and The Hardware Lottery [00:30:00]
* Researching in a fast-changing market [00:35:00]
* Promising transformer alternatives like state space models and RNNs [00:37:30]
* The spectrum of openness in AI models [00:43:00]
* Practical impact of models like LLAMA2 despite restrictions [00:47:12]
* Incentives for releasing open training datasets [00:49:43]
* Lightning Round [00:53:22]

Transcript:

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. Today we have no Swyx, because he's in Singapore, so it's a one-on-one discussion with Tri Dao. Welcome! [00:00:24]Tri: Hi everyone. I'm Tri Dao, excited to be here. [00:00:27]Alessio: Tri just completed his PhD at Stanford a month ago. You might not remember his name, but he's one of the main authors of the FlashAttention paper, which is one of the seminal works in the Transformers era. He's got a lot of interesting work on efficient transformer training and inference, long range sequence models, a lot of interesting stuff. And now you're going to be an assistant professor in CS at Princeton next year. [00:00:51]Tri: Yeah, that's right. [00:00:52]Alessio: Yeah. And in the meantime, just to get, you know, a low pressure thing, you're Chief Scientist at Together as well, which is the company behind RedPajama. [00:01:01]Tri: Yeah. So I just joined this week actually, and it's been really exciting. [00:01:04]Alessio: So what's something that is not on the internet that people should know about you? [00:01:09]Tri: Let's see. When I started college, I was going to be an economist, so I was fully on board. I was going to major in economics, but the first week I was at Stanford undergrad, I took a few math classes and I immediately decided that I was going to be a math major. And that kind of changed the course of my career. So now I'm doing math, computer science, AI research. [00:01:32]Alessio: I had a similar thing. I started with physics and then I took like a programming course and I was like, I got to do computer science. I don't want to do physics. So FlashAttention is definitely, everybody's using this. Everybody loves it. You just released FlashAttention 2 last week. [00:01:48]Tri: Yeah. Early this week on Monday. Yeah. [00:01:53]Alessio: You know, AI time. Things move fast. So maybe let's run through some of the FlashAttention highlights, some of the innovations there, and then we can dive into FlashAttention 2. So the core improvement in FlashAttention is that traditional attention is quadratic in sequence length, N to the two, and FlashAttention is linear, which obviously helps with scaling some of these models. [00:02:18]Tri: There are two factors there. So of course the goal has been to make attention go faster or more memory efficient.
And ever since attention became popular in 2017 with the Transformer paper, lots and lots of folks have been working on this. And a lot of approaches have been focusing on approximating attention. The goal is you want to scale to longer sequences. There are tons of applications where you want to do that. But scaling to longer sequences is difficult because attention scales quadratically in sequence length on both runtime and memory, as you mentioned. So instead of trying to approximate attention, we were trying to figure out, can we do the same computation and maybe be more memory efficient? So in the end, we ended up with memory that is linear in sequence length. In terms of computation, it's still quadratic, but we managed to make it much more hardware friendly. And as a result, we do get wall clock speed up on the order of 2 to 4x, which really helps because that just means that you'll be able to train with 2 to 4x longer sequence length for the same cost without doing any approximations. As a result, lots of folks have been using this. The thing is available in a lot of libraries that do language model training or fine tuning. [00:03:32]Alessio: And the approximation thing is important because this is an exact thing versus a sparse. So maybe explain a little bit the difference there. [00:03:40]Tri: For sure. So in attention, essentially you compute pairwise similarity between every single element in a sequence against each other. So there's been other approaches where instead of doing all that pairwise computation, you only compute similarity for some pairs of elements in the sequence. So you don't do a quadratic number of comparisons. And this can be seen as some form of sparsity. Essentially you're ignoring some of the elements. When you write down the matrix, you essentially say, OK, I'm going to pretend these are zero. So that has some benefits in terms of runtime and memory. But the trade-off is that it tends to do worse in terms of quality because you're essentially approximating or ignoring some elements. And I personally have worked on this as well for a few years. But when we talk to practitioners who actually train models, especially at large scale, they say they tend not to use these approximate attention methods. Because it turns out, and this was surprising to me at the time, that these approximation methods, even though they perform fewer computations, they tend to not be faster in wall clock time. So this was pretty surprising because back then, I think my background was more on the theoretical side. So I was thinking of, oh, how many flops or floating point operations are you performing? And hopefully that correlates well with wall clock time. But I realized that I was missing a bunch of ideas from the system side where flops or floating point operations don't necessarily correlate with runtime. There are other factors like memory reading and writing, parallelism, and so on. So I learned a ton from just talking to systems people because they kind of figured this stuff out a while ago. So that was really eye-opening. And then we ended up focusing a lot more on memory reading and writing because that turned out to be the majority of the time when you're doing attention is reading and writing memory. [00:05:34]Alessio: Yeah, the I.O. awareness is probably one of the biggest innovations here. And the idea behind it is, like you mentioned, the FLOPS growth of the cards has been going up, but the memory bandwidth, not as much.
So I think maybe that was one of the assumptions that the original attention paper had. So talk a bit about how that came to be as an idea. It's one of those things that, in hindsight, it's like, obviously, why are we rewriting to HBM every time, you know, and once you change it, it's clear. But what was that discovery process? [00:06:08]Tri: Yeah, in hindsight, a lot of the ideas have already been there in the literature. And I would say it was somehow at the intersection of both machine learning and systems. And you kind of needed ideas from both sides. So on one hand, on the system side, lots of systems folks have known that, oh, you know, kernel fusion is great. Kernel fusion just means that instead of loading the same element, performing an operation, writing it down, loading it back up, and performing the second operation, you just load it once, perform two operations, and then write it down again. So that saves you kind of memory read and write in the middle there. So kernel fusion has been a classic. There's been other techniques from the system side, like tiling, where you perform the computations in blocks, again, so that you can load them into a really fast memory. Think of it as a cache. And this is, again, classical computer science ideas, right? You want to use the cache. So the system folks have been thinking about these ideas for a long time, and they apply to attention as well. But there were certain things in attention that made it difficult to do a complete kernel fusion. One of which is there is this softmax operation in the middle, which requires you to essentially sum across the row of the attention matrix. So it makes it difficult to break things into blocks, because there's this dependency. So on the system side, people have been thinking about these ideas, but it's been difficult to kind of do kernel fusion for the entire operation. On the machine learning side, people have been thinking more algorithmically. They say, okay, either we can approximate attention, or there's this trick called the online softmax trick, which says that because of softmax, the way it's written mathematically, you can actually break it up into smaller pieces, do some rescaling, and still get the right answer. So this online softmax trick has been around for a while. I think there was a paper from NVIDIA folks back in 2018 about this. And then there was a paper from Google. So Markus Rabe and Charles Staats wrote a paper in late 2021 on using this online softmax trick to break attention up into smaller pieces. So a lot of the ideas were already there. But it turns out, you kind of need to combine ideas from both sides. So you need to understand that, hey, we want to do kernel fusion to reduce memory reads and writes. But we also need this online softmax trick to be able to break the softmax into smaller pieces so that a lot of the systems tricks kind of carry through. We saw that, and it was kind of a natural idea that we ended up using ideas from both sides, and it ended up working pretty well. Yeah. [00:08:57]Alessio: Are there any downsides to kernel fusion? If I think about databases and the reasons why we have atomic operations, you know, it's like, you have observability and fallback in between them. How does that work with attention? Is there anything that we lose by fusing the operations?
[00:09:13]Tri: Yeah, I think mostly on the practical side is that you lose a little bit of flexibility in the sense that, hey, now you have, for example, faster attention, it's just a subroutine that you would call to do attention. But as a researcher, let's say you don't want that exact thing, right? You don't want just attention, let's say you want some modification to attention. You want to do, hey, I'm going to multiply the query and key, but then I'm going to do this extra thing before I carry on. So kernel fusion just means that, okay, we have a subroutine that does the entire thing. But if you want to experiment with things, you won't be able to use that fused kernel. And the answer is, can we have a compiler that then automatically does a lot of this kernel fusion? Lots of compiler folks are thinking about this, either with a new language or you can embed it in PyTorch. PyTorch folks have been working on this as well. So if you write just your code in PyTorch and they can capture the graph, can they generate code that will fuse everything together? That's still ongoing, and it works for some cases. But for attention, because of this kind of softmax rewriting stuff, it's been a little bit more difficult. So maybe in a year or two, we'll have compilers that are able to do a lot of these optimizations for you. And you don't have to, for example, spend a couple months writing CUDA to get this stuff to work. Awesome. [00:10:41]Alessio: And just to make it clear for listeners, when we say we're not writing it to memory, we are storing it, but just in a faster memory. So instead of the HBM, we're putting it in the SRAM. Yeah. [00:10:53]Tri: Yeah. [00:10:54]Alessio: Maybe explain just a little bit the difference there. [00:10:56]Tri: Yeah, for sure. This is kind of a caricature of how you think about accelerators or GPUs in particular, is that they have a large pool of memory, usually called HBM, or high bandwidth memory. So this is what you think of as GPU memory. So if you're using A100 and you list the GPU memory, it's like 40 gigs or 80 gigs. So that's the HBM. And then when you perform any operation, you need to move data from the HBM to the compute unit. So the actual hardware unit that does the computation. And next to these compute units, there are on-chip memory or SRAM, which are much, much smaller than HBM, but much faster. So the analogy there is if you're familiar with, say, CPU and RAM and so on. So you have a large pool of RAM, and then you have the CPU performing the computation. But next to the CPU, you have L1 cache and L2 cache, which are much smaller than DRAM, but much faster. So you can think of SRAM as the small, fast cache that stays close to the compute unit. Physically, it's closer. There is some kind of asymmetry here. So HBM is much larger, and SRAM is much smaller, but much faster. One way of thinking about it is, how can we design algorithms that take advantage of this asymmetric memory hierarchy? And of course, lots of folks have been thinking about this. These ideas are pretty old. I think back in the 1980s, the primary concerns were sorting. How can we sort numbers as efficiently as possible? And the motivating example was banks were trying to sort their transactions, and that needs to happen overnight so that the next day they can be ready. And so the same idea applies, which is that they have slow memory, which was hard disk, and they have fast memory, which was DRAM. And people had to design sorting algorithms that take advantage of this asymmetry. 
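The online softmax trick Tri describes, combined with tiling, is what lets the whole attention computation stay in that small, fast memory. Here is a toy NumPy sketch for a single query, scanning the keys and values one tile at a time; the names, tile size, and single-query simplification are all illustrative assumptions, not the actual CUDA kernel.

```python
import numpy as np

def blocked_attention_one_query(q, K, V, tile=128):
    """Same result as one big softmax over all scores, computed tile by tile."""
    d = q.shape[0]
    m = -np.inf                        # running max of the scores seen so far
    l = 0.0                            # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])         # running, still-unnormalized weighted sum of values
    for start in range(0, K.shape[0], tile):
        k_tile = K[start:start + tile]           # one block of keys (would sit in SRAM)
        v_tile = V[start:start + tile]
        s = k_tile @ q / np.sqrt(d)              # scores for this block only
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)                # rescale everything accumulated so far
        p = np.exp(s - m_new)
        l = l * alpha + p.sum()
        acc = acc * alpha + p @ v_tile
        m = m_new
    return acc / l

q, K, V = np.random.randn(64), np.random.randn(4096, 64), np.random.randn(4096, 64)
out = blocked_attention_one_query(q, K, V)       # never holds the full 4096-long score row at once
```

FlashAttention applies the same rescaling to whole blocks of queries inside one fused kernel, which is where the wall clock gains come from.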
And it turns out, these same ideas can apply today, which is different kinds of memory. [00:13:00]Alessio: In your paper, you have the pyramid of memory. Just to give people an idea, when he says smaller, it's like HBM is like 40 gig, and then SRAM is like 20 megabytes. So it's not a little smaller, it's much smaller. But the throughput on card is like 1.5 terabytes a second for HBM and like 19 terabytes a second for SRAM, which is a lot larger. How do you think that evolves? So TSMC said they hit the scaling limits for SRAM, they just cannot grow that much more. HBM keeps growing, HBM3 is going to be 2x faster than HBM2, I think the latest NVIDIA thing has HBM3. How do you think about the future of FlashAttention? Do you think HBM is going to get fast enough when maybe it's not as useful to use the SRAM? [00:13:49]Tri: That's right. I think it comes down to physics. When you design hardware, literally SRAM stays very close to compute units. And so you don't have that much area to essentially put the transistors. And you can't shrink these things too much. So just physics, in terms of area, you don't have that much area for the SRAM. HBM is off-chip, so there is some kind of bus that essentially transfers data from HBM to the compute unit. So you have more area to essentially put these memory units. And so yeah, I think in the future SRAM probably won't get that much larger, because you don't have that much area. HBM will get larger and faster. And so I think it becomes more important to design algorithms that take advantage of this memory asymmetry. It's the same thing in CPU, where the cache is really small, the DRAM is growing larger and larger. DRAM could get to, I don't know, two terabytes, six terabytes, or something, whereas the cache stays at, I don't know, 15 megabytes or something like that. I think maybe the algorithm design becomes more and more important. There's still ways to take advantage of this, I think. So in the future, I think flash attention right now is being used. I don't know if in the next couple of years, some new architecture will come in and whatnot, but attention seems to be still important. For the next couple of years, I still expect some of these ideas to be useful. Not necessarily the exact code that's out there, but I think these ideas have kind of stood the test of time. New ideas like IO awareness from back in the 1980s, ideas like kernel fusions, tiling. These are classical ideas that have stood the test of time. So I think in the future, these ideas will become more and more important as we scale models to be larger, as we have more kinds of devices, where performance and efficiency become much, much more important. [00:15:40]Alessio: Yeah, and we had Jonathan Frankle on the podcast, and if you go to issattentionallyouneed.com, he has an outstanding bet, and he does believe that attention will be the state of the art architecture still in a few years. Did you think flash attention would be this popular? I'm always curious on the research side, you publish a paper, and obviously you know it's great work, but sometimes it just kind of falls flat in the industry. Could you see everybody just starting to use this, or was that a surprise to you? [00:16:11]Tri: Certainly, I didn't anticipate the level of popularity. Of course, we were extremely happy to have people using this stuff and giving us feedback and so on, and help us improve things. 
I think when we were writing the paper, I remember sending an email to one of my advisors, and like, hey, I'm excited about this paper, but I think the most important thing will be the artifact, which is the code. So I knew that the code will be valuable. So we kind of focus a lot on the code and make sure that the code is usable and as fast as can be. Of course, the idea, the paper presents the ideas and explain it and have experiments that validate the idea, but I knew that the artifact or the code was also pretty important. And that turned out to be the right focus, which is, you know, we put out the paper, we release the code and continue working on the code. So it's a team effort with my co-authors as well. [00:17:07]Alessio: We mentioned Hazy Research a bunch of times on the podcast before. I would love for you to spend five minutes just talking about how does the group work? How do people get together? How do you bounce ideas off of each other? Yeah. [00:17:21]Tri: So Hazy Research is a research group at Stanford led by one of my advisors, Chris Re. I love the people there. It was one of the best experiences I had. They've made my PhD so much more enjoyable. And I think there are a couple of ways that the group has been working pretty well. So one is, I think there's a diverse pool of people who either, you know, some of them focus on algorithms and theory, some of them focus on building systems, some of them focus on applications. And as a result, there is this flow of idea. So as an example, some of us were working on like more algorithms and theory, and then we can talk to the folks building systems and say, hey, let's try it out and let's put it in the systems and see how it is. And there you will get feedback from systems folks. They will say, hey, we implemented this, or we tried this and this is where it doesn't work, something like that. And once we put it in the systems, the application folks can use the algorithm or new methods or new models. And we again get great feedback from them because the application folks, for example, some of my good friends, they focus on medical imaging or seizure detection. And that is the problem they care about. And if your method doesn't work on the task they care about, they will tell you. Whereas I think a lot of people in machine learning, they're a little bit more flexible. So they will be like, hey, it doesn't work on seizure detection. Let's try some other task, right? But having that direct feedback of like, hey, it doesn't work there, let's figure out why. I think that that feedback allows us to do better work. And I think that kind of process of exchanging ideas, validating it in a real system so that applications folks can try it out and give you feedback. That cycle has been very, very useful. And so that's one, having a diverse group of people. The other one is, and this is something I really appreciate from advice from Chris was try to understand the fundamental, right? And he's happy letting me go off and read some textbooks and playing with things because I think a lot of research ideas come from understanding the old literature and see how it fits with the new landscape. And so if you just new archive papers every day, that's great, but you also need to read textbooks. And that's one advice I got from Chris, which is understand the fundamentals. And I think that allows us to do more impactful work. [00:19:46]Alessio: How do you think about academia versus industry? 
I feel like AI / Machine Learning has been an area where up until three, four years ago, most of the cutting edge work was being done in academia. And now there's all these big industry research labs. You're obviously going to Princeton, so you're an academia believer. How should people think about where to go? Say I'm doing my master's, I have to decide between doing a PhD and going into OpenAI Anthropic. How should I decide? [00:20:15]Tri: I think they kind of play a complementary role, in my opinion. Of course, I also was considering different paths as well. So I think right now, scaling matters a lot, especially when you talk about language models and AI and so on. Scaling matters a lot. And that means that you need compute resources and you need infrastructure and you need engineers time. And so industry tends to have an advantage when it comes to scaling things. But a lot of the ideas actually came from academia. So let's take Attention, which got popular with the Transformer in 2017. Attention actually has been around for a while. So I think the first mention was in 2014, a paper from Bernadot and others and Yoshua Bengio, which is coming from academia. A lot of ideas did come from academia. And scaling things up, of course, I think OpenAI has been great at scaling things up. That was the bet that they made after, I think, GPT-2. So they saw that scaling these things up to back then was 1.5 billion parameter seemed to give you amazing capabilities. So they really committed to that. They really committed to scaling things. And that turned out to be, it's been a pretty successful bet. I think for academia, we're still trying to figure out exactly what we're doing in this shifting landscape. And so lots of folks have been focusing on, for example, evaluation. So I know the Stanford Center for Foundation Model led by Percy, they have this benchmark called HELM, which is this holistic benchmark. So trying to figure out, okay, characterizing the landscape of different kinds of models, what people should evaluate, what people should measure, and things like that. So evaluation is one role. The other one is understanding. So this has happened historically where there's been some development in the industry and academia can play a role in explaining, understanding. They have the luxury to slow down trying to understand stuff, right? So lots of paper on understanding what's really going on, probing these models, and so on. I think I'm not as familiar with the NLP literature, but my impression is there's a lot of that going on in the NLP conferences, which is understanding what these models are doing, what capabilities they have, and so on. And the third one I could see is that the academia can take more risky bets in the sense that we can work on stuff that is quite different from industry. I think industry, my impression is you have some objective. You're trying to say, hey, for this quarter, we want to scale the model in this particular way. Next quarter, we want the model to have these capabilities. You're trying to get objectives that maybe, I don't know, 70% that will work out because it's important for the company's direction. I think for academia, the way things work is you have many, many researchers or PhD students, and they're kind of pursuing independent directions. And they have a little bit more flexibility on, hey, I'm going to try out this seemingly crazy idea and see, let's say there's a 30% chance of success or something. 
And however you define success, for academia, a lot of the time, success just means like, hey, we found something interesting. That could eventually go into industry through collaboration and so on. So I do see academia and industry kind of playing complementary roles. And as for someone choosing a career, I think just more and more generally, industry would be probably better in terms of compensation, in terms of probably work-life balance. But my biased perspective is that maybe academia gives you a little bit more freedom to think and understand things. So it probably comes down to personal choice. I end up choosing to be a professor next year at Princeton. But of course, I want to maintain a relationship with industry folks. I think industry folks can provide very valuable feedback to what we're doing in academia so that we understand where the field is moving because some of the directions are very much influenced by what, for example, OpenAI or Google is doing. So we want to understand where the field is moving. What are some promising applications? And try to anticipate, okay, if the field is moving like this, these applications are going to be popular. What problems will be important in two, three years? And then we try to start thinking about those problems so that hopefully in two, three years, we have some of the answers to some of these problems in two, three years. Sometimes it works out, sometimes it doesn't. But as long as we do interesting things in academia, that's the goal. [00:25:03]Alessio: And you mentioned the eval side. So we did a Benchmarks 101 episode. And one of the things we were seeing is sometimes the benchmarks really influence the model development. Because obviously, if you don't score well on the benchmarks, you're not going to get published and you're not going to get funded. How do you think about that? How do you think that's going to change now that a lot of the applications of these models, again, is in more narrow industry use cases? Do you think the goal of the academia eval system is to be very broad and then industry can do their own evals? Or what's the relationship there? [00:25:40]Tri: Yeah, so I think evaluation is important and often a little bit underrated. So it's not as flashy as, oh, we have a new model that can do such and such. But I think evaluation, what you don't measure, you can't make progress on, essentially. So I think industry folks, of course, they have specific use cases that their models need to do well on. And that's what they care about. Not just academia, but other groups as well. People do understand what are some of the emerging use cases. So for example, now one of the most popular use cases is Chatbot. And then I think folks from Berkeley, some of them are from Berkeley, call them MLCs. They set up this kind of Chatbot arena to essentially benchmark different models. So people do understand what are some of the emerging use cases. People do contribute to evaluation and measurement. And as a whole, I think people try to contribute to the field and move the field forward, albeit that maybe slightly different directions. But we're making progress and definitely evaluation and measurement is one of the ways you make progress. So I think going forward, there's still going to be just more models, more evaluation. We'll just have better understanding of what these models are doing and what capabilities they have. 
[00:26:56]Alessio: I like that your work has been focused on not making benchmarks better, but it's like, let's just make everything faster. So it's very horizontal. So FlashAttention 2, you just released that on Monday. I read in the blog post that a lot of the work was also related to some of the NVIDIA library updates. Yeah, maybe run us through some of those changes and some of the innovations there. Yeah, for sure. [00:27:19]Tri: So FlashAttention 2 is something I've been working on for the past couple of months. So the story is the NVIDIA CUTLASS team, they released a new version of their library, which contains all these primitives to allow you to do matrix multiply or memory loading on GPU efficiently. So it's a great library and I built on that. So they released their version 3 back in January and I got really excited and I wanted to play with that library. So as an excuse, I was just like, okay, I'm going to refactor my code and use this library. So that was kind of the start of the project. By the end, I just ended up working with the code a whole lot more and I realized that, hey, there are these inefficiencies still in Flash Attention. We could change this way or that way and make it, in the end, twice as fast. But of course, building on the library that the NVIDIA folks released. So that was kind of a really fun exercise. I was starting out, it's just an excuse for myself to play with the new library. What ended up was several months of improvement, improving Flash Attention, discovering new ideas. And in the end, we managed to make it 2x faster and now it's pretty close to probably the efficiency of things like matrix multiply, which is probably the most optimized subroutine on the planet. So we're really happy about it. The NVIDIA Cutlass team has been very supportive and hopefully in the future, we're going to collaborate more. [00:28:46]Alessio: And since it's an NVIDIA library, can you only run this on CUDA runtimes? Or could you use this and then run it on an AMD GPU? [00:28:56]Tri: Yeah, so it's an NVIDIA library. So right now, the code we release runs on NVIDIA GPUs, which is what most people are using to train models. Of course, there are emerging other hardware as well. So the AMD folks did implement a version of Flash Attention, I think last year as well, and that's also available. I think there's some implementation on CPU as well. For example, there's this library, ggml, where they implemented the same idea running on Mac and CPU. So I think that kind of broadly, the idea would apply. The current implementation ended up using NVIDIA's library or primitives, but I expect these ideas to be broadly applicable to different hardware. I think the main idea is you have asymmetry in memory hierarchy, which tends to be everywhere in a lot of accelerators. [00:29:46]Alessio: Yeah, it kind of reminds me of Sara Hooker's post, like the hardware lottery. There could be all these things that are much better, like architectures that are better, but they're not better on NVIDIA. So we're never going to know if they're actually improved. How does that play into some of the research that you all do too? [00:30:04]Tri: Yeah, so absolutely. Yeah, I think Sara Hooker, she wrote this piece on hardware lottery, and I think she captured really well of what a lot of people have been thinking about this. 
And I certainly think about hardware lottery quite a bit, given that I do some of the work that's kind of really low level at the level of, hey, we're optimizing for GPUs or NVIDIA GPUs and optimizing for attention itself. And at the same time, I also work on algorithms and methods and transformer alternatives. And we do see this effect in play, not just hardware lottery, but also kind of software framework lottery. You know, attention has been popular for six years now. And so many kind of engineer hours has been spent on making it as easy and efficient as possible to run transformer, right? And there's libraries to do all kinds of tensor parallel, pipeline parallel, if you use transformer. Let's say someone else developed alternatives, or let's just take recurrent neural nets, like LSTM, GRU. If we want to do that and run that efficiently on current hardware with current software framework, that's quite a bit harder. So in some sense, there is this feedback loop where somehow the model architectures that take advantage of hardware become popular. And the hardware will also kind of evolve to optimize a little bit for that kind of architecture and software framework will also evolve to optimize for that particular architecture. Right now, transformer is the dominant architecture. So yeah, I'm not sure if there is a good way out of this. Of course, there's a lot of development. Things like, I think compilers will play a role because compilers allow you to maybe still be much more efficient across different kinds of hardware because essentially you write the same code and compiler will be able to make it run efficiently different kinds of hardware. So for example, there's this language Mojo, they're compiler experts, right? And their bet is AI models will be running on different kinds of devices. So let's make sure that we have really good compilers with a good language that then the compiler can do a good job optimizing for all kinds of devices. So that's maybe one way that you can get out of this cycle. But yeah, I'm not sure of a good way. In my own research, I have to think about both the algorithm new model and how it maps to hardware. So there are crazy ideas that seem really good, but will be really, really difficult to run efficiently. And so as a result, for example, we can't really scale some of the architectures up simply because they're not hardware friendly. I have to think about both sides when I'm working on new models. [00:32:50]Alessio: Yeah. Have you spent any time looking at some of the new kind of like AI chips companies, so to speak, like the Cerebras of the world? Like one of their innovations is co-locating everything on the chip. So you remove some of this memory bandwidth issue. How do you think about that? [00:33:07]Tri: Yeah, I think that's an interesting bet. I think Tesla also has this Dojo supercomputer where they try to have essentially as fast on-chip memory as possible and removing some of these data transfer back and forth. I think that's a promising direction. The issues I could see, you know, I'm definitely not a hardware expert. One issue is the on-chip memory tends to be really expensive to manufacture, much more expensive per gigabyte compared to off-chip memory. So I talked to, you know, some of my friends at Cerebros and, you know, they have their own stack and compiler and so on, and they can make it work. The other kind of obstacle is, again, with compiler and software framework and so on. 
For example, if you can run PyTorch on this stuff, lots of people will be using it. But supporting all the operations in PyTorch will take a long time to implement. Of course, people are working on this. So I think, yeah, we kind of need these different bets on the hardware side as well. Hardware has, my understanding is, has a kind of a longer time scale. So you need to design hardware, you need to manufacture it, you know, maybe on the order of three to five years or something like that. So people are taking different bets, but the AI landscape is changing so fast that it's hard to predict, okay, what kind of models will be dominant in, let's say, three or five years. Or thinking back five years ago, would we have known that Transformer would have been the dominant architecture? Maybe, maybe not, right? And so different people will make different bets on the hardware side. [00:34:39]Alessio: Does the pace of the industry and the research also influence the PhD research itself? For example, in your case, you're working on improving attention. It probably took you quite a while to write the paper and everything, but in the meantime, you could have had a new model architecture come out and then it's like nobody cares about attention anymore. How do people balance that? [00:35:02]Tri: Yeah, so I think it's tough. It's definitely tough for PhD students, for researchers. Given that the field is moving really, really fast, I think it comes down to understanding fundamental. Because that's essentially, for example, what the PhD allows you to do. It's been a couple of years understanding the fundamentals. So for example, when I started my PhD, I was working on understanding matrix vector multiply, which has been a concept that's been around for hundreds of years. We were trying to characterize what kind of matrices would have theoretically fast multiplication algorithm. That seems to have nothing to do with AI or anything. But I think that was a time when I developed mathematical maturity and research taste and research skill. The research topic at that point didn't have to be super trendy or anything, as long as I'm developing skills as a researcher, I'm making progress. And eventually, I've gotten quite a bit better in terms of research skills. And that allows, for example, PhD students later in their career to quickly develop solutions to whatever problems they're facing. So I think that's just the natural arc of how you're being trained as a researcher. For a lot of PhD students, I think given the pace is so fast, maybe it's harder to justify spending a lot of time on the fundamental. And it's tough. What is this kind of explore, exploit kind of dilemma? And I don't think there's a universal answer. So I personally spend some time doing this kind of exploration, reading random textbooks or lecture notes. And I spend some time keeping up with the latest architecture or methods and so on. I don't know if there's a right balance. It varies from person to person. But if you only spend 100% on one, either you only do exploration or only do exploitation, I think it probably won't work in the long term. It's probably going to have to be a mix and you have to just experiment and kind of be introspective and say, hey, I tried this kind of mixture of, I don't know, one exploration paper and one exploitation paper. How did that work out for me? Should I, you know, having conversation with, for example, my advisor about like, hey, did that work out? You know, should I shift? 
I focus more on one or the other. I think quickly adjusting and focusing on the process, I think that's probably the right way. I don't have like a specific recommendation that, hey, you focus, I don't know, 60% on lecture notes and 40% on archive papers or anything like that. [00:37:35]Alessio: Let's talk about some Transformer alternatives. You know, say Jonathan Frankle loses his bet and Transformer is not the state of the art architecture. What are some of the candidates to take over? [00:37:49]Tri: Yeah, so this bet is quite fun. So my understanding is this bet between Jonathan Frankle and Sasha Rush, right? I've talked to Sasha a bunch and I think he recently gave an excellent tutorial on Transformer alternatives as well. So I would recommend that. So just to quickly recap, I think there's been quite a bit of development more recently about Transformer alternatives. So architectures that are not Transformer, right? And the question is, can they do well on, for example, language modeling, which is kind of the application that a lot of people care about these days. So there are methods based on state space methods that came out in 2021 from Albert Gu, Karan Goel, and Chris Re that presumably could do much better in terms of capturing long range information while not scaling quadratically. They scale sub-quadratically in terms of sequence length. So potentially you could have a much more efficient architecture when sequence length gets really long. The other ones have been focusing more on recurrent neural nets, which is, again, an old idea, but adapting to the new landscape. So things like RWKV, I've also personally worked in this space as well. So there's been some promising results. So there's been some results here and there that show that, hey, these alternatives, either RNN or state space methods, can match the performance of Transformer on language modeling. So that's really exciting. And we're starting to understand on the academic research side, we want to understand, do we really need attention? I think that's a valuable kind of intellectual thing to understand. And maybe we do, maybe we don't. If we want to know, we need to spend serious effort on trying the alternatives. And there's been folks pushing on this direction. I think RWKV has scaled up to, they have a model at 14 billion that seems pretty competitive with Transformer. So that's really exciting. That's kind of an intellectual thing. We want to figure out if attention is necessary. So that's one motivation. The other motivation is Transformer alternatives could have an advantage in practice in some of the use cases. So one use case is really long sequences. The other is really high throughput of generation. So for really long sequences, when you train with Transformer, with flash attention and so on, the computation is still quadratic in the sequence length. So if your sequence length is on the order of, I don't know, 16K, 32K, 100K or something, which some of these models have sequence length 100K, then you do get significantly slower in terms of training, also in terms of inference. So maybe these alternative architectures could scale better in terms of sequence length. I haven't seen actual validation on this. Let's say an RNN model released with context length, I don't know, 100K or something. I haven't really seen that. But the hope could be that as we scale to long sequences, these alternative architectures could be more well-suited.
Not just text, but things like high resolution images, audio, video, and so on, which are emerging applications. So that's one, long sequences. Number two is high throughput generation, where I can imagine scenarios where the application isn't like an interactive chatbot, but let's say a company wants to batch as many requests as possible on their server, or they're doing offline processing, they're generating stuff based on their internal documents, that you need to process in batch. And the issue with Transformer is that during generation, it essentially needs to keep around all the previous history. It's called the KV cache. And that could take a significant amount of memory, so you can't really batch too much because you run out of memory. I am personally bullish on RNNs. I think RNNs, they essentially summarize the past into a state vector that has fixed size, so the size doesn't grow with the history. So that means that you don't need as much memory to keep around all the previous tokens. And as a result, I think you can scale to much higher batch sizes. And as a result, you can make much more efficient use of the GPUs or the accelerator, and you could have much higher generation throughput. Now, this, I don't think, has been validated at scale. So as a researcher, I'm bullish on this stuff because I think in the next couple of years, these are use cases where these alternatives could have an advantage. We'll just kind of have to wait and see to see if these things will happen. I am personally bullish on this stuff. At the same time, I also spend a bunch of time making attention as fast as possible. So maybe hedging and playing both sides. Ultimately, we want to understand, as researchers, we want to understand what works, why do the models have these capabilities? And one way is, let's push attention to be as efficient as possible. On the other hand, let's push other alternatives to be as efficient at scale, as big as possible, and so that we can kind of compare them and understand. Yeah, awesome. [00:43:01]Alessio: And I think as long as all of this work happens in the open, it's a net positive for everybody to explore all the paths. Yeah, let's talk about open-source AI. Obviously, Together, when Red Pajama came out, which was an open clone of the LLAMA1 pre-training dataset, it was a big thing in the industry. LLAMA2 came out on Tuesday, I forget. And this week, there's been a lot of things going on, which they call open-source, but it's not really open-source. Actually, we wrote a post about it that was on the front page of Hacker News before this podcast, so I was frantically responding. How do you think about what open-source AI really is? In my mind, in open-source software, we have different levels of open. So there's free software, that's like the GPL license. There's open-source, which is Apache, MIT. And then there's kind of restricted open-source, which is the SSPL and some of these other licenses. In AI, you have the open models. So Red Pajama is an open model because you have the pre-training dataset, you have the training runs and everything. And then there's obviously randomness that doesn't make it one-to-one if you retrain it. Then you have the open-weights model that's kind of like StableLM, where the weights are open, but the dataset is not open. And then you have LLAMA2, where the dataset is not open and the weights are restricted. It's kind of like not really open-source, but open enough.
I think it's net positive because it's like $3 million of flops donated to the public. [00:44:32]Tri: How do you think about that? [00:44:34]Alessio: And also, as you work at Together, what is your philosophy with open-source AI? Right, right. [00:44:40]Tri: Yeah, I think that's a great question. And I think about it in maybe more practical terms. So of course, Meta has done an amazing job training LLAMA1, LLAMA2. And for LLAMA2, they made it much less restrictive compared to LLAMA1. Now you can use it for businesses, unless you have a very large number of monthly active users or something like that. I think just this change will have a very significant impact on the kind of landscape of open-source AI, where now lots of businesses, lots of companies will be using, I expect will be using things like LLAMA2. They will fine-tune on their own dataset. They will be serving variants or derivatives of LLAMA2. Whereas before, with LLAMA1, it was also a really good model, but businesses weren't allowed to do that. So I think on a more practical term, it's kind of shifting the balance between a closed-source model like OpenAI and Anthropic and Google, where you're making API calls, right? And maybe you don't understand as much of what the model is doing, how the model is changing, and so on. Versus now, we have a model with open weights that is pretty competitive from what I've seen in terms of benchmarks, pretty competitive with GPT 3.5, right? And if you fine-tune it on your own data, maybe it's more well-suited for your own data. And I do see that's going to shift the balance of it. More and more folks are going to be using, let's say, derivatives of LLAMA2. More and more folks are going to fine-tune and serve their own model instead of calling an API. So that shifting of balance is important because in one way, we don't want just a concentration of decision-making power in the hands of a few companies. So I think that's a really positive development from Meta. Of course, training the model takes a couple of millions of dollars, but I'm sure their engineers spent tons of time trying many, many different things. So the actual cost is probably way more than that. And they make the weights available, and probably a lot of companies are going to be using this. So I think that's a really positive development. And we've also seen amazing progress on the open source community where they would take these models and they either fine-tune on different kinds of data sets or even make changes to the model. So as an example, I think for LLAMA1, the context length was limited to 2K. Like a bunch of folks figured out some really simple methods to scale up to like 8K. [00:47:12]Alessio: Like the RoPE. [00:47:13]Tri: Yes. I think the open source community is very creative, right? And lots of people. LLAMA2 will, again, kind of accelerate this where more people will try it out. More people will make tweaks to it and make a contribution and then so on. So overall, I think I see that as still a very positive development for the field. And there's been lots of libraries that will allow you to host or fine-tune these models, like even with quantization and so on. Just a couple of hours after LLAMA2 was released, tons of companies announced that, hey, it's on our API or hosting and so on, and Together did the same. So it's a very fast-paced development, and just kind of a model with available weights that businesses are allowed to use, I think that alone is already a very positive development.
At the same time, yeah, we can do much better in terms of releasing data sets. Data sets tend to be... Somehow people are not incentivized to release data sets. So philosophically, yeah, you want to be as open as possible. But on a practical term, I think it's a little bit harder for companies to release data sets. Legal issues. The data sets released tend to be not as eye-catching as the model release. So maybe people are less incentivized to do that. We've seen quite a few companies releasing data sets. Together released the Red Pajama data set. I think Cerebras then worked on that and deduplicated and cleaned it up and released SlimPajama and so on. So we're also seeing positive development on that front, kind of on the pre-training data set side. So I do expect that to continue. And then on the fine-tuning data set or instruction tuning data set, I think we now have quite a few open data sets on instruction tuning and fine-tuning. But these companies do pay for human labelers to annotate these instruction tuning data sets. And that is expensive. And maybe they will see that as their competitive advantage. And so it's harder to incentivize these companies to release these data sets. So I think on a practical term, we're still going to make a lot of progress on open source AI, on the model development, on model hosting, on pre-training data sets and fine-tuning data sets. Right now, maybe we don't have the perfect open source model where all the data sets are available. Maybe we don't have such a thing yet, but we've seen very fast development on the open source side. I think just maybe this time last year, there weren't as many models that are competitive with, let's say, ChatGPT. [00:49:43]Alessio: Yeah, I think the open data sets have so much more impact than open models. If you think about EleutherAI and the work that they've done, GPT-J was great, and the Pythia models are great, but the Pile and the Stack, everybody uses them. So hopefully we get more people to contribute time to work on data sets instead of doing the 100th open model that performs worse than all the other ones, but they want to say they released the model. [00:50:14]Tri: Yeah, maybe the question is, how do we figure out an incentive structure so that companies are willing to release open data sets? And for example, it could be like, I think some of the organizations are now doing this where they are asking volunteers to annotate and so on. And maybe the Wikipedia model of data sets, especially for instruction tuning, could be interesting, where people actually volunteer their time and, instead of editing Wikipedia, add annotation. And somehow they are acknowledged and feel incentivized to do so. Hopefully we get to that kind of level where, in terms of data, it would be kind of like Wikipedia. And in terms of model development, it's kind of like Linux where people are contributing patches and improving the model in some way. I don't know exactly how that's going to happen, but based on history, I think there is a way to get there. [00:51:05]Alessio: Yeah, I think the Dolly-15K data set is a good example of a company saying, let's do this smaller thing, just make sure we make it open. We had Mike Conover from Databricks on the podcast, and he was like, people just bought into it and leadership was bought into it. You have companies out there with 200,000, 300,000 employees. It's like, just put some of them to label some data. It's going to be helpful. So I'm curious to see how that evolves. What made you decide to join Together? 
[00:51:35]Tri: For Together, the focus has been a lot on open source models. And I think that aligns quite well with what I care about, of course. I also know a bunch of people there that I know and trust, and I'm excited to work with them. Philosophically, the way they've been really open with data set and model releases, I like that a lot. Personally, for example with the research that I've developed, we also try to make code available, free to use and modify and so on, contributing to the community. That has given us really valuable feedback from the community and improved our work. So philosophically, I like the way Together has been focusing on open source models. And the nice thing is we're also going to be at the forefront of research, and the kind of research areas that I'm really excited about, things like efficient training and inference, align quite well with what the company is doing. We'll try our best to make things open and available to everyone. Yeah, but it's going to be fun being at the company, leading a team, doing research on the topic that I really care about, and hopefully we'll make things open to benefit the community. [00:52:45]Alessio: Awesome. Let's jump into the lightning round. Usually, I have two questions. So one is on acceleration, one on exploration, and then a takeaway. So the first one is, what's something that already happened in AI and machine learning that you thought would take much longer than it has? [00:53:01]Tri: I think understanding jokes. I didn't expect that to happen, but it turns out that by scaling models up and training on lots of data, the model can now understand jokes. Maybe it's a small thing, but that was amazing to me. [00:53:16]Alessio: What about the exploration side? What are some of the most interesting unsolved questions in the space? [00:53:22]Tri: I would say reasoning in the broad term. We don't really know how these models do it. Essentially, they do something that looks like reasoning. We don't know how they're doing it. We have some ideas. And in the future, I think we will need to design architectures that explicitly have some kind of reasoning module in them if we want to have much more capable models. [00:53:43]Alessio: What's one message you want everyone to remember today? [00:53:47]Tri: I would say try to understand both the algorithms and the systems that these algorithms run on. I think the intersection of machine learning and systems has been really exciting, and there's been a lot of amazing results at this intersection. And then when you scale models to large scale, both the machine learning side and the system side really matter. [00:54:06]Alessio: Awesome. Well, thank you so much for coming on, Tri. [00:54:09]Tri: This was great. Yeah, this has been really fun. [00:54:11] Get full access to Latent Space at www.latent.space/subscribe

Power Law with John Coogan
George Hotz Hacked Everything (tinycorp)

Power Law with John Coogan

Play Episode Listen Later Jul 21, 2023 48:09


The full story of George Hotz, starting as a kid in New Jersey, to hacking the iPhone, getting sued by Sony, then building a self-driving car company and eventually taking on Nvidia. Incredible person, hope you enjoy.

This Day in AI Podcast
Aligning Super Intelligence & The Open Web's Uncertain Future + GPT-4 General Availability | EP22

This Day in AI Podcast

Play Episode Listen Later Jul 7, 2023 68:07


In Episode 22 we cover GPT-4 general availability and the implications for millions of developers. We discuss George Hotz revealing how GPT-4 works & Alignment of Super Intelligence. Will OpenAI's Super Intelligence succeed? We also cover the implications of AI agents on the open web, OpenAI disabling web browsing in ChatGPT temporarily to protect content creators, and discover Jolly Roger's AI time-wasting service to trick scammers. If you like this podcast please consider subscribing, liking and leaving us a comment. We really appreciate your support of the show. CHAPTERS: 00:00 - Cold open 00:27 - GPT-4 API General Availability for Developers 06:40 - ChatGPT Code Interpreter 15:30 - ChatGPT as Work Assistant: Code Interpreter + Vision 16:42 - ChatGPT web traffic down by 10% & thoughts on single agent vs AI everywhere 21:03 - George Hotz leaking GPT-4 is a 16-way mixture model 26:50 - Is AI alignment just a human alignment problem? 27:57 - Is alignment making GPT-4 and ChatGPT worse? Discussion on AI alignment 38:18 - Is the best defense against super intelligent AI giving everyone super intelligent AI? 40:55 - Is the Open Web Doomed? AI's impact on the Open Web 54:42 - OpenAI's 20% compute power to "Superalignment" 1:00:29 - OpenAI now sends email threats 1:03:12 - Waste Scammers Time with This AI Voice Tool. SOURCES: https://openai.com/blog/gpt-4-api-general-availability https://twitter.com/gdb/status/1677023789807292420?s=46&t=uXHUN4Glah4CaV-g2czc6Q https://www.reddit.com/r/ChatGPT/comments/14s72g4/chatgpt_drops_10_traffic_as_excitement_around_it/ https://www.youtube.com/watch?v=1v-qvVIje4Y https://thealgorithmicbridge.substack.com/p/gpt-4s-secret-has-been-revealed https://www.axios.com/2023/07/05/public-web-twitter-open-access-reddit https://twitter.com/openai/status/1676072388436594688?s=46&t=uXHUN4Glah4CaV-g2czc6Q https://openai.com/blog/introducing-superalignment https://www.reddit.com/r/MachineLearning/comments/14ripua/d_introducing_superalignment_openai/ https://www.reddit.com/r/ChatGPT/comments/14ruui2/i_use_chatgpt_for_hours_everyday_and_can_say_100/ https://www.reddit.com/r/ChatGPT/comments/14qwa6m/openai_now_sends_email_threats/ https://www.wsj.com/articles/people-hire-phone-bots-to-torture-telemarketers-2dbb8457 https://jollyrogertelephone.com/

Lex Fridman Podcast
#387 – George Hotz: Tiny Corp, Twitter, AI Safety, Self-Driving, GPT, AGI & God

Lex Fridman Podcast

Play Episode Listen Later Jun 30, 2023 195:19


George Hotz is a programmer, hacker, and the founder of comma-ai and tiny corp. Please support this podcast by checking out our sponsors: - Numerai: https://numer.ai/lex - Babbel: https://babbel.com/lexpod and use code Lexpod to get 55% off - NetSuite: http://netsuite.com/lex to get free product tour - InsideTracker: https://insidetracker.com/lex to get 20% off - AG1: https://drinkag1.com/lex to get 1 year of Vitamin D and 5 free travel packs Transcript: https://lexfridman.com/george-hotz-3-transcript EPISODE LINKS: George's Twitter: https://twitter.com/realgeorgehotz George's Twitch: https://twitch.tv/georgehotz George's Instagram: https://instagram.com/georgehotz Tiny Corp's Twitter: https://twitter.com/__tinygrad__ Tiny Corp's Website: https://tinygrad.org/ Comma-ai's Twitter: https://twitter.com/comma_ai Comma-ai's Website: https://comma.ai/ Comma-ai's YouTube (unofficial): https://youtube.com/georgehotzarchive Mentioned: Learning a Driving Simulator (paper): https://bit.ly/42T6lAN PODCAST INFO: Podcast website: https://lexfridman.com/podcast Apple Podcasts: https://apple.co/2lwqZIr Spotify: https://spoti.fi/2nEwCF8 RSS: https://lexfridman.com/feed/podcast/ YouTube Full Episodes: https://youtube.com/lexfridman YouTube Clips: https://youtube.com/lexclips SUPPORT & CONNECT: - Check out the sponsors above, it's the best way to support this podcast - Support on Patreon: https://www.patreon.com/lexfridman - Twitter: https://twitter.com/lexfridman - Instagram: https://www.instagram.com/lexfridman - LinkedIn: https://www.linkedin.com/in/lexfridman - Facebook: https://www.facebook.com/lexfridman - Medium: https://medium.com/@lexfridman OUTLINE: Here's the timestamps for the episode. On some podcast players you should be able to click the timestamp to jump to that time. (00:00) - Introduction (08:04) - Time is an illusion (17:44) - Memes (20:20) - Eliezer Yudkowsky (32:45) - Virtual reality (39:04) - AI friends (46:29) - tiny corp (59:50) - NVIDIA vs AMD (1:02:47) - tinybox (1:14:56) - Self-driving (1:29:35) - Programming (1:37:31) - AI safety (2:02:29) - Working at Twitter (2:40:12) - Prompt engineering (2:46:08) - Video games (3:02:23) - Andrej Karpathy (3:12:28) - Meaning of life

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

We are now launching our dedicated new YouTube and Twitter! Any help in amplifying our podcast would be greatly appreciated, and of course, tell your friends! Notable follow-on discussions collected on Twitter, Reddit, Reddit, Reddit, HN, and HN. Please don't obsess too much over the GPT4 discussion as it is mostly rumor; we spent much more time on tinybox/tinygrad, on which George is the foremost authority! We are excited to share the world's first interview with George Hotz on the tiny corp! If you don't know George, he was the first person to unlock the iPhone, jailbreak the PS3, went on to start Comma.ai, and briefly “interned” at the Elon Musk-run Twitter. Tinycorp is the company behind the deep learning framework tinygrad, as well as the recently announced tinybox, a new $15,000 “luxury AI computer” aimed at local model training and inference, aka your “personal compute cluster”:
* 738 FP16 TFLOPS
* 144 GB GPU RAM
* 5.76 TB/s RAM bandwidth
* 30 GB/s model load bandwidth (big llama loads in around 4 seconds)
* AMD EPYC CPU
* 1600W (one 120V outlet)
* Runs 65B FP16 LLaMA out of the box (using tinygrad, subject to software development risks)
(In the episode, we also talked about the future of the tinybox as the intelligence center of every home that will help run models, at-home robots, and more. Make sure to check the timestamps
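As a rough sanity check (our own back-of-the-envelope arithmetic, not figures from the episode), the headline specs are mutually consistent:

```python
# Back-of-the-envelope check of the tinybox headline specs (illustrative, not official math).
params_65b = 65e9          # LLaMA 65B parameters
bytes_per_param_fp16 = 2   # FP16 = 2 bytes per weight

model_size_gb = params_65b * bytes_per_param_fp16 / 1e9
print(f"65B FP16 weights ~ {model_size_gb:.0f} GB (fits in the 144 GB of GPU RAM)")

load_bandwidth_gbps = 30   # quoted model load bandwidth, GB/s
print(f"Load time ~ {model_size_gb / load_bandwidth_gbps:.1f} s (matches 'around 4 seconds')")

# Memory bandwidth also bounds single-batch generation speed: every generated token
# has to stream all the weights through the compute units at least once.
ram_bandwidth_tbps = 5.76  # TB/s
print(f"Upper bound ~ {ram_bandwidth_tbps * 1e3 / model_size_gb:.0f} tokens/s for one 65B FP16 stream")
```

In other words, 65B parameters in FP16 is about 130 GB of weights, which fits in the 144 GB of GPU RAM, loads in roughly 130 / 30 ≈ 4.3 seconds at the quoted bandwidth, and the 5.76 TB/s of memory bandwidth caps single-stream generation at a few dozen tokens per second.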

This Week in Google (MP3)
TWiG 718: Clean As a Whistle - AI scientist statement, Nvidia Spectrum-X, Amazon's kill list, Mirai botnet

This Week in Google (MP3)

Play Episode Listen Later Jun 1, 2023 159:31


Why AI Scientists Are Freaking Us Out. Pessimists Archive. Just Calm Down About GPT-4 Already. The Seven Deadly Sins of Predicting the Future of AI. Stacey gets grilled on AI. Big Tech Isn't Prepared for A.I.'s Next Chapter. George Hotz wants to 'make driving chill' with next-gen Comma 3 hands-free driver assist. The Spectrum-X ethernet switch offers "lossless" transmission via a new kind of congestion control, says Nvidia. Once Mighty Intel Struggles to Escape 'Mud Hole'. The Strange Story of the Teens Behind the Mirai Botnet. Google starts rolling out Search Generative Experience (SGE) in preview. Google Assistant's 3rd-party Notes & Lists integration is shutting down. Messages Magic Compose beta starts rolling out: RCS only, priority for Google One subs. NFL Sunday Ticket on YouTube TV will support unlimited simultaneous streams. Google quietly ends support for decade-old Chromecast. YouTube Stories, Google's clone of Snapchat, is dying on June 26. Section 230: Supreme Court rejects lawsuit that sought to hold Reddit responsible for hosting child pornography. Amazon's Kill List: 37 Projects That Are No More. Twitter may be worth only a third of its pre-Musk value, Fidelity says. Sonos wins $32.5 million patent infringement victory over Google. Leo previews new Mastodon features. Picks: Stacey - Poverty, by America by Matthew Desmond. Jeff - Wonderful discussion between Reid Hoffman & Trevor Noah. Jeff - Google offloading 1.4 million square feet of Bay Area office space. Ant - Talking Science and Sports with Stephen A Smith and Neil DeGrasse Tyson. Hosts: Leo Laporte, Jeff Jarvis, Stacey Higginbotham, and Ant Pruitt Download or subscribe to this show at https://twit.tv/shows/this-week-in-google. Get episodes ad-free with Club TWiT at https://twit.tv/clubtwit Sponsor: CDW.com/HPE

Lexman Artificial
George Hotz Interview (Bowwow Project)

Lexman Artificial

Play Episode Listen Later Mar 9, 2023 2:58


Lexman interviews Google's George Hotz about his new bowwow project. Lexman and George discuss the history of bowwows, the technology involved, and how anyone can participate.

Lexman Artificial
George Hotz on Crannies, Amerinds, and Atropine

Lexman Artificial

Play Episode Listen Later Mar 5, 2023 4:59


In this episode, we interview George Hotz, the hacker and entrepreneur who is most famously known for his work in breaking into Apple's iPhone division. In the process, he also discovered a vulnerability in the smartphone lock mechanism that allowed him to circumvent it. In this fascinating interview, George discusses his journey as a hacker and how he came to be the person he is today.

Lexman Artificial
why Aquamanale is terrible, progressivists are outdated, and vilification is a bad move

Lexman Artificial

Play Episode Listen Later Feb 28, 2023 4:19


The new Apple product announcement was met with mixed reactions. Some found it inspirational, while others labeled it vilification. Our guest, professional hacker and famed Tesla-enthusiast George Hotz, breaks down why he thinks the way we view progressivism is antiquated and misguided.

The Array Cast
Leslie Goldsmith, from I.P. Sharp to KX

The Array Cast

Play Episode Listen Later Feb 18, 2023 96:15


Array Cast - February 17, 2023 Show NotesThanks to Bob Therriault and Adám Brudzewsky for gathering these links:[01] 00:01:20 APLSeeds '23 https://www.dyalog.com/apl-seeds-user-meetings/aplseeds23.htm[02] 00:02:26 KXCon https://kx.com/events/kx-con-2023/[03] 00:04:30 plrank.com https://plrank.com/[04] 00:05:30 Michael Higginson ArrayCast Episode https://www.arraycast.com/episodes/episode46-michael-higginson Iverson Centenary https://britishaplassociation.org/iverson-centenary-december-2020/[05] 00:06:30 Lower Canada College https://www.lcc.ca/ IBM 2741 https://en.wikipedia.org/wiki/IBM_2741 Anderson Jacobson serial modem https://en.wikipedia.org/wiki/Anderson_Jacobson Gilman and Rose - An Interactive Approach https://apl.wiki/Books#APL_―_An_Interactive_Approach Leap Year https://en.wikipedia.org/wiki/Gregorian_calendar#Accuracy APL implementation Leap Year https://tryapl.org/?clear&q=≠%E2%8C%BF0%3D4000%20400%20100%204∘.%7C1600%201700%201800%201900%202000%202100%202200%202300%204000&run[06] 00:13:53 Larry Breed https://apl.wiki/Larry_Breed Scientific Time Sharing Corporation https://en.wikipedia.org/wiki/Scientific_Time_Sharing_Corporation STSC promotional video https://www.youtube.com/watch?v=mjgkhK-nXmk 666 BOX https://www.jsoftware.com/papers/APLQA.htm#666box[07] 00:17:20 University of Toronto https://www.utoronto.ca/ Arthur Whitney https://aplwiki.com/wiki/Arthur_Whitney I.P. Sharp and Associates https://apl.wiki/I.P._Sharp_Associates[08] 00:18:23 360 Assembler https://en.wikipedia.org/wiki/IBM_System/360_architecture#Instruction_formats Eric Iverson https://aplwiki.com/wiki/Eric_Iverson[09] 00:22:40 IESO https://www.ieso.ca/[10] 00:22:50 Smart Meters https://en.wikipedia.org/wiki/Smart_meter[11] 00:23:30 Kdb+ https://code.kx.com/q/learn/brief-introduction/[12] 00:24:30 First Derivatives https://fdtechnologies.com/ KX https://kx.com/ KX Sensors https://kx.com/solutions/energy-utilities/[13] 00:27:52 George Hotz https://en.wikipedia.org/wiki/George_Hotz[14] 00:36:56 ⎕ec https://abrudz.github.io/SAX2/SAX61.pdf#page=790[15] 00:41:20 APL Programming Language https://en.wikipedia.org/wiki/APL_(programming_language) Blackberry RIM https://en.wikipedia.org/wiki/BlackBerry_Limited Rise and Fall of Blackberry https://www.businessinsider.com/blackberry-smartphone-rise-fall-mobile-failure-innovate-2019-11[16] 00:45:45 Ken Iverson https://en.wikipedia.org/wiki/Kenneth_E._Iverson Guy Steele https://en.wikipedia.org/wiki/Guy_L._Steele_Jr.[17] 00:55:12 Nick Psaris ArrayCast Episode https://www.arraycast.com/episodes/episode42-nick-psaris-q[18] 00:56:23 Right Parenthesis ) https://apl.wiki/System_command Quad ⎕ https://apl.wiki/Quad_name[19] 00:57:16 APL2 https://aplwiki.com/wiki/APL2 Axiom System https://dl.acm.org/doi/pdf/10.1145/800136.804446 Trenchard More https://en.wikipedia.org/wiki/Trenchard_More Jim Brown https://aplwiki.com/wiki/Jim_Brown[20] 00:59:28 SHARP APL https://aplwiki.com/wiki/SHARP_APL Roger Moore https://en.wikipedia.org/wiki/Roger_Moore_(computer_scientist) Richard Lathwell https://en.wikipedia.org/wiki/Richard_H._Lathwell[21] 01:04:15 Array Models https://apl.wiki/Array_model[22] 01:04:50 Strand Notation https://apl.wiki/Strand_notation[23] 01:06:05 J Programming Language https://www.jsoftware.com/indexno.html[24] 01:18:02 q Programming Language https://kx.com/academy/ Type of https://apl.wiki/Type[25] 01:21:13 Haskell Programming Language https://www.haskell.org/[26] 01:24:30 ⎕ML Migration Level https://apl.wiki/Migration_level[27] 01:25:50 Oxide and Friends 
https://oxide.computer/podcasts/oxide-and-friends Java Pubhouse https://www.javapubhouse.com/episodes[28] 01:27:26 British APL Meetings https://britishaplassociation.org/[29] 01:35:06 contact AT ArrayCast DOT COM

Noticias de Tecnología Express
VALL-E puede imitar cualquier voz - NTX 284

Noticias de Tecnología Express

Play Episode Listen Later Jan 10, 2023 5:58


Microsoft announces VALL-E, the Comma 3 outperforms Autopilot, and disinformation campaigns are not that effective. You can support the production of this show with a subscription. More information here. News: - Sources cited by Semafor indicate that Microsoft is in talks to increase its investment in OpenAI by $10 billion in exchange for a 49% stake in the company. - Starting in February, Instagram will remove the Shopping tab from the Instagram home page. The Reels tab will take its place and the Create button will return to the center of the app. It is not yet known how the menus for accessing shopping will be adjusted. - A study by New York University's Center for Social Media and Politics, published in the journal Nature Communications, found that the Russian disinformation campaigns launched on Twitter during 2016 reached few users. - If you were eagerly waiting for Tesla to roll out autonomous driving in its vehicles, we have a surprise for you. George Hotz presented the Comma 3, a next-generation driver-assistance hardware device launched by his company, Comma.ai. - Microsoft announced a new text-to-speech generator called VALL-E, which can simulate a person's voice from a three-second recording of them speaking. - Analysis: Simulating other people's voices. Prefer to read the news? Subscribe to my newsletter and it will arrive every day! Become a member at https://plus.acast.com/s/noticias-de-tecnologia-express. Hosted on Acast. See acast.com/privacy for more information.

la quotidienne iWeek
L'iPhone SE restera peut-être sans successeur en 2024

la quotidienne iWeek

Play Episode Listen Later Dec 21, 2022 9:35


Subscribe: it's free! Here is episode 305 of "la quotidienne iWeek" for this Wednesday, December 21, 2022. The iPhone SE may remain without a successor in 2024. Hosted by: Benjamin VINCENT (@benjaminvincent) + Fabrice NEUMAN (@FabriceNeuman). Production: OUATCH Audio. Tags: the SE 4 unlikely; after USB-C, the battery; the Apple TV app coming soon to Android?; Pixelmator Pro's magic new feature; five weeks and then it's gone. Welcome to "la quotidienne iWeek" if you are listening to us for the first time; tell people around you about us, retweet us (@iweeknews), have a good day, happy listening, and see you tomorrow! Benjamin VINCENT & the #iweekLQI team. PS1: join the iWeek community on Patreon and enjoy exclusive bonuses! PS2: also find us on iWeek (la semaine Apple), our weekly podcast, now online every Wednesday evening. PS3: the new episode 117 of iWeek (la semaine Apple) will be available this Thursday at the end of the day.

Podcast Notes Playlist: Latest Episodes
The Business of Strip Clubs, Black Friday Boycott, and Three Twitter Scenarios

Podcast Notes Playlist: Latest Episodes

Play Episode Listen Later Dec 14, 2022 71:12


My First Million Podcast Notes Key Takeaways: "Strippers know a recession is coming before anybody" – Shaan Puri, based on the psychology of spending at strip clubs. "Black Friday is like a license-to-spend when you didn't even need to spend" – Shaan Puri. Sam wants to boycott Black Friday; he hates the idea of buying stuff that you don't need. Shaan's three scenarios for how Elon's Twitter takeover may play out: 1. Elon makes Twitter more financially viable and everything settles in. 2. Twitter becomes the everything app. 3. Negative snowball effect on Elon's finances. Continue reading for more detail and to see which scenario Sam thinks is most likely. Read the full notes @ podcastnotes.org. Episode 390: Shaan Puri (@ShaanVP) and Sam Parr (@TheSamParr) talk about Eric Langan and his foray into owning strip clubs, why we should boycott Black Friday, three Twitter scenarios, how to fly on a private jet and much, much more. ----- Links: * Jettly * r/fatFIRE * Do you love MFM and want to see Sam and Shaan's smiling faces? Subscribe to our Youtube channel. * Want more insights like MFM? Check out Shaan's newsletter. ------ Show Notes: (03:20) - Eric Langan (12:50) - Black Friday (18:40) - Twitter (36:10) - George Hotz (48:10) - Private jets (53:00) - fatFIRE subreddit ----- Past guests on My First Million include Rob Dyrdek, Hasan Minhaj, Balaji Srinivasan, Jake Paul, Dr. Andrew Huberman, Gary Vee, Lance Armstrong, Sophia Amoruso, Ariel Helwani, Ramit Sethi, Stanley Druckenmiller, Peter Diamandis, Dharmesh Shah, Brian Halligan, Marc Lore, Jason Calacanis, Andrew Wilkinson, Julian Shapiro, Kat Cole, Codie Sanchez, Nader Al-Naji, Steph Smith, Trung Phan, Nick Huber, Anthony Pompliano, Ben Askren, Ramon Van Meer, Brianne Kimmel, Andrew Gazdecki, Scott Belsky, Moiz Ali, Dan Held, Elaine Zelby, Michael Saylor, Ryan Begelman, Jack Butcher, Reed Duchscher, Tai Lopez, Harley Finkelstein, Alexa von Tobel, Noah Kagan, Nick Bare, Greg Isenberg, James Altucher, Randy Hetrick and more. ----- Additional episodes you might enjoy: • #224 Rob Dyrdek - How Tracking Every Second of His Life Took Rob Dyrdek from 0 to $405M in Exits • #209 Gary Vaynerchuk - Why NFTS Are the Future • #178 Balaji Srinivasan - Balaji on How to Fix the Media, Cloud Cities & Crypto • #169 - How One Man Started 5 Billion Dollar Companies, Dan Gilbert's Empire, & Talking With Warren Buffett • #218 - Why You Should Take a Think Week Like Bill Gates • Dave Portnoy vs The World, Extreme Body Monitoring, The Future of Apparel Retail, "How Much is Anthony Pompliano Worth?", and More • How Mr Beast Got 100M Views in Less Than 4 Days, The $25M Chrome Extension, and More

Software Engineering Daily
Special Episode with George Hotz

Software Engineering Daily

Play Episode Listen Later Dec 6, 2022 63:52


Comma is a startup aimed at solving self-driving cars. A lot of the new cars in the market have built-in stock Advanced driver assistance systems. Comma takes this system to the next level with Openpilot. Openpilot is an open-source driver assistance system. Currently, with features like Adaptive Cruise Control (ACC), Automated Lane Centering (ALC), Forward The post Special Episode with George Hotz appeared first on Software Engineering Daily.

My First Million
The Business of Strip Clubs, Black Friday Boycott, and Three Twitter Scenarios

My First Million

Play Episode Listen Later Nov 29, 2022 71:12


Episode 390: Shaan Puri (@ShaanVP) and Sam Parr (@TheSamParr) talk about Eric Langan and his foray into owning strip clubs, why we should boycott Black Friday, three Twitter scenarios, how to fly on a private jet and much, much more. ----- Links: * Jettly * r/fatFIRE * Do you love MFM and want to see Sam and Shaan's smiling faces? Subscribe to our Youtube channel. * Want more insights like MFM? Check out Shaan's newsletter. ------ Show Notes: (03:20) - Eric Langan (12:50) - Black Friday (18:40) - Twitter (36:10) - George Hotz (48:10) - Private jets (53:00) - fatFIRE subreddit ----- Past guests on My First Million include Rob Dyrdek, Hasan Minhaj, Balaji Srinivasan, Jake Paul, Dr. Andrew Huberman, Gary Vee, Lance Armstrong, Sophia Amoruso, Ariel Helwani, Ramit Sethi, Stanley Druckenmiller, Peter Diamandis, Dharmesh Shah, Brian Halligan, Marc Lore, Jason Calacanis, Andrew Wilkinson, Julian Shapiro, Kat Cole, Codie Sanchez, Nader Al-Naji, Steph Smith, Trung Phan, Nick Huber, Anthony Pompliano, Ben Askren, Ramon Van Meer, Brianne Kimmel, Andrew Gazdecki, Scott Belsky, Moiz Ali, Dan Held, Elaine Zelby, Michael Saylor, Ryan Begelman, Jack Butcher, Reed Duchscher, Tai Lopez, Harley Finkelstein, Alexa von Tobel, Noah Kagan, Nick Bare, Greg Isenberg, James Altucher, Randy Hetrick and more. ----- Additional episodes you might enjoy: • #224 Rob Dyrdek - How Tracking Every Second of His Life Took Rob Dyrdek from 0 to $405M in Exits • #209 Gary Vaynerchuk - Why NFTS Are the Future • #178 Balaji Srinivasan - Balaji on How to Fix the Media, Cloud Cities & Crypto • #169 - How One Man Started 5 Billion Dollar Companies, Dan Gilbert's Empire, & Talking With Warren Buffett • #218 - Why You Should Take a Think Week Like Bill Gates • Dave Portnoy vs The World, Extreme Body Monitoring, The Future of Apparel Retail, "How Much is Anthony Pompliano Worth?", and More • How Mr Beast Got 100M Views in Less Than 4 Days, The $25M Chrome Extension, and More

Radiogeek
#Radiogeek - El resumen diario tech en #podcast - Nro 2207

Radiogeek

Play Episode Listen Later Nov 24, 2022 25:15


Lemon Cash updates its app so that users can see where their funds are; plus: #Security – 53% of Argentines are worried about their online security; the Snapdragon 782G is here to replace the Snapdragon 778G+; Elon Musk has hired George Hotz, and much more... Today's topics: #Security – 53% of Argentines are worried about their online security https://infosertecla.com/2022/11/23/seguridad-53-de-los-argentinos-esta-preocupado-que-su-seguridad-en-linea/ Elon Musk could reinstate all previously banned Twitter accounts https://bgr.com/tech/elon-musk-might-reinstate-all-previously-banned-twitter-accounts/ Twitter takes on YouTube and Substack https://bgr.com/tech/twitter-is-taking-on-youtube-and-substack/ The Snapdragon 782G is here to replace the Snapdragon 778G+ https://www.sammobile.com/news/snapdragon-782g-processor-launched/ Elon Musk has hired George Hotz New York takes the lead in regulating cryptocurrencies; the first step has been to ban Bitcoin mining https://www.xataka.com/criptomonedas/nueva-york-toma-iniciativa-regular-criptomonedas-primer-paso-ha-sido-prohibir-mineria-bitcoin The Samsung Galaxy Z Fold3 and Galaxy Z Flip3 receive the Android 13 update with One UI 5 Lemon Cash updates its app so that users can see where their funds are SUPPORT US ON PAYPAL https://www.paypal.me/arielmcorg SUPPORT US ON PATREON https://www.patreon.com/radiogeek SUPPORT US ON CAFECITO https://cafecito.app/radiogeek You can follow me on Twitter @arielmcorg (www.twitter.com/arielmcorg) Also on Instagram @arielmcorg (www.instagram.com/arielmcorg) Join the Telegram channel #Radiogeekpodcast (http://telegram.me/Radiogeekpodcast)

Cyber Security Headlines
Twitter enlists George Hotz, $575 million crypto scheme, DraftKings $300K theft

Cyber Security Headlines

Play Episode Listen Later Nov 23, 2022 7:33


Twitter enlists hacker George Hotz for 12 week “internship” Estonian duo arrested for masterminding $575 million Ponzi scheme Hackers steal $300K from DraftKings customers Thanks to today's episode sponsor, Compyl Preparing a Thanksgiving meal can be stressful, but managing your security and compliance program doesn't have to be. Compyl quickly integrates with the tools you use, and automates 85% of the day-to-day tasks, all while providing complete visibility and comprehensive reporting along the way. Learn about Compyl today at www.compyl.com. For the stories behind the headlines, visit CISOseries.com

Going Live With Good Soil
Macro Market, $TSLA Stock Price, Twitter Updates

Going Live With Good Soil

Play Episode Listen Later Nov 22, 2022 61:08


In this episode of Going Live With Good Soil, Emmet Peppers and Matt Smith discuss updates with the macro market, recent updates on Tesla stock price, FSD Beta usage in the snow, George Hotz joining Twitter, and much more. To learn more about Good Soil: https://goodsoilinvestment.com/ For more updates, follow us on: - Twitter: https://goodsoil.io/twitter - LinkedIn: https://goodsoil.io/linkedin - SumZero: https://goodsoil.io/sumzero DISCLAIMER: https://goodsoil.io/disclaimer

Tesla Daily: Tesla News & Analysis
Former Tesla AI Director Explains Departure, State of FSD, Elon Musk & More + Giga Texas Production (10.31.22)

Tesla Daily: Tesla News & Analysis

Play Episode Listen Later Nov 1, 2022 12:56


➤ Tesla's former director of AI, Andrej Karpathy, discusses his time at Tesla, the decision to leave, the current state of FSD, and much more: https://youtu.be/cdiD-9MMpb0 ➤ Tesla discloses Giga Texas production milestone ➤ FSD Beta update ➤ GM halts Twitter ads ➤ Tesla Semi testing spotted ➤ George Hotz steps down at Comma AI Shareloft: https://www.shareloft.com Twitter: https://www.twitter.com/teslapodcast Patreon: https://www.patreon.com/tesladailypodcast Tesla Referral: https://ts.la/robert47283 Executive producer Jeremy Cooke Executive producer Troy Cherasaro Executive producer Andre/Maria Kent Executive producer Jessie Chimni Executive producer Michael Pastrone Executive producer Richard Del Maestro Executive producer John Beans Music by Evan Schaeffer Disclosure: Rob Maurer is long TSLA stock & derivatives

Briar Patch Observatory
Decentralized Tech - Hotz, Domingos, Jakubowski

Briar Patch Observatory

Play Episode Listen Later Apr 23, 2022 98:05


In our first panel episode, we host hacker George Hotz, CS professor Pedro Domingos and OSE founder Marcin Jakubowski. A lively exchange! Learn what decentralization can and cannot do for the world. Work with Marcin: opensourceecology.org Follow Pedro on Twitter: @pmddomingos Follow George on Twitter: @comma_ai