POPULARITY
Wohoo, hey ya'll, Alex here,I'm back from the desert (pic at the end) and what a great feeling it is to be back in the studio to talk about everything that happened in AI! It's been a pretty full week (or two) in AI, with Coding agent space heating up, Grok entering the ring and taking over free tokens, Codex 10xing usage and Anthropic... well, we'll get to Anthropic. Today on the show we had Roger and Bhavesh from Nous Research cover the awesome Hermes 4 release and the new PokerBots benchmark, then we had a returning favorite, Kwindla Hultman Kramer, to talk about the GA of RealTime voice from OpenAI. Plus we got some massive funding news, some drama with model quality on Claude Code, and some very exciting news right here from CoreWeave aquiring OpenPipe!
HEADLINESEthereum Foundation Pushes to Make Layer 2s Feel Like One ChainM0 Raises $40M to Redefine Stablecoins With Programmable, Application-Specific ModelsChainlink Brings U.S. Government Macroeconomic Data Onchain in Partnership With Commerce DepartmentEliza Labs Sues Musk's X for Antitrust Violations and Copycat AIAave Launches Horizon: Real-World Collateral Meets DeFiLittle BitsUSDT Ports to Bitcoin via RGB for Lightning-Fast PaymentsCircle mints $2.5B USDC in 48 hours – $USDC market cap crosses $70B for the first time everDeFi Lending Surges to Record HighsWHERE TO FIND DCNdailycryptonews.nethttps://twitter.com/DCNDailyCryptoEMAIL or FOLLOW the HostEmail: kyle@dailycryptonews.net*****Magic Newton Wallethttps://magic.linkTrader Cobb X: @TraderCobbhttps://www.thegrowmeco.com/Editing Serviceshttps://www.contentbuck.com——————————————————————***NOT FINANCIAL, LEGAL, OR TAX ADVICE! JUST OPINION! I AM NOT AN EXPERT! I DO NOT GUARANTEE A PARTICULAR OUTCOME I HAVE NO INSIDE KNOWLEDGE! YOU NEED TO DO YOUR OWN RESEARCH AND MAKE YOUR OWN DECISIONS! THIS IS JUST EDUCATION & ENTERTAINMENT! Hosted on Acast. See acast.com/privacy for more information.
US equities were higher in Wednesday trading, with the Dow Jones, S&P500, and Nasdaq closing up 32bps, 24bps, and 21bps respectively. Market was still in waiting mode for Nvidia results after the close and PCE inflation on Friday. NY Fed's Williams telling CNBC monetary policy is moderately restrictive and data could warrant a gradual reduction in rates. Treasury's auction of $70B in 5s saw a 0.7bp tail, though domestic demand was solid. Earnings results included some well-received prints out of the cloud software space and mixed takeaways surrounding the consumer-facing names.
Welcome back to another episode of Upside at the EUVC Podcast, where Dan Bowyer, Mads Jensen of SuperSeed and Lomax from Outsized Ventures unpack what's happening in European tech and venture capital.This week: Why Meta and Microsoft are minting cash from AI, what Figma's IPO signals for SaaS, whether the EU got rolled in its new trade deal with the US, and how Europe's AI scene is finally delivering billion‑dollar exits. Plus: OpenAI's new “Study Mode” and Harry Stebbings' Project Europe—an “anti‑YC” deep‑tech accelerator for founders under 25.
自从 ChatGPT 横空出世,几乎所有关于大模型的讨论都离不开 Transformer,那 Transformer 架构也支撑了这一轮生成式 AI 的快速发展。然而在 Transformer 架构的背后,行业也遇到了难以回避的瓶颈:推理和训练成本居高不下,长上下文能力依赖庞大的显存和算力,端侧部署和商业落地困难。Transformer 的困境让神经网络的另一条路径重新被审视——那就是RNN,循环神经网络。 今天我们请到的嘉宾,是元始智能的联合创始人和 COO 罗璇。他与另一位创始人彭博一起持续的探索基于循环神经网络的可扩展架构 RWKV。RWKV 架构能否在 Transformer 面临的核心问题上提供一种替代方案?新的架构是否给端侧模型的发展带来更多更大的机会?今天我们将和罗璇一起,从底层架构的设计出发,聊聊 RWKV 的可扩展性、 下一代大模型可能的走向,以及端侧 AI 的机会与未来。 本期人物 罗璇,元始智能联合创始人兼 COO Yaxian,「科技早知道」主播 主要话题 [03:30] 训练效率低、Scaling law 见顶,成本高昂,Transformer 的瓶颈催生新架构的探索 [08:15] 高效并行、低复杂度,易端侧部署,RWKV 为 Transformer 提供了可替代方案 [13:24] 新型 RNN 与 Attention 混合模型就像油电混动车,但纯电才是大模型的未来 [17:07] 大厂押注新架构:基于 RWKV 架构的模型已达到 70B 激活参数 [23:47] 突破算力、内存和功耗限制,RWKV 天生适合端侧部署 [26:24] 未来 80% 的 AI 计算将在端侧,巨头尚未涉足的增量市场才是创业公司的机会 [32:35] 端侧机会有哪些?空间计算或是下一个风口 [38:20] RWKV 的 「ChatGPT」时刻将至:新架构对 AGI 的实现必不可少 延伸阅读 RNN(Recurrent Neural Network) 即循环神经网络,是一类专为处理序列数据设计的深度学习架构。它的核心机制是「循环」:当前时刻的输出不仅依赖于当前输入,还受到上一个时刻隐藏状态的影响,因此 RNN 具备记忆历史信息的能力。但经典的 RNN 也存在梯度消失/梯度爆炸、训练难以并行化和难以扩展至大模型规模等问题。RWKV 是一种结合 RNN 和 Transformer 优势的神经网络架构。 Mamba 架构 是一个专为高效处理长文本而设计的线性时间复杂度模型架构,它通过状态空间模型(State Space Model, SSM)实现类似 RNN 的信息传递方式,但比传统 RNN 更强、比 Transformer 更快。 LSTM(Long Short-Term Memory) 是一种改进版的 RNN 架构,全称为 「长短期记忆网络」。是一种具有“记忆控制能力”的循环神经网络,能够有效建模长期依赖关系,是 RNN 在深度学习时代的关键进化版本。 MoE 模型 MoE(Mixture of Experts,专家混合模型)是一种通过多个子网络(专家)组成的架构,每次仅激活其中一部分以提升计算效率与模型容量。它通过「按需使用」不同专家,实现高效推理与更强的任务适应能力。 XR(Extended Reality) 指扩展现实,是虚拟现实(VR)、增强现实(AR)和混合现实(MR)的统称,用于描述融合现实与数字内容的交互体验。 幕后制作 监制:Yaxian 后期:迪卡 运营:George 设计:饭团 商业合作 声动活泼商业化小队,点击链接直达声动商务会客厅 (https://sourl.cn/9h28kj),也可发送邮件至 business@shengfm.cn 联系我们。
S&P Futures are positive this morning as market react to the latest trade developments. Nivida appears to have the green light to ship its H20 chip to China. President Trump indicates a wiliness to discuss tariff rates with the E.U. President Trump will be in Pennsylvania today and is expected announce a $70B investment in AI and Energy. Before the bell today is the June inflation data as the CPI data is due out. TTD gains on its inclusion to the S&P500. On the earnings front, JPM, BK & WFC are higher after earnings beats. Tomorrow morning, JNJ, BAC, MS & GS will be reporting.
US equities finished mixed today, with the Dow Jones down 25bps, the S&P500 flat, and the Nasdaq rising 31bps. May new home sales posted a big miss, lowest since October, and weakest May print since 2019. Fed Chair Powell delivered his second day of monetary policy testimony commenting future trade deals may allow the Fed to consider rate cuts. Fed also announced its proposed SLR changes. Today's $70B 5-year note auction stopped through 0.5 bp.
Walking through the 10 most important aspects of Trump's Big, Beautiful Bill. I cover the aspects I believe are beneficial as well as the biggest downside. A concerning increase in the national debt...0:00 Introduction0:10 Makes 2017 Tax Cuts Permanent1:14 MAGA Savings Accounts For Babies2:00 Stricter Work Requirements for Medicaid2:16 Tax-Free Tips, OT Pay & Car Loan Interest3:37 $70B for Security Border3:54 $150B Boost in Military Spending4:38 Ends Clean Energy Incentives5:44 SALT Cap Raised from $10k - $40k6:40 Judicial Oversight Limited6:55 Adds $4 Trillion to National DebtWant a Life Insurance Policy? Go Here: https://bttr.ly/bw-yt-aa-clarity Want FREE Whole Life Insurance Resources & Education? Go Here: https://bttr.ly/yt-bw-vault______________________________________________ Learn More About BetterWealth: https://betterwealth.comDISCLAIMER: https://bttr.ly/aapolicy*This video is for entertainment purposes only and is not financial or legal advice.Financial Advice Disclaimer: All content on this channel is for education, discussion, and illustrative purposes only and should not be construed as professional financial advice or recommendation. Should you need such advice, consult a licensed financial or tax advisor. No guarantee is given regarding the accuracy of the information on this channel. Neither host nor guests can be held responsible for any direct or incidental loss incurred by applying any of the information offered.
Startups in the DACH region are heating up — from AI unicorns to VC mega-funds. In this monthly startup news roundup, we unpack the most important trends in Germany, Austria, and Switzerland's tech ecosystem. Learn how Parloa became Germany's second unicorn of the year, how Quantum Systems secured €160M to dominate aerial AI, and how the EIB's €70B initiative is reshaping venture capital in Europe. We also dive into Lyft's FreeNow acquisition, biotech patent wars, the MiCA crypto license, and regulatory risks like the EU Data Act.
Welcome back to the Fintech Takes podcast. I'm your host, Alex Johnson, and today we're digging into one of the most urgent (and underdiscussed) financial issues in America: gambling. My guest is Alex DeMarco, founder and CEO of MoneyStack, who's helping reframe gambling addiction not just as a behavioral health issue, but as a financial systems crisis. Since 2018, when the Supreme Court cracked open the door to state-by-state legalization of mobile sports betting, we've seen a gold rush in gambling. Operators are now pulling in more than $70B annually (ads are everywhere, the apps are engineered for nonstop engagement, and the harm is rising fast). In NJ, one of the first states to legalize, 6% of adults are already experiencing moderate to severe gambling-related issues (double the national average). We connect the dots between gambling and familiar fintech business models: the same behavioral nudges, same VIP economics, the same revenue dependence on a vulnerable sliver of power users. If overdraft fees and gamified trading feel predatory, this is that (but on steroids). We unpack: Why sports betting apps now hold three times more per wager than old-school sportsbooks How engagement tactics mimic (and often outstrip) the most addictive elements of gamified finance Why we're watching investing and gambling blur into one screen (and one behavior) What proactive financial intervention might look like, and why most help comes too late How banks and fintechs can step up (detecting risk early, training advisors, and supporting families in recovery) We close with this big question: when gambling is mobile, funded from a checking account, and styled like Robinhood … can the financial industry really say it's not their problem? This episode is brought to you by: Newline™ by Fifth Third is an innovative, API-first platform that enables fintechs to launch embedded payment, card and deposit solutions directly with Fifth Third Bank. Visit Newline53.com to see how Newline can elevate your business. The world needs MoR. With Paddle as your Merchant of Record (MoR), the global growth is yours. The risk, compliance and accountability are ours. Simple. Paddle offers all the benefits of an enterprise-grade billing system but with MoR flexibility, MoR control, and MoR focus on your core product. Visit paddle.com to learn more. Sign up for Alex's Fintech Takes newsletter for the latest insightful analysis on fintech trends, along with a heaping pile of pop culture references and copious footnotes. Every Monday and Thursday: https://workweek.com/brand/fintech-takes/ And for more exclusive insider content, don't forget to check out my YouTube page. Follow Alex (DeMarco): LinkedIn: https://www.linkedin.com/in/alexdemarco/ MoneyStack: https://www.linkedin.com/company/moneystack/ Follow Alex (Johnson): YouTube: https://www.youtube.com/channel/UCJgfH47QEwbQmkQlz1V9rQA/videos LinkedIn: https://www.linkedin.com/in/alexhjohnson X: https://www.twitter.com/AlexH_Johnson
From building a treasury team from scratch with just a laptop to managing billions in assets and navigating four major corporate deals. This week on the podcast treasury leader Mike Tackley reveals what it really takes to scale treasury functions that drive business growth, earn stakeholder trust, and withstand economic turbulence.Mike Tackley, an experienced finance leader and most recently the Group Treasurer at Harbour Energy, shares the unconventional path that led him from night school and brown paper bags full of receipts to leading treasury strategy at one of the UK's largest independent oil and gas companies. With decades of experience including a key role at BG Group before its $70B acquisition by Shell, Mike offers an insider's perspective on treasury leadership during times of rapid scale and transformation.Main topics discussed:Mike's unconventional entry into finance and treasuryHis pivotal career move from auditing to BG Group, and why internal controls were his gateway into treasuryHow Sarbanes-Oxley changed corporate treasury structures and complianceSetting up middle, back, and front office treasury functions from the ground upLessons learned managing credit risk during the 2008 financial crisisBuilding Harbour Energy's treasury team post-BG and launching operations with no legacy systemsThe strategic use of Reserve Based Lending (RBL) to finance acquisitionsTreasury's evolving role in M&A deals, cash flow forecasting, and regulatory reportingIntegrating treasury with the broader business through stakeholder educationTips for growing and hiring a high-performance treasury teamYou can connect with Mike Tackley on LinkedIn. ---
AI and earnings took center stage this week. Perplexity's building a browser, OpenAI wants to be your shopping assistant, and Big Tech dropped their Q1 numbers. We cover: Perplexity builds a browser – CEO Aravind Srinivas explains how Comet aims to work at the operating system level, enabling actions like scraping pages, taking actions on your behalf, and improving ad targeting with granular user data. Hardware partnerships fuel expansion – Perplexity will be pre-installed on new Motorola Razrs and is in talks with Samsung. This move mirrors Google's own bundling tactics… just as Google faces antitrust heat. OpenAI launches shoppable search – You can now shop directly within ChatGPT results, with comparisons, recommendations, and product discovery baked into the interface. Shopify integration is rumored, but no ads—for now. The web is changing – In this new model, websites aren't for browsing. They're for bots to scrape, analyze, and feed insights back to the user in one seamless chat interface. Apple earnings – $95.4B in revenue, beating expectations. Hardware ticked up slightly, but services brought in $26.65B, growing 11.65% YoY. Meta earnings – $42.3B revenue (+16%), driven by ads. Ad impressions rose 5%, pricing jumped 10%. Meta boosted AI capex projections to $64–$72B. Alphabet earnings – $90.2B in revenue (+12%), strong performance in search. YouTube ads slightly missed projections. Tariff pressures are hitting Google Shopping spend. Amazon earnings – $29.3B in ad revenue, right on target, but a gloomy Q2 forecast due to tariffs and consumer uncertainty dragged stock down 4%. Microsoft earnings – $70B revenue (+13%), thanks to resilient software and cloud services. Hardware may take a hit, but they're better positioned than most to handle turbulence. Learn more about your ad choices. Visit megaphone.fm/adchoices
Kevin Green and Jeff Pierce provide live reaction to Alphabet (GOOGL) earnings as shares initially pop higher on a top and bottom line beat. Jeff puts the Google Cloud revenue into perspective as it added nearly $3.5b in Y/Y growth. Kevin examines the company's decision to authorize a buyback up to $70B. Then, the duo dive into Intel (INTC) earnings after the company announced plans to eliminate management layers and cut an unidentified amount of jobs in 2Q. The chipmaker slashed its capex spend under new CEO Lip-Bu Tan and provided lower than expected guidance for its 2Q.======== Schwab Network ========Empowering every investor and trader, every market day. Subscribe to the Market Minute newsletter - https://schwabnetwork.com/subscribeDownload the iOS app - https://apps.apple.com/us/app/schwab-network/id1460719185Download the Amazon Fire Tv App - https://www.amazon.com/TD-Ameritrade-Network/dp/B07KRD76C7Watch on Sling - https://watch.sling.com/1/asset/191928615bd8d47686f94682aefaa007/watchWatch on Vizio - https://www.vizio.com/en/watchfreeplus-exploreWatch on DistroTV - https://www.distro.tv/live/schwab-network/Follow us on X – https://twitter.com/schwabnetworkFollow us on Facebook – https://www.facebook.com/schwabnetworkFollow us on LinkedIn - https://www.linkedin.com/company/schwab-network/ About Schwab Network - https://schwabnetwork.com/about
Welcome back to The Gwart Show! Today, Jarry Xiao, co-founder and engineering lead at Ellipsis Labs, joins to chat tradfi to crypto infrastructure. Jarry dives deep into Phoenix, their central limit order book on Solana, and introduces Atlas - their purpose-built blockchain for financial applications. The conversation touches blockchain trading challenges, the economics of validator networks, and why truly effective trading systems require opinionated design. Follow our guest on Twitter! https://x.com/jarxiao?lang=en Subscribe to the newsletter! https://newsletter.blockspacemedia.com Notes: - Phoenix did $70B+ in organic trading volume - Solana's 400ms blocks still too slow for HFT - Atlas uses single verifiable sequencer model - Market makers need priority cancellations - Blockchain revenue sustainability questionable - MEV/sandwiching not sustainable for finance Timestamps: 00:00 Start 00:33 Who is Jarray? 01:09 TradFi experience 01:50 What was Jarry trading? 03:23 Phoenix order book 06:45 Aggregators 10:31 Open validator set blockchain 13:15 Market making on SOL 17:05 Atlas 20:57 Limitations of SOL 25:02 Rejecting SOL base assumptions 28:09 What does Atlas unlock? 34:46 Purpose built? 36:31 Opinion sequencing 37:53 Multiple use cases 42:37 Stock market on-chain 46:53 Why L2 sequencer design? 54:11 Memecoins here to stay 1:04:39 Why did it take so long?
Hey Folks, Alex here, celebrating an absolutely crazy (to me) milestone, of #100 episodes of ThursdAI
US equities finished lower in Wednesday trading, though ended a bit off worst levels, with the Dow Jones, S&P500, and Nasdaq closing down 31bps, 112bps, and 204bps respectively. Big area of scrutiny today has been tech weakness, with negative AI headlines, trade, and technicals among the areas of blame. Durable goods orders beat; but core capital goods orders posted a surprise contraction, while core capital goods shipments came in ahead. Fed's Kashkari called for an extended hold. Treasury's auction of $70B in 5-year notes tailed by 0.5bp.
Ejaaz is back with David to unpack the whirlwind effects of President Trump's $70B memecoin launch, which slammed 400K fresh users onto Solana and shattered revenue records in under 30 hours. Meanwhile, a $500B federal boost to AI—backed by OpenAI and SoftBank—has everyone on edge about how crypto's AI agent economy might ride this newfound momentum. On the protocol front, Arc's Rake framework and AI16z's curated Eliza ecosystem are both stepping up, incubating projects like SoulGraph and Listen that promise a more user-friendly DeFAI future. Virtuals keeps shipping at breakneck speed too, testing new multi-chain waters, instituting a bold buyback, and hinting at metaverse collabs that could reshape gaming for good. From automated DeFi positions and trading dashboards to fully autonomous NPCs that commentate (and maybe soon play) your favorite games, the lines between AI and crypto are getting blurrier—and more exciting—by the day. Buckle up as the AI agent revolution continues to redefine what “on-chain” can really mean. Ready for the ride? ------ BANKLESS SPONSOR TOOLS:
In this explosive episode, we dive deep into two transformative developments in crypto: the unprecedented launch of Trump's presidential memecoin and the innovative tokenization of Colombian coffee trade. Our guest ApeDude breaks down how Trump's $70B token launch just days before inauguration has fundamentally changed crypto's legitimacy and explores what this means for the future of political fundraising, government adoption, and mainstream acceptance.We then explore Real World Arabica's mission to revolutionize the $5B Colombian coffee industry through blockchain technology, making premium coffee more accessible while improving conditions for farmers.Key Topics:The implications of the first presidential memecoinHow Trump's token changes crypto's legitimacy foreverThe future of political fundraising through tokenizationBringing Colombian specialty coffee to the blockchainCreating crypto-friendly farmer associationsThe next generation of crypto-native commerceWhether you're a crypto enthusiast, political observer, or coffee lover, this episode offers unique insights into how blockchain technology is reshaping both politics and traditional commerce.Guest: Ape Dude (@rwarabica) Host: Thomas Bahamas (@thomasbahamasfi)Note: This episode contains several discussions about cryptocurrency and politics. Nothing in this episode constitutes financial advice. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit thomasbahamas.substack.com
Welcome to The Chopping Block – where crypto insiders Haseeb Qureshi, Tom Schmidt, Tarun Chitra, and Robert Leshner chop it up about the latest in crypto. In this episode, we dive into one of the wildest weeks in crypto history, unpacking the unprecedented launch of Trump's memecoin and its explosive $70 billion valuation. We discuss how it broke every record imaginable, the fallout from Melania's token, and the chaos it unleashed across the Solana network. From the ethics of a sitting president launching a coin to the implications for retail investors and the future of memecoins, we're covering it all. Plus, we debate whether this marks a new chapter for crypto adoption or a troubling precedent for political grift. Stay tuned for this jam-packed emergency episode! Show highlights
Arnaud et Emmanuel discutent des versions Java, font un résumé de l'ecosystème WebAssembly, discutent du nouveau Model Context Protocol, parlent d'observabilité avec notamment les Wide Events et de pleins d'autres choses encore. Enregistré le 17 janvier 2025 Téléchargement de l'épisode LesCastCodeurs-Episode–321.mp3 ou en vidéo sur YouTube. News Langages java trend par InfoQ https://www.infoq.com/articles/java-trends-report–2024/ Java 17 finalement depasse 11 et 8 ~30/33% Java 21 est à 1.4% commonhaus apparait GraalVM en early majority Spring AI et langchain4j en innovateurs SB 3 voit son adoption augmenter Un bon résumé sur WebAssembly, les différentes specs comme WASM GC, WASI, WIT, etc https://2ality.com/2025/01/webassembly-language-ecosystem.html WebAssembly (Wasm) est un format d'instructions binaires pour une machine virtuelle basée sur une pile, permettant la portabilité et l'efficacité du code. Wasm a évolué à partir d'asm.js, un sous-ensemble de JavaScript qui pouvait fonctionner à des vitesses proches de celles natives. WASI (WebAssembly System Interface) permet à Wasm de fonctionner en dehors des navigateurs Web, fournissant des API pour le système de fichiers, CLI, HTTP, etc. Le modèle de composant WebAssembly permet l'interopérabilité entre les langages Wasm à l'aide de WIT (Wasm Interface Type) et d'ABI canonique. Les composants Wasm se composent d'un module central et d'interfaces WIT pour les importations/exportations, facilitant l'interaction indépendante du langage. Les interfaces WIT décrivent les types et les fonctions, tandis que les mondes WIT définissent les capacités et les besoins d'un composant (importations/exportations). La gestion des packages Wasm est assurée par Warg, un protocole pour les registres de packages Wasm. Une enquête a montré que Rust est le langage Wasm le plus utilisé, suivi de Kotlin et de C++; de nombreux autres langages sont également en train d'émerger. Un algorithme de comptage a taille limitée ne mémoire a été inventé https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count–20240516/ élimine un mot de manière aléatoire mais avec une probabilité connue quand il y a besoin de récupérer de l'espace cela se fait par round et on augmente la probabilité de suppression à chaque round donc au final, ne nombre de mots / la probabilité d'avoir été éliminé donne une mesure approximative mais plutot précise Librairies Les contributions Spring passent du CLA au DCO https://spring.io/blog/2025/01/06/hello-dco-goodbye-cla-simplifying-contributions-to-spring d'abord manuel amis meme automatisé le CLA est une document legal complexe qui peut limiter les contribuitions le DCO vient le Linux je crois et est super simple accord que la licence de la conmtrib est celle du projet accord que le code est public et distribué en perpetuité s'appuie sur les -s de git pour le sign off Ecrire un serveur MCP en Quarkus https://quarkus.io/blog/mcp-server/ MCP est un protocol proposé paor Antropic pour integrer des outils orchestrables par les LLMs MCP est frais et va plus loin que les outils offre la notion de resource (file), de functions (tools), et de proimpts pre-built pour appeler l'outil de la meilleure façon On en reparlera a pres avec les agent dans un article suivant il y a une extension Quarkus pour simplifier le codage un article plus detaillé sur l'integration Quarkus https://quarkus.io/blog/quarkus-langchain4j-mcp/ GreenMail un mini mail server en java https://greenmail-mail-test.github.io/greenmail/#features-api Utile pour les tests d'integration Supporte SMTP, POP3 et IMAP avec TLS/SSL Propose des integrations JUnit, Spring Une mini UI et des APIs REST permettent d'interagir avec le serveur si par exemple vous le partagé dans un container (il n'y a pas d'integration TestContainer existante mais elle n'est pas compliquée à écrire) Infrastructure Docker Bake in a visual way https://dev.to/aurelievache/understanding-docker-part–47-docker-bake–4p05 docker back propose d'utiliser des fichiers de configuration (format HCL) pour lancer ses builds d'images et docker compose en gros voyez ce DSL comme un Makefile très simplifié pour les commandes docker qui souvent peuvent avoir un peu trop de paramètres Datadog continue de s'etendre avec l'acquisition de Quickwit https://www.datadoghq.com/blog/datadog-acquires-quickwit/ Solution open-source de recherche des logs qui peut être déployée on-premise et dans le cloud https://quickwit.io/ Les logs ne quittent plus votre environment ce qui permet de répondre à des besoins de sécurité, privacy et réglementaire Web 33 concepts en javascript https://github.com/leonardomso/33-js-concepts Call Stack, Primitive Types, Value Types and Reference Types, Implicit, Explicit, Nominal, Structuring and Duck Typing, == vs === vs typeof, Function Scope, Block Scope and Lexical Scope, Expression vs Statement, IIFE, Modules and Namespaces, Message Queue and Event Loop, setTimeout, setInterval and requestAnimationFrame, JavaScript Engines, Bitwise Operators, Type Arrays and Array Buffers, DOM and Layout Trees, Factories and Classes, this, call, apply and bind, new, Constructor, instanceof and Instances, Prototype Inheritance and Prototype Chain, Object.create and Object.assign, map, reduce, filter, Pure Functions, Side Effects, State Mutation and Event Propagation, Closures, High Order Functions, Recursion, Collections and Generators, Promises, async/await, Data Structures, Expensive Operation and Big O Notation, Algorithms, Inheritance, Polymorphism and Code Reuse, Design Patterns, Partial Applications, Currying, Compose and Pipe, Clean Code Data et Intelligence Artificielle Phi 4 et les small language models https://techcommunity.microsoft.com/blog/aiplatformblog/introducing-phi–4-microsoft%e2%80%99s-newest-small-language-model-specializing-in-comple/4357090 Phi 4 un SML pour les usages locaux notamment 14B de parametres belle progression de ~20 points sur un score aggregé et qui le rapproche de Llama 3.3 et ses 70B de parametres bon en math (data set synthétique) Comment utiliser Gemini 2.0 Flash Thinking (le modèle de Google qui fait du raisonnement à la sauce chain of thought) en Java avec LangChain4j https://glaforge.dev/posts/2024/12/20/lets-think-with-gemini–2-thinking-mode-and-langchain4j/ Google a sorti Gemini 2.0 Flash, un petit modèle de la famille Gemini the “thinking mode” simule les cheminements de pensée (Chain of thoughts etc) décompose beaucoup plus les taches coplexes en plusiewurs taches un exemple est montré sur le modele se battant avec le probleme Les recommendations d'Antropic sur les systèmes d'agents https://www.anthropic.com/research/building-effective-agents défini les agents et les workflow Ne recommence pas les frameworks (LangChain, Amazon Bedrock AI Agent etc) le fameux débat sur l'abstraction Beaucoup de patterns implementable avec quelques lignes sans frameworks Plusieurs blocks de complexité croissante Augmented LLM (RAG, memory etc): Anthropic dit que les LLMs savent coordonner cela via MCP apr exemple Second: workflow prompt chaining : avec des gates et appelle les LLMs savent coordonner successivement ; favorise la precision vs la latence vu que les taches sont décomposées en plusieurs calls LLMs Workflow routing: classifie une entree et choisie la route a meilleure: separation de responsabilité Workflow : parallelisation: LLM travaillent en paralllele sur une tache et un aggregateur fait la synthèse. Paralleisaiton avec saucissonage de la tache ou voter sur le meilleur réponse Workflow : orchestrator workers: quand les taches ne sont pas bounded ou connues (genre le nombre de fichiers de code à changer) - les sous taches ne sont pas prédéfinies Workflow: evaluator optimizer: nun LLM propose une réponse, un LLM l'évalue et demande une meilleure réponse au besoin Agents: commande ou interaction avec l;humain puis autonome meme si il peut revenir demander des precisions à l'humain. Agents sont souvent des LLM utilisât des outil pour modifier l'environnement et réagir a feedback en boucle Ideal pour les problèmes ouverts et ou le nombre d'étapes n'est pas connu Recommende d'y aller avec une complexité progressive L'IA c'est pas donné https://techcrunch.com/2025/01/05/openai-is-losing-money-on-its-pricey-chatgpt-pro-plan-ceo-sam-altman-says/ OpenAI annonce que même avec des licenses à 200$/mois ils ne couvrent pas leurs couts associés… A quand l'explosion de la bulle IA ? Outillage Ghostty, un nouveau terminal pour Linux et macOS : https://ghostty.org/ Initié par Mitchell Hashimoto (hashicorp) Ghostty est un émulateur de terminal natif pour macOS et Linux. Il est écrit en Swift et utilise AppKit et SwiftUI sur macOS, et en Zig et utilise l'API GTK4 C sur Linux. Il utilise des composants d'interface utilisateur native et des raccourcis clavier et souris standard. Il prend en charge Quick Look, Force Touch et d'autres fonctionnalités spécifiques à macOS. Ghostty essaie de fournir un ensemble riche de fonctionnalités utiles pour un usage quotidien. Comment Pinterest utilise Honeycomb pour améliorer sa CI https://medium.com/pinterest-engineering/how-pinterest-leverages-honeycomb-to-enhance-ci-observability-and-improve-ci-build-stability–15eede563d75 Pinterest utilise Honeycomb pour améliorer l'observabilité de l'intégration continue (CI). Honeycomb permet à Pinterest de visualiser les métriques de build, d'analyser les tendances et de prendre des décisions basées sur les données. Honeycomb aide également Pinterest à identifier les causes potentielles des échecs de build et à rationaliser les tâches d'astreinte. Honeycomb peut également être utilisé pour suivre les métriques de build locales iOS aux côtés des détails de la machine, ce qui aide Pinterest à prioriser les mises à niveau des ordinateurs portables pour les développeurs. Méthodologies Suite à notre épisode sur les différents types de documentation, cet article parle des bonnes pratiques à suivre pour les tutoriels https://refactoringenglish.com/chapters/rules-for-software-tutorials/ Écrivez des tutoriels pour les débutants, en évitant le jargon et la terminologie complexe. Promettez un résultat clair dans le titre et expliquez l'objectif dans l'introduction. Montrez le résultat final tôt pour réduire les ambiguïtés. Rendez les extraits de code copiables et collables, en évitant les invites de shell et les commandes interactives. Utilisez les versions longues des indicateurs de ligne de commande pour plus de clarté. Séparez les valeurs définies par l'utilisateur de la logique réutilisable à l'aide de variables d'environnement ou de constantes nommées. Épargnez au lecteur les tâches inutiles en utilisant des scripts. Laissez les ordinateurs évaluer la logique conditionnelle, pas le lecteur. Maintenez le code en état de fonctionnement tout au long du tutoriel. Enseignez une chose par tutoriel et minimisez les dépendances. Les Wide events, un “nouveau” concept en observabilité https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/ un autre article https://isburmistrov.substack.com/p/all-you-need-is-wide-events-not-metrics L'idée est de logger des evenements (genre JSON log) avec le plus d'infos possible de la machine, la ram, la versiond e l'appli, l'utilisateur, le numero de build qui a produit l'appli, la derniere PR etc etc ca permet de filtrer et grouper by et de voir des correlations visuelles tres rapidement et de zoomer tiens les ventes baisses de 20% tiens en fait ca vient de l'appli andriod tiens aps correle a la version de l'appli mais la version de l'os si! le deuxieme article est facile a lire le premier est un guide d'usage exhaustif du concept Entre argumenter et se donner 5 minutes https://signalvnoise.com/posts/3124-give-it-five-minutes on veut souvent argumenter aka poser des questions en ayant déjà la reponse en soi emotionnellement mais ca amene beaucoup de verbiage donner 5 minutes à l'idée le temps d'y penser avant d'argumenter Loi, société et organisation Des juges fédéraux arrêtent le principe de la neutralité du net https://www.lemonde.fr/pixels/article/2025/01/03/les-etats-unis-reviennent-en-arriere-sur-le-principe-de-la-neutralite-du-net_6479575_4408996.html?lmd_medium=al&lmd_campaign=envoye-par-appli&lmd_creation=ios&lmd_source=default la neutralité du net c'est l'interdiction de traiter un paquet différemment en fonction de son émetteur Par exemple un paquet Netflix qui serait ralenti vs un paquet Amazon Donald trump est contre cette neutralité. À voir les impacts concrets dans un marché moins régulé. Rubrique débutant Un petit article sur les float vs les double en Java https://www.baeldung.com/java-float-vs-double 4 vs 8 bytes precision max de 7 vs 15 echele 10^38 vs 10^308 (ordre de grandeur) perf a peu pret similaire sauf peut etre pour des modeles d'IA qui vont privilegier une taille plus petite parfois attention overflow et les accumulation d'erreurs d'approximation BigDecimal Conférences La liste des conférences provenant de Developers Conferences Agenda/List par Aurélie Vache et contributeurs : 20 janvier 2025 : Elastic{ON} - Paris (France) 22–25 janvier 2025 : SnowCamp 2025 - Grenoble (France) 24–25 janvier 2025 : Agile Games Île-de-France 2025 - Paris (France) 6–7 février 2025 : Touraine Tech - Tours (France) 21 février 2025 : LyonJS 100 - Lyon (France) 28 février 2025 : Paris TS La Conf - Paris (France) 6 mars 2025 : DevCon #24 : 100% IA - Paris (France) 13 mars 2025 : Oracle CloudWorld Tour Paris - Paris (France) 14 mars 2025 : Rust In Paris 2025 - Paris (France) 19–21 mars 2025 : React Paris - Paris (France) 20 mars 2025 : PGDay Paris - Paris (France) 20–21 mars 2025 : Agile Niort - Niort (France) 25 mars 2025 : ParisTestConf - Paris (France) 26–29 mars 2025 : JChateau Unconference 2025 - Cour-Cheverny (France) 27–28 mars 2025 : SymfonyLive Paris 2025 - Paris (France) 28 mars 2025 : DataDays - Lille (France) 28–29 mars 2025 : Agile Games France 2025 - Lille (France) 3 avril 2025 : DotJS - Paris (France) 3 avril 2025 : SoCraTes Rennes 2025 - Rennes (France) 4 avril 2025 : Flutter Connection 2025 - Paris (France) 10–11 avril 2025 : Android Makers - Montrouge (France) 10–12 avril 2025 : Devoxx Greece - Athens (Greece) 16–18 avril 2025 : Devoxx France - Paris (France) 23–25 avril 2025 : MODERN ENDPOINT MANAGEMENT EMEA SUMMIT 2025 - Paris (France) 24 avril 2025 : IA Data Day 2025 - Strasbourg (France) 29–30 avril 2025 : MixIT - Lyon (France) 7–9 mai 2025 : Devoxx UK - London (UK) 15 mai 2025 : Cloud Toulouse - Toulouse (France) 16 mai 2025 : AFUP Day 2025 Lille - Lille (France) 16 mai 2025 : AFUP Day 2025 Lyon - Lyon (France) 16 mai 2025 : AFUP Day 2025 Poitiers - Poitiers (France) 24 mai 2025 : Polycloud - Montpellier (France) 5–6 juin 2025 : AlpesCraft - Grenoble (France) 5–6 juin 2025 : Devquest 2025 - Niort (France) 11–13 juin 2025 : Devoxx Poland - Krakow (Poland) 12–13 juin 2025 : Agile Tour Toulouse - Toulouse (France) 12–13 juin 2025 : DevLille - Lille (France) 17 juin 2025 : Mobilis In Mobile - Nantes (France) 24 juin 2025 : WAX 2025 - Aix-en-Provence (France) 25–27 juin 2025 : BreizhCamp 2025 - Rennes (France) 26–27 juin 2025 : Sunny Tech - Montpellier (France) 1–4 juillet 2025 : Open edX Conference - 2025 - Palaiseau (France) 7–9 juillet 2025 : Riviera DEV 2025 - Sophia Antipolis (France) 18–19 septembre 2025 : API Platform Conference - Lille (France) & Online 2–3 octobre 2025 : Volcamp - Clermont-Ferrand (France) 6–10 octobre 2025 : Devoxx Belgium - Antwerp (Belgium) 9–10 octobre 2025 : Forum PHP 2025 - Marne-la-Vallée (France) 16–17 octobre 2025 : DevFest Nantes - Nantes (France) 4–7 novembre 2025 : NewCrafts 2025 - Paris (France) 6 novembre 2025 : dotAI 2025 - Paris (France) 7 novembre 2025 : BDX I/O - Bordeaux (France) 12–14 novembre 2025 : Devoxx Morocco - Marrakech (Morocco) 23–25 avril 2026 : Devoxx Greece - Athens (Greece) 17 juin 2026 : Devoxx Poland - Krakow (Poland) Nous contacter Pour réagir à cet épisode, venez discuter sur le groupe Google https://groups.google.com/group/lescastcodeurs Contactez-nous via X/twitter https://twitter.com/lescastcodeurs ou Bluesky https://bsky.app/profile/lescastcodeurs.com Faire un crowdcast ou une crowdquestion Soutenez Les Cast Codeurs sur Patreon https://www.patreon.com/LesCastCodeurs Tous les épisodes et toutes les infos sur https://lescastcodeurs.com/
This episode is sponsored by RapidSOS. Close the safety gap and transform your emergency response with RapidSOS. Visit https://rapidsos.com/eyeonai/ today to learn how AI-powered safety can protect your people and boost your bottom line. In this episode of the Eye on AI podcast, we explore the world of AI inference technology with Rodrigo Liang, co-founder and CEO of SambaNova Systems. Rodrigo shares his journey from high-performance chip design to building SambaNova, a company revolutionizing how enterprises leverage AI through scalable, power-efficient solutions. We dive into SambaNova's groundbreaking achievements, including their record-breaking inference models, the Lama 405B and 70B, which deliver unparalleled speed and accuracy—all on a single rack consuming less than 10 kilowatts of power. Throughout the conversation, Rodrigo highlights the seismic shift from AI training to inference, explaining why production AI is now about speed, efficiency, and real-time applications. He details SambaNova's approach to open-source models, modular deployment, and multi-tenancy, enabling enterprises to scale AI without costly infrastructure overhauls. We also discuss the competitive landscape of AI hardware, the challenges of NVIDIA's dominance, and how SambaNova is paving the way for a new era of AI innovation. Rodrigo explains the critical importance of power efficiency and how SambaNova's technology is unlocking opportunities for enterprises to deploy private, secure AI systems on-premises and in the cloud. Discover how SambaNova is redefining AI for enterprise adoption, enabling real-time AI, and setting new standards in efficiency and scalability. Don't forget to like, subscribe, and hit the notification bell to stay updated on the latest breakthroughs in AI, technology, and enterprise innovation! Stay Updated: Craig Smith Twitter: https://twitter.com/craigss Eye on A.I. Twitter: https://twitter.com/EyeOn_AI
Applications for the 2025 AI Engineer Summit are up, and you can save the date for AIE Singapore in April and AIE World's Fair 2025 in June.Happy new year, and thanks for 100 great episodes! Please let us know what you want to see/hear for the next 100!Full YouTube Episode with Slides/ChartsLike and subscribe and hit that bell to get notifs!Timestamps* 00:00 Welcome to the 100th Episode!* 00:19 Reflecting on the Journey* 00:47 AI Engineering: The Rise and Impact* 03:15 Latent Space Live and AI Conferences* 09:44 The Competitive AI Landscape* 21:45 Synthetic Data and Future Trends* 35:53 Creative Writing with AI* 36:12 Legal and Ethical Issues in AI* 38:18 The Data War: GPU Poor vs. GPU Rich* 39:12 The Rise of GPU Ultra Rich* 40:47 Emerging Trends in AI Models* 45:31 The Multi-Modality War* 01:05:31 The Future of AI Benchmarks* 01:13:17 Pionote and Frontier Models* 01:13:47 Niche Models and Base Models* 01:14:30 State Space Models and RWKB* 01:15:48 Inference Race and Price Wars* 01:22:16 Major AI Themes of the Year* 01:22:48 AI Rewind: January to March* 01:26:42 AI Rewind: April to June* 01:33:12 AI Rewind: July to September* 01:34:59 AI Rewind: October to December* 01:39:53 Year-End Reflections and PredictionsTranscript[00:00:00] Welcome to the 100th Episode![00:00:00] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co host Swyx for the 100th time today.[00:00:12] swyx: Yay, um, and we're so glad that, yeah, you know, everyone has, uh, followed us in this journey. How do you feel about it? 100 episodes.[00:00:19] Alessio: Yeah, I know.[00:00:19] Reflecting on the Journey[00:00:19] Alessio: Almost two years that we've been doing this. We've had four different studios. Uh, we've had a lot of changes. You know, we used to do this lightning round. When we first started that we didn't like, and we tried to change the question. The answer[00:00:32] swyx: was cursor and perplexity.[00:00:34] Alessio: Yeah, I love mid journey. It's like, do you really not like anything else?[00:00:38] Alessio: Like what's, what's the unique thing? And I think, yeah, we, we've also had a lot more research driven content. You know, we had like 3DAO, we had, you know. Jeremy Howard, we had more folks like that.[00:00:47] AI Engineering: The Rise and Impact[00:00:47] Alessio: I think we want to do more of that too in the new year, like having, uh, some of the Gemini folks, both on the research and the applied side.[00:00:54] Alessio: Yeah, but it's been a ton of fun. I think we both started, I wouldn't say as a joke, we were kind of like, Oh, we [00:01:00] should do a podcast. And I think we kind of caught the right wave, obviously. And I think your rise of the AI engineer posts just kind of get people. Sombra to congregate, and then the AI engineer summit.[00:01:11] Alessio: And that's why when I look at our growth chart, it's kind of like a proxy for like the AI engineering industry as a whole, which is almost like, like, even if we don't do that much, we keep growing just because there's so many more AI engineers. So did you expect that growth or did you expect that would take longer for like the AI engineer thing to kind of like become, you know, everybody talks about it today.[00:01:32] swyx: So, the sign of that, that we have won is that Gartner puts it at the top of the hype curve right now. So Gartner has called the peak in AI engineering. I did not expect, um, to what level. I knew that I was correct when I called it because I did like two months of work going into that. But I didn't know, You know, how quickly it could happen, and obviously there's a chance that I could be wrong.[00:01:52] swyx: But I think, like, most people have come around to that concept. Hacker News hates it, which is a good sign. But there's enough people that have defined it, you know, GitHub, when [00:02:00] they launched GitHub Models, which is the Hugging Face clone, they put AI engineers in the banner, like, above the fold, like, in big So I think it's like kind of arrived as a meaningful and useful definition.[00:02:12] swyx: I think people are trying to figure out where the boundaries are. I think that was a lot of the quote unquote drama that happens behind the scenes at the World's Fair in June. Because I think there's a lot of doubt or questions about where ML engineering stops and AI engineering starts. That's a useful debate to be had.[00:02:29] swyx: In some sense, I actually anticipated that as well. So I intentionally did not. Put a firm definition there because most of the successful definitions are necessarily underspecified and it's actually useful to have different perspectives and you don't have to specify everything from the outset.[00:02:45] Alessio: Yeah, I was at um, AWS reInvent and the line to get into like the AI engineering talk, so to speak, which is, you know, applied AI and whatnot was like, there are like hundreds of people just in line to go in.[00:02:56] Alessio: I think that's kind of what enabled me. People, right? Which is what [00:03:00] you kind of talked about. It's like, Hey, look, you don't actually need a PhD, just, yeah, just use the model. And then maybe we'll talk about some of the blind spots that you get as an engineer with the earlier posts that we also had on on the sub stack.[00:03:11] Alessio: But yeah, it's been a heck of a heck of a two years.[00:03:14] swyx: Yeah.[00:03:15] Latent Space Live and AI Conferences[00:03:15] swyx: You know, I was, I was trying to view the conference as like, so NeurIPS is I think like 16, 17, 000 people. And the Latent Space Live event that we held there was 950 signups. I think. The AI world, the ML world is still very much research heavy. And that's as it should be because ML is very much in a research phase.[00:03:34] swyx: But as we move this entire field into production, I think that ratio inverts into becoming more engineering heavy. So at least I think engineering should be on the same level, even if it's never as prestigious, like it'll always be low status because at the end of the day, you're manipulating APIs or whatever.[00:03:51] swyx: But Yeah, wrapping GPTs, but there's going to be an increasing stack and an art to doing these, these things well. And I, you know, I [00:04:00] think that's what we're focusing on for the podcast, the conference and basically everything I do seems to make sense. And I think we'll, we'll talk about the trends here that apply.[00:04:09] swyx: It's, it's just very strange. So, like, there's a mix of, like, keeping on top of research while not being a researcher and then putting that research into production. So, like, people always ask me, like, why are you covering Neuralibs? Like, this is a ML research conference and I'm like, well, yeah, I mean, we're not going to, to like, understand everything Or reproduce every single paper, but the stuff that is being found here is going to make it through into production at some point, you hope.[00:04:32] swyx: And then actually like when I talk to the researchers, they actually get very excited because they're like, oh, you guys are actually caring about how this goes into production and that's what they really really want. The measure of success is previously just peer review, right? Getting 7s and 8s on their um, Academic review conferences and stuff like citations is one metric, but money is a better metric.[00:04:51] Alessio: Money is a better metric. Yeah, and there were about 2200 people on the live stream or something like that. Yeah, yeah. Hundred on the live stream. So [00:05:00] I try my best to moderate, but it was a lot spicier in person with Jonathan and, and Dylan. Yeah, that it was in the chat on YouTube.[00:05:06] swyx: I would say that I actually also created.[00:05:09] swyx: Layen Space Live in order to address flaws that are perceived in academic conferences. This is not NeurIPS specific, it's ICML, NeurIPS. Basically, it's very sort of oriented towards the PhD student, uh, market, job market, right? Like literally all, basically everyone's there to advertise their research and skills and get jobs.[00:05:28] swyx: And then obviously all the, the companies go there to hire them. And I think that's great for the individual researchers, but for people going there to get info is not great because you have to read between the lines, bring a ton of context in order to understand every single paper. So what is missing is effectively what I ended up doing, which is domain by domain, go through and recap the best of the year.[00:05:48] swyx: Survey the field. And there are, like NeurIPS had a, uh, I think ICML had a like a position paper track, NeurIPS added a benchmarks, uh, datasets track. These are ways in which to address that [00:06:00] issue. Uh, there's always workshops as well. Every, every conference has, you know, a last day of workshops and stuff that provide more of an overview.[00:06:06] swyx: But they're not specifically prompted to do so. And I think really, uh, Organizing a conference is just about getting good speakers and giving them the correct prompts. And then they will just go and do that thing and they do a very good job of it. So I think Sarah did a fantastic job with the startups prompt.[00:06:21] swyx: I can't list everybody, but we did best of 2024 in startups, vision, open models. Post transformers, synthetic data, small models, and agents. And then the last one was the, uh, and then we also did a quick one on reasoning with Nathan Lambert. And then the last one, obviously, was the debate that people were very hyped about.[00:06:39] swyx: It was very awkward. And I'm really, really thankful for John Franco, basically, who stepped up to challenge Dylan. Because Dylan was like, yeah, I'll do it. But He was pro scaling. And I think everyone who is like in AI is pro scaling, right? So you need somebody who's ready to publicly say, no, we've hit a wall.[00:06:57] swyx: So that means you're saying Sam Altman's wrong. [00:07:00] You're saying, um, you know, everyone else is wrong. It helps that this was the day before Ilya went on, went up on stage and then said pre training has hit a wall. And data has hit a wall. So actually Jonathan ended up winning, and then Ilya supported that statement, and then Noam Brown on the last day further supported that statement as well.[00:07:17] swyx: So it's kind of interesting that I think the consensus kind of going in was that we're not done scaling, like you should believe in a better lesson. And then, four straight days in a row, you had Sepp Hochreiter, who is the creator of the LSTM, along with everyone's favorite OG in AI, which is Juergen Schmidhuber.[00:07:34] swyx: He said that, um, we're pre trading inside a wall, or like, we've run into a different kind of wall. And then we have, you know John Frankel, Ilya, and then Noam Brown are all saying variations of the same thing, that we have hit some kind of wall in the status quo of what pre trained, scaling large pre trained models has looked like, and we need a new thing.[00:07:54] swyx: And obviously the new thing for people is some make, either people are calling it inference time compute or test time [00:08:00] compute. I think the collective terminology has been inference time, and I think that makes sense because test time, calling it test, meaning, has a very pre trained bias, meaning that the only reason for running inference at all is to test your model.[00:08:11] swyx: That is not true. Right. Yeah. So, so, I quite agree that. OpenAI seems to have adopted, or the community seems to have adopted this terminology of ITC instead of TTC. And that, that makes a lot of sense because like now we care about inference, even right down to compute optimality. Like I actually interviewed this author who recovered or reviewed the Chinchilla paper.[00:08:31] swyx: Chinchilla paper is compute optimal training, but what is not stated in there is it's pre trained compute optimal training. And once you start caring about inference, compute optimal training, you have a different scaling law. And in a way that we did not know last year.[00:08:45] Alessio: I wonder, because John is, he's also on the side of attention is all you need.[00:08:49] Alessio: Like he had the bet with Sasha. So I'm curious, like he doesn't believe in scaling, but he thinks the transformer, I wonder if he's still. So, so,[00:08:56] swyx: so he, obviously everything is nuanced and you know, I told him to play a character [00:09:00] for this debate, right? So he actually does. Yeah. He still, he still believes that we can scale more.[00:09:04] swyx: Uh, he just assumed the character to be very game for, for playing this debate. So even more kudos to him that he assumed a position that he didn't believe in and still won the debate.[00:09:16] Alessio: Get rekt, Dylan. Um, do you just want to quickly run through some of these things? Like, uh, Sarah's presentation, just the highlights.[00:09:24] swyx: Yeah, we can't go through everyone's slides, but I pulled out some things as a factor of, like, stuff that we were going to talk about. And we'll[00:09:30] Alessio: publish[00:09:31] swyx: the rest. Yeah, we'll publish on this feed the best of 2024 in those domains. And hopefully people can benefit from the work that our speakers have done.[00:09:39] swyx: But I think it's, uh, these are just good slides. And I've been, I've been looking for a sort of end of year recaps from, from people.[00:09:44] The Competitive AI Landscape[00:09:44] swyx: The field has progressed a lot. You know, I think the max ELO in 2023 on LMSys used to be 1200 for LMSys ELOs. And now everyone is at least at, uh, 1275 in their ELOs, and this is across Gemini, Chadjibuti, [00:10:00] Grok, O1.[00:10:01] swyx: ai, which with their E Large model, and Enthopic, of course. It's a very, very competitive race. There are multiple Frontier labs all racing, but there is a clear tier zero Frontier. And then there's like a tier one. It's like, I wish I had everything else. Tier zero is extremely competitive. It's effectively now three horse race between Gemini, uh, Anthropic and OpenAI.[00:10:21] swyx: I would say that people are still holding out a candle for XAI. XAI, I think, for some reason, because their API was very slow to roll out, is not included in these metrics. So it's actually quite hard to put on there. As someone who also does charts, XAI is continually snubbed because they don't work well with the benchmarking people.[00:10:42] swyx: Yeah, yeah, yeah. It's a little trivia for why XAI always gets ignored. The other thing is market share. So these are slides from Sarah. We have it up on the screen. It has gone from very heavily open AI. So we have some numbers and estimates. These are from RAMP. Estimates of open AI market share in [00:11:00] December 2023.[00:11:01] swyx: And this is basically, what is it, GPT being 95 percent of production traffic. And I think if you correlate that with stuff that we asked. Harrison Chase on the LangChain episode, it was true. And then CLAUD 3 launched mid middle of this year. I think CLAUD 3 launched in March, CLAUD 3. 5 Sonnet was in June ish.[00:11:23] swyx: And you can start seeing the market share shift towards opening, uh, towards that topic, uh, very, very aggressively. The more recent one is Gemini. So if I scroll down a little bit, this is an even more recent dataset. So RAM's dataset ends in September 2 2. 2024. Gemini has basically launched a price war at the low end, uh, with Gemini Flash, uh, being basically free for personal use.[00:11:44] swyx: Like, I think people don't understand the free tier. It's something like a billion tokens per day. Unless you're trying to abuse it, you cannot really exhaust your free tier on Gemini. They're really trying to get you to use it. They know they're in like third place, um, fourth place, depending how you, how you count.[00:11:58] swyx: And so they're going after [00:12:00] the Lower tier first, and then, you know, maybe the upper tier later, but yeah, Gemini Flash, according to OpenRouter, is now 50 percent of their OpenRouter requests. Obviously, these are the small requests. These are small, cheap requests that are mathematically going to be more.[00:12:15] swyx: The smart ones obviously are still going to OpenAI. But, you know, it's a very, very big shift in the market. Like basically 2023, 2022, To going into 2024 opening has gone from nine five market share to Yeah. Reasonably somewhere between 50 to 75 market share.[00:12:29] Alessio: Yeah. I'm really curious how ramped does the attribution to the model?[00:12:32] Alessio: If it's API, because I think it's all credit card spin. . Well, but it's all, the credit card doesn't say maybe. Maybe the, maybe when they do expenses, they upload the PDF, but yeah, the, the German I think makes sense. I think that was one of my main 2024 takeaways that like. The best small model companies are the large labs, which is not something I would have thought that the open source kind of like long tail would be like the small model.[00:12:53] swyx: Yeah, different sizes of small models we're talking about here, right? Like so small model here for Gemini is AB, [00:13:00] right? Uh, mini. We don't know what the small model size is, but yeah, it's probably in the double digits or maybe single digits, but probably double digits. The open source community has kind of focused on the one to three B size.[00:13:11] swyx: Mm-hmm . Yeah. Maybe[00:13:12] swyx: zero, maybe 0.5 B uh, that's moon dream and that is small for you then, then that's great. It makes sense that we, we have a range for small now, which is like, may, maybe one to five B. Yeah. I'll even put that at, at, at the high end. And so this includes Gemma from Gemini as well. But also includes the Apple Foundation models, which I think Apple Foundation is 3B.[00:13:32] Alessio: Yeah. No, that's great. I mean, I think in the start small just meant cheap. I think today small is actually a more nuanced discussion, you know, that people weren't really having before.[00:13:43] swyx: Yeah, we can keep going. This is a slide that I smiley disagree with Sarah. She's pointing to the scale SEAL leaderboard. I think the Researchers that I talked with at NeurIPS were kind of positive on this because basically you need private test [00:14:00] sets to prevent contamination.[00:14:02] swyx: And Scale is one of maybe three or four people this year that has really made an effort in doing a credible private test set leaderboard. Llama405B does well compared to Gemini and GPT 40. And I think that's good. I would say that. You know, it's good to have an open model that is that big, that does well on those metrics.[00:14:23] swyx: But anyone putting 405B in production will tell you, if you scroll down a little bit to the artificial analysis numbers, that it is very slow and very expensive to infer. Um, it doesn't even fit on like one node. of, uh, of H100s. Cerebras will be happy to tell you they can serve 4 or 5B on their super large chips.[00:14:42] swyx: But, um, you know, if you need to do anything custom to it, you're still kind of constrained. So, is 4 or 5B really that relevant? Like, I think most people are basically saying that they only use 4 or 5B as a teacher model to distill down to something. Even Meta is doing it. So with Lama 3. [00:15:00] 3 launched, they only launched the 70B because they use 4 or 5B to distill the 70B.[00:15:03] swyx: So I don't know if like open source is keeping up. I think they're the, the open source industrial complex is very invested in telling you that the, if the gap is narrowing, I kind of disagree. I think that the gap is widening with O1. I think there are very, very smart people trying to narrow that gap and they should.[00:15:22] swyx: I really wish them success, but you cannot use a chart that is nearing 100 in your saturation chart. And look, the distance between open source and closed source is narrowing. Of course it's going to narrow because you're near 100. This is stupid. But in metrics that matter, is open source narrowing?[00:15:38] swyx: Probably not for O1 for a while. And it's really up to the open source guys to figure out if they can match O1 or not.[00:15:46] Alessio: I think inference time compute is bad for open source just because, you know, Doc can donate the flops at training time, but he cannot donate the flops at inference time. So it's really hard to like actually keep up on that axis.[00:15:59] Alessio: Big, big business [00:16:00] model shift. So I don't know what that means for the GPU clouds. I don't know what that means for the hyperscalers, but obviously the big labs have a lot of advantage. Because, like, it's not a static artifact that you're putting the compute in. You're kind of doing that still, but then you're putting a lot of computed inference too.[00:16:17] swyx: Yeah, yeah, yeah. Um, I mean, Llama4 will be reasoning oriented. We talked with Thomas Shalom. Um, kudos for getting that episode together. That was really nice. Good, well timed. Actually, I connected with the AI meta guy, uh, at NeurIPS, and, um, yeah, we're going to coordinate something for Llama4. Yeah, yeah,[00:16:32] Alessio: and our friend, yeah.[00:16:33] Alessio: Clara Shi just joined to lead the business agent side. So I'm sure we'll have her on in the new year.[00:16:39] swyx: Yeah. So, um, my comment on, on the business model shift, this is super interesting. Apparently it is wide knowledge that OpenAI wanted more than 6. 6 billion dollars for their fundraise. They wanted to raise, you know, higher, and they did not.[00:16:51] swyx: And what that means is basically like, it's very convenient that we're not getting GPT 5, which would have been a larger pre train. We should have a lot of upfront money. And [00:17:00] instead we're, we're converting fixed costs into variable costs, right. And passing it on effectively to the customer. And it's so much easier to take margin there because you can directly attribute it to like, Oh, you're using this more.[00:17:12] swyx: Therefore you, you pay more of the cost and I'll just slap a margin in there. So like that lets you control your growth margin and like tie your. Your spend, or your sort of inference spend, accordingly. And it's just really interesting to, that this change in the sort of inference paradigm has arrived exactly at the same time that the funding environment for pre training is effectively drying up, kind of.[00:17:36] swyx: I feel like maybe the VCs are very in tune with research anyway, so like, they would have noticed this, but, um, it's just interesting.[00:17:43] Alessio: Yeah, and I was looking back at our yearly recap of last year. Yeah. And the big thing was like the mixed trial price fights, you know, and I think now it's almost like there's nowhere to go, like, you know, Gemini Flash is like basically giving it away for free.[00:17:55] Alessio: So I think this is a good way for the labs to generate more revenue and pass down [00:18:00] some of the compute to the customer. I think they're going to[00:18:02] swyx: keep going. I think that 2, will come.[00:18:05] Alessio: Yeah, I know. Totally. I mean, next year, the first thing I'm doing is signing up for Devin. Signing up for the pro chat GBT.[00:18:12] Alessio: Just to try. I just want to see what does it look like to spend a thousand dollars a month on AI?[00:18:17] swyx: Yes. Yes. I think if your, if your, your job is a, at least AI content creator or VC or, you know, someone who, whose job it is to stay on, stay on top of things, you should already be spending like a thousand dollars a month on, on stuff.[00:18:28] swyx: And then obviously easy to spend, hard to use. You have to actually use. The good thing is that actually Google lets you do a lot of stuff for free now. So like deep research. That they just launched. Uses a ton of inference and it's, it's free while it's in preview.[00:18:45] Alessio: Yeah. They need to put that in Lindy.[00:18:47] Alessio: I've been using Lindy lately. I've been a built a bunch of things once we had flow because I liked the new thing. It's pretty good. I even did a phone call assistant. Um, yeah, they just launched Lindy voice. Yeah, I think once [00:19:00] they get advanced voice mode like capability today, still like speech to text, you can kind of tell.[00:19:06] Alessio: Um, but it's good for like reservations and things like that. So I have a meeting prepper thing. And so[00:19:13] swyx: it's good. Okay. I feel like we've, we've covered a lot of stuff. Uh, I, yeah, I, you know, I think We will go over the individual, uh, talks in a separate episode. Uh, I don't want to take too much time with, uh, this stuff, but that suffice to say that there is a lot of progress in each field.[00:19:28] swyx: Uh, we covered vision. Basically this is all like the audience voting for what they wanted. And then I just invited the best people I could find in each audience, especially agents. Um, Graham, who I talked to at ICML in Vienna, he is currently still number one. It's very hard to stay on top of SweetBench.[00:19:45] swyx: OpenHand is currently still number one. switchbench full, which is the hardest one. He had very good thoughts on agents, which I, which I'll highlight for people. Everyone is saying 2025 is the year of agents, just like they said last year. And, uh, but he had [00:20:00] thoughts on like eight parts of what are the frontier problems to solve in agents.[00:20:03] swyx: And so I'll highlight that talk as well.[00:20:05] Alessio: Yeah. The number six, which is the Hacken agents learn more about the environment, has been a Super interesting to us as well, just to think through, because, yeah, how do you put an agent in an enterprise where most things in an enterprise have never been public, you know, a lot of the tooling, like the code bases and things like that.[00:20:23] Alessio: So, yeah, there's not indexing and reg. Well, yeah, but it's more like. You can't really rag things that are not documented. But people know them based on how they've been doing it. You know, so I think there's almost this like, you know, Oh, institutional knowledge. Yeah, the boring word is kind of like a business process extraction.[00:20:38] Alessio: Yeah yeah, I see. It's like, how do you actually understand how these things are done? I see. Um, and I think today the, the problem is that, Yeah, the agents are, that most people are building are good at following instruction, but are not as good as like extracting them from you. Um, so I think that will be a big unlock just to touch quickly on the Jeff Dean thing.[00:20:55] Alessio: I thought it was pretty, I mean, we'll link it in the, in the things, but. I think the main [00:21:00] focus was like, how do you use ML to optimize the systems instead of just focusing on ML to do something else? Yeah, I think speculative decoding, we had, you know, Eugene from RWKB on the podcast before, like he's doing a lot of that with Fetterless AI.[00:21:12] swyx: Everyone is. I would say it's the norm. I'm a little bit uncomfortable with how much it costs, because it does use more of the GPU per call. But because everyone is so keen on fast inference, then yeah, makes sense.[00:21:24] Alessio: Exactly. Um, yeah, but we'll link that. Obviously Jeff is great.[00:21:30] swyx: Jeff is, Jeff's talk was more, it wasn't focused on Gemini.[00:21:33] swyx: I think people got the wrong impression from my tweet. It's more about how Google approaches ML and uses ML to design systems and then systems feedback into ML. And I think this ties in with Lubna's talk.[00:21:45] Synthetic Data and Future Trends[00:21:45] swyx: on synthetic data where it's basically the story of bootstrapping of humans and AI in AI research or AI in production.[00:21:53] swyx: So her talk was on synthetic data, where like how much synthetic data has grown in 2024 in the pre training side, the post training side, [00:22:00] and the eval side. And I think Jeff then also extended it basically to chips, uh, to chip design. So he'd spend a lot of time talking about alpha chip. And most of us in the audience are like, we're not working on hardware, man.[00:22:11] swyx: Like you guys are great. TPU is great. Okay. We'll buy TPUs.[00:22:14] Alessio: And then there was the earlier talk. Yeah. But, and then we have, uh, I don't know if we're calling them essays. What are we calling these? But[00:22:23] swyx: for me, it's just like bonus for late in space supporters, because I feel like they haven't been getting anything.[00:22:29] swyx: And then I wanted a more high frequency way to write stuff. Like that one I wrote in an afternoon. I think basically we now have an answer to what Ilya saw. It's one year since. The blip. And we know what he saw in 2014. We know what he saw in 2024. We think we know what he sees in 2024. He gave some hints and then we have vague indications of what he saw in 2023.[00:22:54] swyx: So that was the Oh, and then 2016 as well, because of this lawsuit with Elon, OpenAI [00:23:00] is publishing emails from Sam's, like, his personal text messages to Siobhan, Zelis, or whatever. So, like, we have emails from Ilya saying, this is what we're seeing in OpenAI, and this is why we need to scale up GPUs. And I think it's very prescient in 2016 to write that.[00:23:16] swyx: And so, like, it is exactly, like, basically his insights. It's him and Greg, basically just kind of driving the scaling up of OpenAI, while they're still playing Dota. They're like, no, like, we see the path here.[00:23:30] Alessio: Yeah, and it's funny, yeah, they even mention, you know, we can only train on 1v1 Dota. We need to train on 5v5, and that takes too many GPUs.[00:23:37] Alessio: Yeah,[00:23:37] swyx: and at least for me, I can speak for myself, like, I didn't see the path from Dota to where we are today. I think even, maybe if you ask them, like, they wouldn't necessarily draw a straight line. Yeah,[00:23:47] Alessio: no, definitely. But I think like that was like the whole idea of almost like the RL and we talked about this with Nathan on his podcast.[00:23:55] Alessio: It's like with RL, you can get very good at specific things, but then you can't really like generalize as much. And I [00:24:00] think the language models are like the opposite, which is like, you're going to throw all this data at them and scale them up, but then you really need to drive them home on a specific task later on.[00:24:08] Alessio: And we'll talk about the open AI reinforcement, fine tuning, um, announcement too, and all of that. But yeah, I think like scale is all you need. That's kind of what Elia will be remembered for. And I think just maybe to clarify on like the pre training is over thing that people love to tweet. I think the point of the talk was like everybody, we're scaling these chips, we're scaling the compute, but like the second ingredient which is data is not scaling at the same rate.[00:24:35] Alessio: So it's not necessarily pre training is over. It's kind of like What got us here won't get us there. In his email, he predicted like 10x growth every two years or something like that. And I think maybe now it's like, you know, you can 10x the chips again, but[00:24:49] swyx: I think it's 10x per year. Was it? I don't know.[00:24:52] Alessio: Exactly. And Moore's law is like 2x. So it's like, you know, much faster than that. And yeah, I like the fossil fuel of AI [00:25:00] analogy. It's kind of like, you know, the little background tokens thing. So the OpenAI reinforcement fine tuning is basically like, instead of fine tuning on data, you fine tune on a reward model.[00:25:09] Alessio: So it's basically like, instead of being data driven, it's like task driven. And I think people have tasks to do, they don't really have a lot of data. So I'm curious to see how that changes, how many people fine tune, because I think this is what people run into. It's like, Oh, you can fine tune llama. And it's like, okay, where do I get the data?[00:25:27] Alessio: To fine tune it on, you know, so it's great that we're moving the thing. And then I really like he had this chart where like, you know, the brain mass and the body mass thing is basically like mammals that scaled linearly by brain and body size, and then humans kind of like broke off the slope. So it's almost like maybe the mammal slope is like the pre training slope.[00:25:46] Alessio: And then the post training slope is like the, the human one.[00:25:49] swyx: Yeah. I wonder what the. I mean, we'll know in 10 years, but I wonder what the y axis is for, for Ilya's SSI. We'll try to get them on.[00:25:57] Alessio: Ilya, if you're listening, you're [00:26:00] welcome here. Yeah, and then he had, you know, what comes next, like agent, synthetic data, inference, compute, I thought all of that was like that.[00:26:05] Alessio: I don't[00:26:05] swyx: think he was dropping any alpha there. Yeah, yeah, yeah.[00:26:07] Alessio: Yeah. Any other new reps? Highlights?[00:26:10] swyx: I think that there was comparatively a lot more work. Oh, by the way, I need to plug that, uh, my friend Yi made this, like, little nice paper. Yeah, that was really[00:26:20] swyx: nice.[00:26:20] swyx: Uh, of, uh, of, like, all the, he's, she called it must read papers of 2024.[00:26:26] swyx: So I laid out some of these at NeurIPS, and it was just gone. Like, everyone just picked it up. Because people are dying for, like, little guidance and visualizations And so, uh, I thought it was really super nice that we got there.[00:26:38] Alessio: Should we do a late in space book for each year? Uh, I thought about it. For each year we should.[00:26:42] Alessio: Coffee table book. Yeah. Yeah. Okay. Put it in the will. Hi, Will. By the way, we haven't introduced you. He's our new, you know, general organist, Jamie. You need to[00:26:52] swyx: pull up more things. One thing I saw that, uh, Okay, one fun one, and then one [00:27:00] more general one. So the fun one is this paper on agent collusion. This is a paper on steganography.[00:27:06] swyx: This is secret collusion among AI agents, multi agent deception via steganography. I tried to go to NeurIPS in order to find these kinds of papers because the real reason Like NeurIPS this year has a lottery system. A lot of people actually even go and don't buy tickets because they just go and attend the side events.[00:27:22] swyx: And then also the people who go and end up crowding around the most popular papers, which you already know and already read them before you showed up to NeurIPS. So the only reason you go there is to talk to the paper authors, but there's like something like 10, 000 other. All these papers out there that, you know, are just people's work that they, that they did on the air and they failed to get attention for one reason or another.[00:27:42] swyx: And this was one of them. Uh, it was like all the way at the back. And this is a deep mind paper that actually focuses on collusion between AI agents, uh, by hiding messages in the text that they generate. Uh, so that's what steganography is. So a very simple example would be the first letter of every word.[00:27:57] swyx: If you Pick that out, you know, and the code sends a [00:28:00] different message than that. But something I've always emphasized is to LLMs, we read left to right. LLMs can read up, down, sideways, you know, in random character order. And it's the same to them as it is to us. So if we were ever to get You know, self motivated, underlined LLMs that we're trying to collaborate to take over the planet.[00:28:19] swyx: This would be how they do it. They spread messages among us in the messages that we generate. And he developed a scaling law for that. So he marked, I'm showing it on screen right now, the emergence of this phenomenon. Basically, for example, for Cypher encoding, GPT 2, Lama 2, mixed trial, GPT 3. 5, zero capabilities, and sudden 4.[00:28:40] swyx: And this is the kind of Jason Wei type emergence properties that people kind of look for. I think what made this paper stand out as well, so he developed the benchmark for steganography collusion, and he also focused on shelling point collusion, which is very low coordination. For agreeing on a decoding encoding format, you kind of need to have some [00:29:00] agreement on that.[00:29:00] swyx: But, but shelling point means like very, very low or almost no coordination. So for example, if I, if I ask someone, if the only message I give you is meet me in New York and you're not aware. Or when you would probably meet me at Grand Central Station. That is the Grand Central Station is a shelling point.[00:29:16] swyx: And it's probably somewhere, somewhere during the day. That is the shelling point of New York is Grand Central. To that extent, shelling points for steganography are things like the, the, the common decoding methods that we talked about. It will be interesting at some point in the future when we are worried about alignment.[00:29:30] swyx: It is not interesting today, but it's interesting that DeepMind is already thinking about this.[00:29:36] Alessio: I think that's like one of the hardest things about NeurIPS. It's like the long tail. I[00:29:41] swyx: found a pricing guy. I'm going to feature him on the podcast. Basically, this guy from NVIDIA worked out the optimal pricing for language models.[00:29:51] swyx: It's basically an econometrics paper at NeurIPS, where everyone else is talking about GPUs. And the guy with the GPUs is[00:29:57] Alessio: talking[00:29:57] swyx: about economics instead. [00:30:00] That was the sort of fun one. So the focus I saw is that model papers at NeurIPS are kind of dead. No one really presents models anymore. It's just data sets.[00:30:12] swyx: This is all the grad students are working on. So like there was a data sets track and then I was looking around like, I was like, you don't need a data sets track because every paper is a data sets paper. And so data sets and benchmarks, they're kind of flip sides of the same thing. So Yeah. Cool. Yeah, if you're a grad student, you're a GPU boy, you kind of work on that.[00:30:30] swyx: And then the, the sort of big model that people walk around and pick the ones that they like, and then they use it in their models. And that's, that's kind of how it develops. I, I feel like, um, like, like you didn't last year, you had people like Hao Tian who worked on Lava, which is take Lama and add Vision.[00:30:47] swyx: And then obviously actually I hired him and he added Vision to Grok. Now he's the Vision Grok guy. This year, I don't think there was any of those.[00:30:55] Alessio: What were the most popular, like, orals? Last year it was like the [00:31:00] Mixed Monarch, I think, was like the most attended. Yeah, uh, I need to look it up. Yeah, I mean, if nothing comes to mind, that's also kind of like an answer in a way.[00:31:10] Alessio: But I think last year there was a lot of interest in, like, furthering models and, like, different architectures and all of that.[00:31:16] swyx: I will say that I felt the orals, oral picks this year were not very good. Either that or maybe it's just a So that's the highlight of how I have changed in terms of how I view papers.[00:31:29] swyx: So like, in my estimation, two of the best papers in this year for datasets or data comp and refined web or fine web. These are two actually industrially used papers, not highlighted for a while. I think DCLM got the spotlight, FineWeb didn't even get the spotlight. So like, it's just that the picks were different.[00:31:48] swyx: But one thing that does get a lot of play that a lot of people are debating is the role that's scheduled. This is the schedule free optimizer paper from Meta from Aaron DeFazio. And this [00:32:00] year in the ML community, there's been a lot of chat about shampoo, soap, all the bathroom amenities for optimizing your learning rates.[00:32:08] swyx: And, uh, most people at the big labs are. Who I asked about this, um, say that it's cute, but it's not something that matters. I don't know, but it's something that was discussed and very, very popular. 4Wars[00:32:19] Alessio: of AI recap maybe, just quickly. Um, where do you want to start? Data?[00:32:26] swyx: So to remind people, this is the 4Wars piece that we did as one of our earlier recaps of this year.[00:32:31] swyx: And the belligerents are on the left, journalists, writers, artists, anyone who owns IP basically, New York Times, Stack Overflow, Reddit, Getty, Sarah Silverman, George RR Martin. Yeah, and I think this year we can add Scarlett Johansson to that side of the fence. So anyone suing, open the eye, basically. I actually wanted to get a snapshot of all the lawsuits.[00:32:52] swyx: I'm sure some lawyer can do it. That's the data quality war. On the right hand side, we have the synthetic data people, and I think we talked about Lumna's talk, you know, [00:33:00] really showing how much synthetic data has come along this year. I think there was a bit of a fight between scale. ai and the synthetic data community, because scale.[00:33:09] swyx: ai published a paper saying that synthetic data doesn't work. Surprise, surprise, scale. ai is the leading vendor of non synthetic data. Only[00:33:17] Alessio: cage free annotated data is useful.[00:33:21] swyx: So I think there's some debate going on there, but I don't think it's much debate anymore that at least synthetic data, for the reasons that are blessed in Luna's talk, Makes sense.[00:33:32] swyx: I don't know if you have any perspectives there.[00:33:34] Alessio: I think, again, going back to the reinforcement fine tuning, I think that will change a little bit how people think about it. I think today people mostly use synthetic data, yeah, for distillation and kind of like fine tuning a smaller model from like a larger model.[00:33:46] Alessio: I'm not super aware of how the frontier labs use it outside of like the rephrase, the web thing that Apple also did. But yeah, I think it'll be. Useful. I think like whether or not that gets us the big [00:34:00] next step, I think that's maybe like TBD, you know, I think people love talking about data because it's like a GPU poor, you know, I think, uh, synthetic data is like something that people can do, you know, so they feel more opinionated about it compared to, yeah, the optimizers stuff, which is like,[00:34:17] swyx: they don't[00:34:17] Alessio: really work[00:34:18] swyx: on.[00:34:18] swyx: I think that there is an angle to the reasoning synthetic data. So this year, we covered in the paper club, the star series of papers. So that's star, Q star, V star. It basically helps you to synthesize reasoning steps, or at least distill reasoning steps from a verifier. And if you look at the OpenAI RFT, API that they released, or that they announced, basically they're asking you to submit graders, or they choose from a preset list of graders.[00:34:49] swyx: Basically It feels like a way to create valid synthetic data for them to fine tune their reasoning paths on. Um, so I think that is another angle where it starts to make sense. And [00:35:00] so like, it's very funny that basically all the data quality wars between Let's say the music industry or like the newspaper publishing industry or the textbooks industry on the big labs.[00:35:11] swyx: It's all of the pre training era. And then like the new era, like the reasoning era, like nobody has any problem with all the reasoning, especially because it's all like sort of math and science oriented with, with very reasonable graders. I think the more interesting next step is how does it generalize beyond STEM?[00:35:27] swyx: We've been using O1 for And I would say like for summarization and creative writing and instruction following, I think it's underrated. I started using O1 in our intro songs before we killed the intro songs, but it's very good at writing lyrics. You know, I can actually say like, I think one of the O1 pro demos.[00:35:46] swyx: All of these things that Noam was showing was that, you know, you can write an entire paragraph or three paragraphs without using the letter A, right?[00:35:53] Creative Writing with AI[00:35:53] swyx: So like, like literally just anything instead of token, like not even token level, character level manipulation and [00:36:00] counting and instruction following. It's, uh, it's very, very strong.[00:36:02] swyx: And so no surprises when I ask it to rhyme, uh, and to, to create song lyrics, it's going to do that very much better than in previous models. So I think it's underrated for creative writing.[00:36:11] Alessio: Yeah.[00:36:12] Legal and Ethical Issues in AI[00:36:12] Alessio: What do you think is the rationale that they're going to have in court when they don't show you the thinking traces of O1, but then they want us to, like, they're getting sued for using other publishers data, you know, but then on their end, they're like, well, you shouldn't be using my data to then train your model.[00:36:29] Alessio: So I'm curious to see how that kind of comes. Yeah, I mean, OPA has[00:36:32] swyx: many ways to publish, to punish people without bringing, taking them to court. Already banned ByteDance for distilling their, their info. And so anyone caught distilling the chain of thought will be just disallowed to continue on, on, on the API.[00:36:44] swyx: And it's fine. It's no big deal. Like, I don't even think that's an issue at all, just because the chain of thoughts are pretty well hidden. Like you have to work very, very hard to, to get it to leak. And then even when it leaks the chain of thought, you don't know if it's, if it's [00:37:00] The bigger concern is actually that there's not that much IP hiding behind it, that Cosign, which we talked about, we talked to him on Dev Day, can just fine tune 4.[00:37:13] swyx: 0 to beat 0. 1 Cloud SONET so far is beating O1 on coding tasks without, at least O1 preview, without being a reasoning model, same for Gemini Pro or Gemini 2. 0. So like, how much is reasoning important? How much of a moat is there in this, like, All of these are proprietary sort of training data that they've presumably accomplished.[00:37:34] swyx: Because even DeepSeek was able to do it. And they had, you know, two months notice to do this, to do R1. So, it's actually unclear how much moat there is. Obviously, you know, if you talk to the Strawberry team, they'll be like, yeah, I mean, we spent the last two years doing this. So, we don't know. And it's going to be Interesting because there'll be a lot of noise from people who say they have inference time compute and actually don't because they just have fancy chain of thought.[00:38:00][00:38:00] swyx: And then there's other people who actually do have very good chain of thought. And you will not see them on the same level as OpenAI because OpenAI has invested a lot in building up the mythology of their team. Um, which makes sense. Like the real answer is somewhere in between.[00:38:13] Alessio: Yeah, I think that's kind of like the main data war story developing.[00:38:18] The Data War: GPU Poor vs. GPU Rich[00:38:18] Alessio: GPU poor versus GPU rich. Yeah. Where do you think we are? I think there was, again, going back to like the small model thing, there was like a time in which the GPU poor were kind of like the rebel faction working on like these models that were like open and small and cheap. And I think today people don't really care as much about GPUs anymore.[00:38:37] Alessio: You also see it in the price of the GPUs. Like, you know, that market is kind of like plummeted because there's people don't want to be, they want to be GPU free. They don't even want to be poor. They just want to be, you know, completely without them. Yeah. How do you think about this war? You[00:38:52] swyx: can tell me about this, but like, I feel like the, the appetite for GPU rich startups, like the, you know, the, the funding plan is we will raise 60 million and [00:39:00] we'll give 50 of that to NVIDIA.[00:39:01] swyx: That is gone, right? Like, no one's, no one's pitching that. This was literally the plan, the exact plan of like, I can name like four or five startups, you know, this time last year. So yeah, GPU rich startups gone.[00:39:12] The Rise of GPU Ultra Rich[00:39:12] swyx: But I think like, The GPU ultra rich, the GPU ultra high net worth is still going. So, um, now we're, you know, we had Leopold's essay on the trillion dollar cluster.[00:39:23] swyx: We're not quite there yet. We have multiple labs, um, you know, XAI very famously, you know, Jensen Huang praising them for being. Best boy number one in spinning up 100, 000 GPU cluster in like 12 days or something. So likewise at Meta, likewise at OpenAI, likewise at the other labs as well. So like the GPU ultra rich are going to keep doing that because I think partially it's an article of faith now that you just need it.[00:39:46] swyx: Like you don't even know what it's going to, what you're going to use it for. You just, you just need it. And it makes sense that if, especially if we're going into. More researchy territory than we are. So let's say 2020 to 2023 was [00:40:00] let's scale big models territory because we had GPT 3 in 2020 and we were like, okay, we'll go from 1.[00:40:05] swyx: 75b to 1. 8b, 1. 8t. And that was GPT 3 to GPT 4. Okay, that's done. As far as everyone is concerned, Opus 3. 5 is not coming out, GPT 4. 5 is not coming out, and Gemini 2, we don't have Pro, whatever. We've hit that wall. Maybe I'll call it the 2 trillion perimeter wall. We're not going to 10 trillion. No one thinks it's a good idea, at least from training costs, from the amount of data, or at least the inference.[00:40:36] swyx: Would you pay 10x the price of GPT Probably not. Like, like you want something else that, that is at least more useful. So it makes sense that people are pivoting in terms of their inference paradigm.[00:40:47] Emerging Trends in AI Models[00:40:47] swyx: And so when it's more researchy, then you actually need more just general purpose compute to mess around with, uh, at the exact same time that production deployments of the old, the previous paradigm is still ramping up,[00:40:58] swyx: um,[00:40:58] swyx: uh, pretty aggressively.[00:40:59] swyx: So [00:41:00] it makes sense that the GPU rich are growing. We have now interviewed both together and fireworks and replicates. Uh, we haven't done any scale yet. But I think Amazon, maybe kind of a sleeper one, Amazon, in a sense of like they, at reInvent, I wasn't expecting them to do so well, but they are now a foundation model lab.[00:41:18] swyx: It's kind of interesting. Um, I think, uh, you know, David went over there and started just creating models.[00:41:25] Alessio: Yeah, I mean, that's the power of prepaid contracts. I think like a lot of AWS customers, you know, they do this big reserve instance contracts and now they got to use their money. That's why so many startups.[00:41:37] Alessio: Get bought through the AWS marketplace so they can kind of bundle them together and prefer pricing.[00:41:42] swyx: Okay, so maybe GPU super rich doing very well, GPU middle class dead, and then GPU[00:41:48] Alessio: poor. I mean, my thing is like, everybody should just be GPU rich. There shouldn't really be, even the GPU poorest, it's like, does it really make sense to be GPU poor?[00:41:57] Alessio: Like, if you're GPU poor, you should just use the [00:42:00] cloud. Yes, you know, and I think there might be a future once we kind of like figure out what the size and shape of these models is where like the tiny box and these things come to fruition where like you can be GPU poor at home. But I think today is like, why are you working so hard to like get these models to run on like very small clusters where it's like, It's so cheap to run them.[00:42:21] Alessio: Yeah, yeah,[00:42:22] swyx: yeah. I think mostly people think it's cool. People think it's a stepping stone to scaling up. So they aspire to be GPU rich one day and they're working on new methods. Like news research, like probably the most deep tech thing they've done this year is Distro or whatever the new name is.[00:42:38] swyx: There's a lot of interest in heterogeneous computing, distributed computing. I tend generally to de emphasize that historically, but it may be coming to a time where it is starting to be relevant. I don't know. You know, SF compute launched their compute marketplace this year, and like, who's really using that?[00:42:53] swyx: Like, it's a bunch of small clusters, disparate types of compute, and if you can make that [00:43:00] useful, then that will be very beneficial to the broader community, but maybe still not the source of frontier models. It's just going to be a second tier of compute that is unlocked for people, and that's fine. But yeah, I mean, I think this year, I would say a lot more on device, We are, I now have Apple intelligence on my phone.[00:43:19] swyx: Doesn't do anything apart from summarize my notifications. But still, not bad. Like, it's multi modal.[00:43:25] Alessio: Yeah, the notification summaries are so and so in my experience.[00:43:29] swyx: Yeah, but they add, they add juice to life. And then, um, Chrome Nano, uh, Gemini Nano is coming out in Chrome. Uh, they're still feature flagged, but you can, you can try it now if you, if you use the, uh, the alpha.[00:43:40] swyx: And so, like, I, I think, like, you know, We're getting the sort of GPU poor version of a lot of these things coming out, and I think it's like quite useful. Like Windows as well, rolling out RWKB in sort of every Windows department is super cool. And I think the last thing that I never put in this GPU poor war, that I think I should now, [00:44:00] is the number of startups that are GPU poor but still scaling very well, as sort of wrappers on top of either a foundation model lab, or GPU Cloud.[00:44:10] swyx: GPU Cloud, it would be Suno. Suno, Ramp has rated as one of the top ranked, fastest growing startups of the year. Um, I think the last public number is like zero to 20 million this year in ARR and Suno runs on Moto. So Suno itself is not GPU rich, but they're just doing the training on, on Moto, uh, who we've also talked to on, on the podcast.[00:44:31] swyx: The other one would be Bolt, straight cloud wrapper. And, and, um, Again, another, now they've announced 20 million ARR, which is another step up from our 8 million that we put on the title. So yeah, I mean, it's crazy that all these GPU pores are finding a way while the GPU riches are also finding a way. And then the only failures, I kind of call this the GPU smiling curve, where the edges do well, because you're either close to the machines, and you're like [00:45:00] number one on the machines, or you're like close to the customers, and you're number one on the customer side.[00:45:03] swyx: And the people who are in the middle. Inflection, um, character, didn't do that great. I think character did the best of all of them. Like, you have a note in here that we apparently said that character's price tag was[00:45:15] Alessio: 1B.[00:45:15] swyx: Did I say that?[00:45:16] Alessio: Yeah. You said Google should just buy them for 1B. I thought it was a crazy number.[00:45:20] Alessio: Then they paid 2. 7 billion. I mean, for like,[00:45:22] swyx: yeah.[00:45:22] Alessio: What do you pay for node? Like, I don't know what the game world was like. Maybe the starting price was 1B. I mean, whatever it was, it worked out for everybody involved.[00:45:31] The Multi-Modality War[00:45:31] Alessio: Multimodality war. And this one, we never had text to video in the first version, which now is the hottest.[00:45:37] swyx: Yeah, I would say it's a subset of image, but yes.[00:45:40] Alessio: Yeah, well, but I think at the time it wasn't really something people were doing, and now we had VO2 just came out yesterday. Uh, Sora was released last month, last week. I've not tried Sora, because the day that I tried, it wasn't, yeah. I[00:45:54] swyx: think it's generally available now, you can go to Sora.[00:45:56] swyx: com and try it. Yeah, they had[00:45:58] Alessio: the outage. Which I [00:46:00] think also played a part into it. Small things. Yeah. What's the other model that you posted today that was on Replicate? Video or OneLive?[00:46:08] swyx: Yeah. Very, very nondescript name, but it is from Minimax, which I think is a Chinese lab. The Chinese labs do surprisingly well at the video models.[00:46:20] swyx: I'm not sure it's actually Chinese. I don't know. Hold me up to that. Yep. China. It's good. Yeah, the Chinese love video. What can I say? They have a lot of training data for video. Or a more relaxed regulatory environment.[00:46:37] Alessio: Uh, well, sure, in some way. Yeah, I don't think there's much else there. I think like, you know, on the image side, I think it's still open.[00:46:45] Alessio: Yeah, I mean,[00:46:46] swyx: 11labs is now a unicorn. So basically, what is multi modality war? Multi modality war is, do you specialize in a single modality, right? Or do you have GodModel that does all the modalities? So this is [00:47:00] definitely still going, in a sense of 11 labs, you know, now Unicorn, PicoLabs doing well, they launched Pico 2.[00:47:06] swyx: 0 recently, HeyGen, I think has reached 100 million ARR, Assembly, I don't know, but they have billboards all over the place, so I assume they're doing very, very well. So these are all specialist models, specialist models and specialist startups. And then there's the big labs who are doing the sort of all in one play.[00:47:24] swyx: And then here I would highlight Gemini 2 for having native image output. Have you seen the demos? Um, yeah, it's, it's hard to keep up. Literally they launched this last week and a shout out to Paige Bailey, who came to the Latent Space event to demo on the day of launch. And she wasn't prepared. She was just like, I'm just going to show you.[00:47:43] swyx: So they have voice. They have, you know, obviously image input, and then they obviously can code gen and all that. But the new one that OpenAI and Meta both have but they haven't launched yet is image output. So you can literally, um, I think their demo video was that you put in an image of a [00:48:00] car, and you ask for minor modifications to that car.[00:48:02] swyx: They can generate you that modification exactly as you asked. So there's no need for the stable diffusion or comfy UI workflow of like mask here and then like infill there in paint there and all that, all that stuff. This is small model nonsense. Big model people are like, huh, we got you in as everything in the transformer.[00:48:21] swyx: This is the multimodality war, which is, do you, do you bet on the God model or do you string together a whole bunch of, uh, Small models like a, like a chump. Yeah,[00:48:29] Alessio: I don't know, man. Yeah, that would be interesting. I mean, obviously I use Midjourney for all of our thumbnails. Um, they've been doing a ton on the product, I would say.[00:48:38] Alessio: They launched a new Midjourney editor thing. They've been doing a ton. Because I think, yeah, the motto is kind of like, Maybe, you know, people say black forest, the black forest models are better than mid journey on a pixel by pixel basis. But I think when you put it, put it together, have you tried[00:48:53] swyx: the same problems on black forest?[00:48:55] Alessio: Yes. But the problem is just like, you know, on black forest, it generates one image. And then it's like, you got to [00:49:00] regenerate. You don't have all these like UI things. Like what I do, no, but it's like time issue, you know, it's like a mid[00:49:06] swyx: journey. Call the API four times.[00:49:08] Alessio: No, but then there's no like variate.[00:49:10] Alessio: Like the good thing about mid journey is like, you just go in there and you're cooking. There's a lot of stuff that just makes it really easy. And I think people underestimate that. Like, it's not really a skill issue, because I'm paying mid journey, so it's a Black Forest skill issue, because I'm not paying them, you know?[00:49:24] Alessio: Yeah,[00:49:25] swyx: so, okay, so, uh, this is a UX thing, right? Like, you, you, you understand that, at least, we think that Black Forest should be able to do all that stuff. I will also shout out, ReCraft has come out, uh, on top of the image arena that, uh, artificial analysis has done, has apparently, uh, Flux's place. Is this still true?[00:49:41] swyx: So, Artificial Analysis is now a company. I highlighted them I think in one of the early AI Newses of the year. And they have launched a whole bunch of arenas. So, they're trying to take on LM Arena, Anastasios and crew. And they have an image arena. Oh yeah, Recraft v3 is now beating Flux 1. 1. Which is very surprising [00:50:00] because Flux And Black Forest Labs are the old stable diffusion crew who left stability after, um, the management issues.[00:50:06] swyx: So Recurve has come from nowhere to be the top image model. Uh, very, very strange. I would also highlight that Grok has now launched Aurora, which is, it's very interesting dynamics between Grok and Black Forest Labs because Grok's images were originally launched, uh, in partnership with Black Forest Labs as a, as a thin wrapper.[00:50:24] swyx: And then Grok was like, no, we'll make our own. And so they've made their own. I don't know, there are no APIs or benchmarks about it. They just announced it. So yeah, that's the multi modality war. I would say that so far, the small model, the dedicated model people are winning, because they are just focused on their tasks.[00:50:42] swyx: But the big model, People are always catching up. And the moment I saw the Gemini 2 demo of image editing, where I can put in an image and just request it and it does, that's how AI should work. Not like a whole bunch of complicated steps. So it really is something. And I think one frontier that we haven't [00:51:00] seen this year, like obviously video has done very well, and it will continue to grow.[00:51:03] swyx: You know, we only have Sora Turbo today, but at some point we'll get full Sora. Oh, at least the Hollywood Labs will get Fulsora. We haven't seen video to audio, or video synced to audio. And so the researchers that I talked to are already starting to talk about that as the next frontier. But there's still maybe like five more years of video left to actually be Soda.[00:51:23] swyx: I would say that Gemini's approach Compared to OpenAI, Gemini seems, or DeepMind's approach to video seems a lot more fully fledged than OpenAI. Because if you look at the ICML recap that I published that so far nobody has listened to, um, that people have listened to it. It's just a different, definitely different audience.[00:51:43] swyx: It's only seven hours long. Why are people not listening? It's like everything in Uh, so, so DeepMind has, is working on Genie. They also launched Genie 2 and VideoPoet. So, like, they have maybe four years advantage on world modeling that OpenAI does not have. Because OpenAI basically only started [00:52:00] Diffusion Transformers last year, you know, when they hired, uh, Bill Peebles.[00:52:03] swyx: So, DeepMind has, has a bit of advantage here, I would say, in, in, in showing, like, the reason that VO2, while one, They cherry pick their videos. So obviously it looks better than Sora, but the reason I would believe that VO2, uh, when it's fully launched will do very well is because they have all this background work in video that they've done for years.[00:52:22] swyx: Like, like last year's NeurIPS, I already was interviewing some of their video people. I forget their model name, but for, for people who are dedicated fans, they can go to NeurIPS 2023 and see, see that paper.[00:52:32] Alessio: And then last but not least, the LLMOS. We renamed it to Ragops, formerly known as[00:52:39] swyx: Ragops War. I put the latest chart on the Braintrust episode.[00:52:43] swyx: I think I'm going to separate these essays from the episode notes. So the reason I used to do that, by the way, is because I wanted to show up on Hacker News. I wanted the podcast to show up on Hacker News. So I always put an essay inside of there because Hacker News people like to read and not listen.[00:52:58] Alessio: So episode essays,[00:52:59] swyx: I remember [00:53:00] purchasing them separately. You say Lanchain Llama Index is still growing.[00:53:03] Alessio: Yeah, so I looked at the PyPy stats, you know. I don't care about stars. On PyPy you see Do you want to share your screen? Yes. I prefer to look at actual downloads, not at stars on GitHub. So if you look at, you know, Lanchain still growing.[00:53:20] Alessio: These are the last six months. Llama Index still growing. What I've basically seen is like things that, One, obviously these things have A commercial product. So there's like people buying this and sticking with it versus kind of hopping in between things versus, you know, for example, crew AI, not really growing as much.[00:53:38] Alessio: The stars are growing. If you look on GitHub, like the stars are growing, but kind of like the usage is kind of like flat. In the last six months, have they done some[00:53:4
Happy holidays! We'll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all all our LS supporters who helped fund the gorgeous venue and A/V production!For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver.Of perennial interest, particularly at academic conferences, is scaled-up architecture research as people hunt for the next Attention Is All You Need. We have many names for them: “efficient models”, “retentive networks”, “subquadratic attention” or “linear attention” but some of them don't even have any lineage with attention - one of the best papers of this NeurIPS was Sepp Hochreiter's xLSTM, which has a particularly poetic significance as one of the creators of the LSTM returning to update and challenge the OG language model architecture:So, for lack of a better term, we decided to call this segment “the State of Post-Transformers” and fortunately everyone rolled with it.We are fortunate to have two powerful friends of the pod to give us an update here:* Together AI: with CEO Vipul Ved Prakash and CTO Ce Zhang joining us to talk about how they are building Together together as a quote unquote full stack AI startup, from the lowest level kernel and systems programming to the highest level mathematical abstractions driving new model architectures and inference algorithms, with notable industry contributions from RedPajama v2, Flash Attention 3, Mamba 2, Mixture of Agents, BASED, Sequoia, Evo, Dragonfly, Dan Fu's ThunderKittens and many more research projects this year* Recursal AI: with CEO Eugene Cheah who has helped lead the independent RWKV project while also running Featherless AI. This year, the team has shipped RWKV v5, codenamed Eagle, to 1.5 billion Windows 10 and Windows 11 machines worldwide, to support Microsoft's on-device, energy-usage-sensitive Windows Copilot usecases, and has launched the first updates on RWKV v6, codenamed Finch and GoldFinch. On the morning of Latent Space Live, they also announced QRWKV6, a Qwen 32B model modified with RWKV linear attention layers. We were looking to host a debate between our speakers, but given that both of them were working on post-transformers alternativesFull Talk on YoutubePlease like and subscribe!LinksAll the models and papers they picked:* Earlier Cited Work* Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention* Hungry hungry hippos: Towards language modeling with state space models* Hyena hierarchy: Towards larger convolutional language models* Mamba: Linear-Time Sequence Modeling with Selective State Spaces* S4: Efficiently Modeling Long Sequences with Structured State Spaces* Just Read Twice (Arora et al)* Recurrent large language models that compete with Transformers in language modeling perplexity are emerging at a rapid rate (e.g., Mamba, RWKV). Excitingly, these architectures use a constant amount of memory during inference. However, due to the limited memory, recurrent LMs cannot recall and use all the information in long contexts leading to brittle in-context learning (ICL) quality. A key challenge for efficient LMs is selecting what information to store versus discard. In this work, we observe the order in which information is shown to the LM impacts the selection difficulty. * To formalize this, we show that the hardness of information recall reduces to the hardness of a problem called set disjointness (SD), a quintessential problem in communication complexity that requires a streaming algorithm (e.g., recurrent model) to decide whether inputted sets are disjoint. We empirically and theoretically show that the recurrent memory required to solve SD changes with set order, i.e., whether the smaller set appears first in-context. * Our analysis suggests, to mitigate the reliance on data order, we can put information in the right order in-context or process prompts non-causally. Towards that end, we propose: (1) JRT-Prompt, where context gets repeated multiple times in the prompt, effectively showing the model all data orders. This gives 11.0±1.3 points of improvement, averaged across 16 recurrent LMs and the 6 ICL tasks, with 11.9× higher throughput than FlashAttention-2 for generation prefill (length 32k, batch size 16, NVidia H100). We then propose (2) JRT-RNN, which uses non-causal prefix-linear-attention to process prompts and provides 99% of Transformer quality at 360M params., 30B tokens and 96% at 1.3B params., 50B tokens on average across the tasks, with 19.2× higher throughput for prefill than FA2.* Jamba: A 52B Hybrid Transformer-Mamba Language Model* We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. * Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. * This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU.* Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. * We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.* SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers* We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: * (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. * (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. * (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. * (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. * As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. * RWKV: Reinventing RNNs for the Transformer Era* Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. * We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.* Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintains constant computational and memory complexity during inference. * We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.* LoLCATs: On Low-Rank Linearizing of Large Language Models* Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. * We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitudes less memory and compute. * We base these steps on two findings. * First, we can replace an LLM's softmax attentions with closely-approximating linear attentions, simply by training the linear attentions to match their softmax counterparts with an output MSE loss ("attention transfer").* Then, this enables adjusting for approximation errors and recovering LLM quality simply with low-rank adaptation (LoRA). * LoLCATs significantly improves linearizing quality, training efficiency, and scalability. We significantly reduce the linearizing quality gap and produce state-of-the-art subquadratic LLMs from Llama 3 8B and Mistral 7B v0.1, leading to 20+ points of improvement on 5-shot MMLU. * Furthermore, LoLCATs does so with only 0.2% of past methods' model parameters and 0.4% of their training tokens. * Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50x larger than prior work). * When compared with prior approaches under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between linearized and original Llama 3.1 70B and 405B LLMs by 77.8% and 78.1% on 5-shot MMLU.Timestamps* [00:02:27] Intros* [00:03:16] Why Scale Context Lengths? or work on Efficient Models* [00:06:07] The Story of SSMs* [00:09:33] Idea 1: Approximation -> Principled Modeling* [00:12:14] Idea 3: Selection* [00:15:07] Just Read Twice* [00:16:51] Idea 4: Test Time Compute* [00:17:32] Idea 2: Hardware & Kernel Support* [00:19:49] RWKV vs SSMs* [00:24:24] RWKV Arch* [00:26:15] QWRKWv6 launch* [00:30:00] What's next* [00:33:21] Hot Takes - does anyone really need long context?Transcript[00:00:00] AI Charlie: We're back at Latent Space Live, our first mini conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co host. As a special treat this week, we're recapping the best of 2024 going domain by domain. We sent out a survey to the over 900 of you who told us what you wanted, and then invited the best speakers in the Latent Space Network to cover each field.[00:00:24] AI Charlie: 200 of you joined us in person throughout the day, with over 2200 watching live online. Thanks Our next keynote covers the State of Transformers alternative architectures, with a special joint presentation with Dan Fu of Together AI and Eugene Chia of Recursal AI and Featherless AI. We've featured both Together and Recursal on the pod before, with CEO Veepal Vedprakash introducing them.[00:00:49] AI Charlie: And CTO CE Zhang joining us to talk about how they are building together together as a quote unquote full stack AI startup from the lowest level kernel and systems [00:01:00] programming to the highest level mathematical abstractions driving new model architectures and inference algorithms with notable industry contributions from Red Pajama V2, Flash Attention 3, Mamba 2, Mixture of Agents.[00:01:15] AI Charlie: Based, Sequoia, Evo, Dragonfly, Danfoo's Thunder Kittens, and many more research projects this year. As for Recursal and Featherless, we were the first podcast to feature RWKV last year, and this year the team has shipped RWKV v5, codenamed Eagle, to 1. 5 billion Windows 10 and Windows 11 machines worldwide to support Microsoft's on device, end Energy Usage Sensitive Windows Copilot Use Cases and has launched the first updates on RWKV v6, codenamed Finch and Goldfinch.[00:01:53] AI Charlie: On the morning of Latent Space Live, they also announced QRdata UKv6, a QEN32B model [00:02:00] modified with RDWKV linear attention layers. Eugene has also written the most single most popular guest post on the Latent Space blog this year. Yes, we do take guest posts on what he has discovered about the H100 GPU inference NeoCloud market since the successful launch of Featherless AI this year.[00:02:20] AI Charlie: As always, don't forget to check the show notes for the YouTube link to their talk as well as their slides. Watch out and take care.[00:02:27] Intros[00:02:27] Dan Fu: Yeah, so thanks so much for having us. So this is going to be a little bit of a two part presentation. My name is Dan. I'm at Together AI, and I'll be joining UCSD as faculty in about a year. And Eugene, you want to introduce yourself?[00:02:46] Eugene Cheah: Eugene, I lead the art activity team, and I, I'm CEO of Featherless, and we both work on this new post transformer architecture space.[00:02:55] Dan Fu: Yeah, so yeah, so today we're really excited to talk to you a little bit [00:03:00] about that. So first I'm going to give a broad overview of kind of the last few years of progress in non post transformer architectures. And then afterwards Eugene will tell us a little bit about the latest and the greatest and the latest frontier models in this space.[00:03:16] Why Scale Context Lengths? or work on Efficient Models[00:03:16] Dan Fu: So, the story starts with Scaling. So this is probably a figure or something like this that you've seen very recently. Over the last five to six years, we've seen models really scale up in parameter size, and that's brought with it a bunch of new capabilities, like the ability to talk to you and tell you sometimes how to use your Colab screens.[00:03:35] Dan Fu: But another place where we've seen scaling especially recently is scaling in context length. So this can mean Having more text inputs for your models, but it can also mean things like taking a lot of visual token inputs image inputs to your models or generating lots of outputs. And one thing that's been really exciting over the last few months or so is that we're, we're seeing scaling, not only during training time, but also [00:04:00] during test time.[00:04:00] Dan Fu: So this is one of the, the, this is the iconic image from the OpenAI 01 release. Not only are we starting to scale train time compute, but we're also starting to scale test time compute. Now if you're familiar with our attention and our transformer architectures today, this graph on the right might look a little bit scary.[00:04:19] Dan Fu: And one of the reasons is that the implications are a little bit Interesting. So what does it mean if we want to continue having smarter and smarter models? Do we just need to start building bigger, bigger data centers, spending more flops? Is this this little Dolly 3, we need more flops, guys? Is this going to be the future of all of AI?[00:04:39] Dan Fu: Or is there a better way, another path forward? Maybe we can get the same capabilities that we've gotten used to, But for a lot less compute, a lot less flops. And one of the things that we're going to talk about today is specifically looking at that core attention operator in some of these models.[00:04:57] Dan Fu: And the reason is that so this is just some, some [00:05:00] basic you know, scaling curves, but attention has compute that scales quadratically in the context length. So that means that if you're doing something like test time compute and you want to spend a bunch of tokens thinking about what comes next, the longer that that goes the, the, the more tokens you spend on that, that compute grows quadratically in that.[00:05:19] Dan Fu: One of the questions that we're interested in is, can we take that basic sequence model, that basic sequence primitive at the bottom, and get it to scale better? Can we scale in, let's say, n to the 3 halves or n log n? So in, in the first part of the talk, so we just went over the introduction. What I'm gonna do over the next few slides is just talk about some of the key advances and ideas that have shown over the past few years since maybe early 2020 to, to now that shown promise that this might actually be possible.[00:05:48] Dan Fu: That you can actually get potentially the same quality that we want while scale, while scaling better. So to do that, we're and, and basically the, the story that we're gonna look is we're gonna start to see [00:06:00] how. So this is a basic graph of just the past couple years of progress of perplexity where that blue line, that dotted blue line, is attention.[00:06:07] The Story of SSMs[00:06:07] Dan Fu: It's your basic transformer, full dense attention. And then the dots coming down are some of the methods that you'll see in this presentation today. We're going to turn the clock back all the way to 2020. So this, this, this question of can we make attention subquadratic? Basically, as soon as we said attention is all you need, People started asking this question.[00:06:28] Dan Fu: So we have this quadratic attention operator. Can we do better? I'll briefly talk about why attention is quadratic. And the basic thing that happens, if you're not familiar, is that you have these inputs, these keys and queries. And what you do in this attention matrix, this S matrix over here, is that you're using, you're comparing every token in your input to every other token.[00:06:49] Dan Fu: So when I try to do something like upload a whole book to Gemini, what happens beyond the Maybe not Gemini, because we don't necessarily know what architecture is. But let's say we upload it to LLAMA, what happens beyond [00:07:00] the scenes, behind the scenes, is that it's going to take every single word in that book and compare it to every other word.[00:07:05] Dan Fu: And this has been a really, it's, it's led to some pretty impressive things. But it's kind of a brute forcing of the way that you would try to interpret a interpret something. And what attention does in particular is the, and then what attention, sorry, don't want to. Okay, no, no laser pointer. What, what attention does afterwards is that instead of always operating in this quadratic thing, it takes a row wise softmax over this matrix, and then multiplies it by this values matrix.[00:07:32] Dan Fu: So, one of the key points to notice is that the output size is always going to be the same as the inputs, at least in standard self attention. So one of the first things that folks tried to do around 2020 is this thing called linear attention, which is just, just noticing that if we take out this softmax from here, if we take out this non linearity in the middle of the attention operation, and then if you compute the keys and the values operation first, you actually never hit this quadratic bottleneck.[00:07:57] Dan Fu: So that, that's potentially a way [00:08:00] to get a lot more computationally efficient. And there are various ways to do this by basically using feature maps or try to approximate this overall attention computation. But some of this work sort of started to hit a wall in 2020. And the basic challenges were, were two.[00:08:16] Dan Fu: So one was quality. It was back then, it was kind of hard to, to get good quality with these linear attention operators. The other one was actually hardware efficiency. So these, this feature map that was just shown by a simplify simplify here. Actually ends up being quite computationally expensive if you just implement it naively.[00:08:34] Dan Fu: So you started having these operators that not only were you sure, you're not really sure if they have the same quality, but also they're actually just wall clock slower. So you kind of end up getting the worst of both worlds. So this was the the stage. So that kind of sets the stage for four years ago.[00:08:49] Dan Fu: Keep this in mind because linear attention is actually going to come back in a few years once we have a better understanding. But one of the works that started kicking off this, this [00:09:00] mini revolution in post transformer architectures was this idea called states based model. So here the seminal work is, is one about our work queue in 2022.[00:09:09] Dan Fu: And this, this piece of work really brought together a few ideas from, from some long running research research lines of work. The first one was, and this is really one of the keys to, to closing the gap in quality was just using things that, that if you talk to a, a, an electrical engineer off the street, they might know off, off the, like the back of their hand.[00:09:33] Idea 1: Approximation -> Principled Modeling[00:09:33] Dan Fu: But taking some of those properties with how we model dynamical systems in signal processing and then using those ideas to model the inputs, the, the text tokens in, for example a transformer like Next Token Prediction Architecture. So some of those early states-based model papers were looking at this relatively, relatively simple recurrent update model that comes from maybe chapter one of a signal processing class.[00:09:59] Dan Fu: But then using [00:10:00] some principle theory about how you should do that recurrent update in order to really get the most that you can out of your hidden state, out of your out of your sequence. So that, that was one key idea for quality and. When this was eventually realized, you started to see a bunch of benchmarks that were pretty sticky for a few years.[00:10:20] Dan Fu: Things like long range arena, some long sequence evaluation benchmarks, There was stuff in time series, time series analysis. They started to, you started to see the quality tick up in meaningful ways. But the other key thing that What's so influential about these states based models is that they also had a key idea about how you can compute these things efficiently.[00:10:45] Dan Fu: So if you go back to your machine learning 101 class where you learned about RNNs, one thing that you may have learned is that they don't paralyze as well as detention, because if you just run them naively, you have to do this kind of sequential update to process new tokens, [00:11:00] whereas in attention, you can process all the tokens in parallel at one time.[00:11:04] Dan Fu: One of the key insights behind the S4 paper was that these recurrent models, you could take them and you could also formulate them as a convolution. And in particular, with a convolution, you could, instead of using a PyTorch conv1d operation, you can compute that with the FFT. And that would give you n log n compute in the in the sequence length n with an operator that was relatively well optimized for modern hardware.[00:11:28] Dan Fu: So those are really, I'd say, the two key ideas in 2022 that started allowing these breakthroughs to happen in these non transformer architectures. So, these ideas about how to principally model sorry, how to model the recurrent updates of a mo of, of a sequence in a principled way, and also these key ideas in how you can compute it efficiently by turning it into a convolution and then scaling it up with the FFT.[00:11:53] Dan Fu: Along those same lines, so afterwards we started putting out some work on specialized kernels, so just [00:12:00] like we have flash attention for transformers, we also have works like flash fft conf, and if you look at these lines of work oftentimes when, whenever you see a new architecture, you see a new primitive one of the, one of the table stakes now is, do you have an efficient kernel so that you can actually get wall clock speed up?[00:12:14] Idea 3: Selection[00:12:14] Dan Fu: So by 2022, We are starting to have these models that had promising quality primitives, but and, and also promising wall clocks. So you could actually see regimes where they were better than transformers in meaningful ways. That being said, there were, there's still sometimes a quality gap, particularly for language modeling.[00:12:33] Dan Fu: And because languages, It's so core to what we do in sequence modeling these days the, the next, the next key idea that I'm going to talk about is this idea of selection mechanisms. And this is basically an idea of, so you have this recurrent state that you're keeping around that just summarizes everything that, that came before.[00:12:50] Dan Fu: And to get a good sequence model, one of the things that you really need to be able to do is have the model learn what's the best way to pick out pieces from that recurrent [00:13:00] state. So one of the, one of the major ideas here in a line of work called H3, Hungry Hungry Hippos, and also these hyena models were One way you can do this is by just adding some simple element wise gates.[00:13:13] Dan Fu: So versions of these ideas have been around for decades. If you squint at the LSTM paper you, you can probably find, find this gating mechanism. But turns out you can take those old ideas, add them into these new. state space models, and then you can see quality start to pick up. If you've heard of the Mamba model, this also takes the selection to the next level by actually making some changes in that fundamental recurrent state space.[00:13:40] Dan Fu: So, it's not only just this gating that happens around the SSM layer, but also you can actually make The ABCD matrices of your state space model, you can make them data dependent, which will allow you to even better select out different pieces from your hidden state depending on what you're seeing. I'll also point out if you look at the [00:14:00] bottom right of this figure, there's this little triangle with a GPU SRAM, GPU HBM, and this, this is just continuing that trend of when you have a new architecture you, you, you also release it with a kernel to, to, to show that it is hardware efficient, that it, that it can be hardware efficient on modern hardware.[00:14:17] Dan Fu: The, the, one of the next cool things that happened is once we had this understanding of these are the basic pieces, these are the basic principles behind some of the sequence models linear attention actually started to come back. So in earlier this year, there was a model called BASED the, from Simran Arora and, and some other folks, that combined a more principled version of linear attention that basically the, the, the, the two second summary is that it used a Taylor approximation of the softmax attention, combined that with a simple sliding window attention and was starting to able, starting to be able to expand the Pareto frontier of how much data can you recall from your sequence, versus how small is your recurrent state size.[00:14:58] Dan Fu: So those orange dots [00:15:00] are, at the top there, are just showing smaller sequences that can recall more memory.[00:15:07] Just Read Twice[00:15:07] Dan Fu: And the last major idea I think that has been influential in this line of work and is very relatively late breaking just a few months ago, is just the basic idea that when you have these models that are fundamentally more efficient in the sequence length, you maybe don't want to prompt them or use them in exactly the same way.[00:15:26] Dan Fu: So this was a really cool paper called Just Read Twice, also from Simran. That basically said, hey, all these efficient models can process tokens so much more efficiently than transformers that they can sometimes have unfair advantages compared to a simple transformer token. So, or sorry, a simple transformer model.[00:15:44] Dan Fu: So take, for example the standard, the standard use case of you have some long document, you're going to pass it in as input, and then you're going to ask some question about it. One problem you might imagine for a recurrent model where you have a fixed state size is, let's say that [00:16:00] you're. Article is very long, and you're trying to ask about some really niche thing.[00:16:04] Dan Fu: You can imagine it might be hard for the model to know ahead of time what information to put into the hidden state. But these, these, these models are so much more efficient that you can do something really stupid, like, you can just put the document write down the document, write down the question, write down the document again, and then write down the question again, and then this time, the second time that you go over that document, you know exactly what to look for.[00:16:25] Dan Fu: And the cool thing about this is, so this is, And this this results in better quality, especially on these recall intensive tasks. But the other interesting thing is it really takes advantage of the more efficient architectures that, that we're having here. So one of the other, I think, influential ideas in this line of work is if you change the fundamental compute capabilities of your model and the way that it scales, you can actually start to query it at test time differently.[00:16:51] Idea 4: Test Time Compute[00:16:51] Dan Fu: And this actually, of course, goes back to those slides on test time compute. So while everybody's looking at, say, test time compute for big transformer models, [00:17:00] I think potentially a really interesting research question is, how can you take those and how does it change with this new next generation of models?[00:17:09] Dan Fu: So the, I'll just briefly summarize what some of those key ideas were and then talk and then show you briefly kind of what the state of the art is today. So, so the four key ideas are instead of just doing a simple linear attention approximation, instead take ideas that we know from other fields like signal processing, do a more principled approach to your modeling of the sequence.[00:17:32] Idea 2: Hardware & Kernel Support[00:17:32] Dan Fu: Another key idea throughout all these lines of work is you really want. Hardware and kernel support from day one. So, so even if your model is theoretically more efficient if somebody goes and runs it and it's two times slower one of the things that, that we've learned is that if, if you're in that situation, it's, it's just gonna be dead on arrival.[00:17:49] Dan Fu: So you want to be designing your architectures one of the key, key machine learning ideas that has been important for the quality is just making sure that you encode different ways that you can [00:18:00] select from your hidden state and, and really focus on that as a key decider of quality. And finally, I think one of the, the, the emerging new, new things for, for this line of work and something that's quite interesting is, What are the right test time paradigms for these models?[00:18:15] Dan Fu: How do they change relative to relative to what you might do for a standard transformer? I'll briefly end this section. So I've labeled this slide where we are yesterday because Eugene is going to talk about some new models that he released literally this morning. But as of yesterday, some of the really cool results out of the, these efficient alternative models were so AI2 trained this hybrid MOE called Jamba.[00:18:40] Dan Fu: That, that, that seems, that is currently the state of the art for these non transformer architectures. There's this NVIDIA and MIT put out this new diffusion model called SANA recently that one of their key key observations is that you can take a standard diffusion transformer diffusion model, replace the layers with linear [00:19:00] attention, and then that lets you scale to much larger much larger images, much, much Much larger sequences more efficiently.[00:19:07] Dan Fu: And and one thing that I don't think anybody would have called when a few years ago is that one of those gated SSM, gated states based models ended up on the cover of Science because a great group of folks went and trained some DNA models. So that's Michael Polley, Eric Yuen from from Stanford and the Arc Institute.[00:19:26] Dan Fu: So it's, we're really at an exciting time in 2024 where these non transformer, post transformer architectures are showing promise across a wide range. Across a wide range of, of modalities, of applications, and, and of tasks. And with that, I'll pass it on to Eugene, who can tell you a little bit about the latest and greatest with RWKV.[00:19:49] RWKV vs SSMs[00:19:49] Eugene Cheah: So, that's useful? Yeah. You're talking to here. Oh, I'm talking to here. Okay. So, yeah, two streams. Yeah. So, I think one common questions that we tend to get asked, right, is what's the difference between [00:20:00] RWKV and state space? So I think one of the key things to really understand, right the difference between the two groups, right, is that we are actually more like an open source, random internet meets academia kind of situation.[00:20:11] Eugene Cheah: Like, most of us never wrote any paper, but we, we basically look at RNNs and linear intention when intention is all you need came out, and then we decided to like, hey there is a quadratic scaling problem. Why don't we try fixing that instead? So, so, so we end up developing our own branch, but we end up sharing ideas back and forth.[00:20:30] Eugene Cheah: So, and, and we do all this actively in Discord, GitHub, etc. This was so bad for a few years, right, that basically, the average group's H index was so close to zero, right, Illuter. ai actually came in and helped us write our first paper. Great, now our H index is now three, apparently. So, so, so, but, but the thing is, like, a lot of these experiments led to results, and, and, essentially, essentially, we we took the same ideas from linear attention, [00:21:00] and we built on it.[00:21:01] Eugene Cheah: So, to take a step back into, like, how does RWKB handle its own attention mechanic and achieve the same goals of, like, O and compute, respectively, and in focus of our overall goal to make AI accessible to everyone, regardless of language, nation, or compute, that's our goal. We actually train our models primarily on over a hundred languages, which is another topic altogether.[00:21:23] Eugene Cheah: And our goal is to train to even 200 languages to cover all languages in the world. But at the same time, we work on this architecture, To lower the compute cost so that people can run it on Raspberry Pis and on anything. So, how did RWKB break the dependency of LSTM token flow? Because I think to understand architecture, right, it's probably easier to understand it from the RNN lens.[00:21:46] Eugene Cheah: Because that's where we built on. We all, we all state space kind of like try to, try to start anew and took lessons from that and say, So there's a little bit of divergence there. And AKA, this our version of linear attention. So to take step back [00:22:00] all foundation models, be it transformers or non transformers at a very high level, right?[00:22:05] Eugene Cheah: Pumps in the token. I mean, text that things into embeddings and go through a lot of layers. Generate a lot of states where the QKV cache or be iron in states or RW KB states. And outputs and embedding, they are not the same thing. And we just take more layers and more embeddings. And somehow that magically works.[00:22:23] Eugene Cheah: So, if you, if you remember your ancient RNN lessons which we, which we, which we we call best learning these days the general idea is that you have the embedding information flowing all the way up, and when, and you take that information and you flow it back down, and then you process it as part of your LSTM layers.[00:22:41] Eugene Cheah: So, this is how it generally works. Kapati is quoted saying that RNNs are actually unreasonably effective. The problem is this is not scalable. To start doing work on the second token, you need to wait for the first token. And then you need to, and likewise for the third token and fourth token, yada yada.[00:22:55] Eugene Cheah: That is CPU land, not GPU land. So, so, so, you [00:23:00] can have a H100 and you can't even use 1 percent of it. So, so that's kind of why RNNs didn't really take off in the direction that we wanted, like, billions of parameters when it comes to training. So, what did RDAP KV version 0 do? Boom. We just did the dumbest, lamest thing.[00:23:13] Eugene Cheah: Sorry, this is the bottleneck for RNN. We did the dumb thing of removing that line. And it kind of worked. It trained. It sucked, but it kind of worked. Then we were like, hey, then no one cared because the loss was crap, but how do we improve that? And that's essentially where we move forward, because if you see this kind of flow, right, you can actually get your GPU saturated quickly, where it essentially cascades respectively.[00:23:41] Eugene Cheah: So I'm just waiting for this to loop again. So it's like, once you get your first layer, your token to be computed finish. You start to cascade your compute all the way until you are, Hey, I'm using 100 percent of the GPU. So we, we worked on it, and we started going along the principle of that as long as we keep this general architecture [00:24:00] where, where we can cascade and, and be highly efficient with our architecture, nothing is sacred in our architecture.[00:24:06] Eugene Cheah: And we have done some crazy ideas. In fact, you ask us, if you ask me to explain some things in the paper, right, officially in the paper, I'll say we had this idea and we wrote it this way. The reality is someone came with a code, we tested it, it worked, and then we rationalized later. So, so the general[00:24:24] RWKV Arch[00:24:24] Eugene Cheah: The idea behind rwkbr is that we generally have two major blocks that we do.[00:24:30] Eugene Cheah: We call time mix and channel mix. And time mix generally handles handles long term memory states, where essentially, where essentially where we apply the matrix multiplication and Cilu activation functions into processing an input embedding and an output embedding. I'm oversimplifying it because this, This calculation changed every version and we have, like, version 7 right now.[00:24:50] Eugene Cheah: ChannelMix is similar to Base in the sense that it does shorter term attention, where it just looks at the sister token, or the token before it, because [00:25:00] there's a shift in the token shift matrix. I don't really want to go too much into the papers itself, because, like, we do have three papers on this.[00:25:09] Eugene Cheah: Basically, RWKB, RNN for the transformer, ERA, Ego and Pinch, RWKB, Matrix Value State. This is the updated version 5, version 6. And Goldfinch is our, is, is, is, is our hybrid model respectively. We are writing the paper already for V seven and which is, which is for R wk V seven. Called, named Goose, or architectures are named by Bird.[00:25:30] Eugene Cheah: And, I'm going to cover as well, qrwkb, and mama100k, and rwkb, and Where did that lead to? Great! Because we are all GPU poor and to be clear, like, most of this research is done, like, only on a handful H100s, which I had one Google researcher told me that was, like, his experiment budget for a single researcher.[00:25:48] Eugene Cheah: So, our entire organization has less compute than a single researcher in Google. So We, we, one of the things that we explored into was to how do we convert transformer models instead? Because [00:26:00] someone already paid that billion dollars, a million dollars onto training, so why don't we take advantage of those weights?[00:26:05] Eugene Cheah: And, and to, I believe, together AI worked on the lockets for, for the Lambda side of things, and, and we took some ideas from there as well, and we essentially did that for RWKB.[00:26:15] QWRKWv6 launch[00:26:15] Eugene Cheah: And that led to, Q RWKB6, which we just dropped today, a 32 bit instruct preview model, where we took the Quen 32 bit instruct model, freeze the feedforward layer, remove the QKB attention layer, and replace it with RWKB linear layers.[00:26:32] Eugene Cheah: So to be clear, this means we do not have the rwkv channel mix layer, we only have the time mix layer. But but once we do that, we train the rwkv layer. Important is that the feedforward layer needs to be frozen, so the new attention can be learned. And then we unfreeze the feedforward layer, and train all the layers together with a custom learning rate schedule, so that they can learn how to work together.[00:26:54] Eugene Cheah: The end result, surprisingly, And, to be honest, to the frustration of the R. W. [00:27:00] KV MOE team, which ended up releasing the model on the same day, was that, with just a few hours of training on two nodes, we managed to get it to be on par, kind of, with the original QUAN32B model. So, in fact, when the first run, right, that completely confused us, it was like, and I was telling Daniel Goldstein, Smirky, who kind of leads most of our research coordination, When you pitched me this idea, you told me at best you'll get the same level of performance.[00:27:26] Eugene Cheah: You didn't tell me the challenge and score and Winograd score will shoot up. I don't know what's happening there. But it did. MMLU score dropping, that was expected. Because if you think about it, when we were training all the layers, right, we were essentially Like, Frankenstein this thing, and we did brain damage to the feedforward network layer 2 with the new RWKB layers.[00:27:47] Eugene Cheah: But, 76%, hey, somehow it's retained, and we can probably further train this. We didn't even spend more than 3 days training this, so there's a lot more that can be done, hence the preview. This brings up [00:28:00] a big question, because We are already now in the process of converting to 7TB. We are now, this is actually extremely compute efficient to test our attention mechanic.[00:28:10] Eugene Cheah: It's like, it becomes a shortcut. We can, we are already planning to do our version 7 and our hybrid architecture for it. Because we don't need to train from scratch. And we get a really good model out of it. And the other thing that is uncomfortable to say is that because we are doing right now on the 70b is that if this scales correctly to 128k context length, I'm not even talking about a million 128, majority of enterprise workload today is just on 70b at under 32k context length.[00:28:41] Eugene Cheah: That means if this works and the benchmark matches it, It means we can replace the vast majority of current AI workload, unless you want super long context. And then sorry, can someone give us more GPUs? Because we do need the VRAM for super long context, sadly. So yeah, that's what we are working on, and essentially, [00:29:00] we are excited about this to just push it further.[00:29:02] Eugene Cheah: And this conversion process, to be clear, I don't think it's going to be exclusive to RWKB. It probably will work for Mamba as well, I don't see why not. And we will probably see more ideas, or more experiments, or more hybrids, or Yeah, like, one of the weirdest things that I wanted to say outright, and I confirmed this with the Black Mamba team and the Jamba team, which because we did the GoFinch hybrid model, is that none of us understand why a hard hybrid with a state based model to be R.[00:29:28] Eugene Cheah: QA state space and transformer performs better when, than the baseline of both. It's like, it's like when you train one, you expect, and then you replace, you expect the same results. That's our pitch. That's our claim. But somehow when we jam both together, it outperforms both. And that's like one area of emulation that, like, we only have four experiments, plus four teams, that a lot more needs to be done.[00:29:51] Eugene Cheah: But, but these are things that excite me, essentially, because that is what it's potentially we can move ahead for. Which brings us to what comes next.[00:30:00] What's next[00:30:00] [00:30:00][00:30:00] Dan Fu: So, this part is kind of just some, where we'll talk a little bit about stuff that, that we're excited about. Maybe have some wild speculation on, on what, what's, what's coming next.[00:30:12] Dan Fu: And, of course this is also the part that will be more open to questions. So, a couple things that, that I'm excited about is continued hardware model co design for, for these models. So one of the things that we've put out recently is this library called ThunderKittens. It's a CUDA library.[00:30:29] Dan Fu: And one of the things that, that we found frustrating is every time that we built one of these new architectures, and I'm sure you had the exact same experience, we'd have to go and spend two months in CUDA land, like writing these, these new efficient things. And. If we decided to change one thing in PyTorch, like one line of PyTorch code is like a week of CUDA code at least.[00:30:47] Dan Fu: So one of our goals with, with a library like Thunderkitten, so we, we just broke down what are the key principles, what are the key hardware things what are the key, Compute pieces that you get from the hardware. So for example on [00:31:00] H100 everything is really revolves around a warp group matrix multiply operation.[00:31:06] Dan Fu: So you really want your operation to be able to split into relatively small matrix, matrix multiply operations. So like multiplying two 64 by 64 matrices, for example. And so if you know that ahead of time when you're designing your model, that probably gives you you know, some information about how you set the state sizes, how you set the update, how you set the update function.[00:31:27] Dan Fu: So with Thunderkittens we basically built a whole library just around this basic idea that all your basic compute primitives should not be a float, but it should be a matrix, and everything should just be matrix compute. And we've been using that to, to try to both re implement some existing architectures, and also start to design code.[00:31:44] Dan Fu: Some new ones that are really designed with this core with a tensor core primitive in mind. Another thing that that we're, that at least I'm excited about is we, over the last four or five years, we've really been looking at language models as the next thing. But if you've been paying [00:32:00] attention to Twitter there's been a bunch of new next generation models that are coming out.[00:32:04] Dan Fu: So there, there are. So, video generation models that can run real time, that are supported by your mouse and your keyboard, that I'm told if you play with them that, you know, that they only have a few seconds of memory. Can we take that model, can we give it a very long context length so that you could actually maybe generate an entire game state at a time?[00:32:25] Dan Fu: What does that look like for the model? You're certainly not going to do a giant quadratic attention computation to try to run that. Maybe, maybe use some of these new models, or some of these new video generation models that came out. So Sora came out I don't know, two days ago now. But with super long queue times and super long generation times.[00:32:43] Dan Fu: So that's probably a quadratic attention operation at the, at the bottom of it. What if we could remove that and get the same quality, but a lot faster generation time? Or some of the demos that we saw from Paige earlier today. You know, if I have a super long conversation with my [00:33:00] Gemini bot, what if I wanted to remember everything that it's seen in the last week?[00:33:06] Dan Fu: I mean, maybe you don't for personal reasons, but what if I did, you know? What does that mean for the architecture? And I think, you know, that's certainly something I'm pretty excited about. I'm sure you're excited about it too. So, I think we were supposed to have some hot takes, but I honestly don't remember what our hot takes were.[00:33:21] Hot Takes - does anyone really need long context?[00:33:21] Eugene Cheah: Yeah, including the next slide. Hot takes, yes, these are our[00:33:25] Dan Fu: hot takes.[00:33:25] Eugene Cheah: I think the big one on Twitter that we saw, that we shared, was the question is like, is RAG relevant? In the case of, like, the future of, like, state based models?[00:33:38] Dan Fu: Let's see, I haven't played too much with RAG. But when I have. I'll say I found it was a little bit challenging to do research on it because we had this experience over and over again, where you could have any, an embedding model of any quality, so you could have a really, really bad embedding model, or you could have a really, really [00:34:00] good one, By any measure of good.[00:34:03] Dan Fu: And for the final RAG application, it kind of didn't matter. That's what I'll say about RAG while I'm being recorded. I know it doesn't actually answer the question, but[00:34:13] Eugene Cheah: Yeah, so I think a lot of folks are like, extremely excited of the idea of RWKB or State Space potentially having infinite context.[00:34:21] Eugene Cheah: But I think the reality is that when we say infinite context, we just mean a different kind of infinite context, or you, or as it's previously covered, you need to test the model differently. So, think of it more along the lines of the human. Like, I don't remember what I ate for breakfast yesterday.[00:34:37] Eugene Cheah: Yeah, that's the statement that I'll say. And And we humans are not quadratic transformers. If we did, if let's say we increased our brain size for every second we live, we would have exploded by the time we are 5 years old or something like that. And, and I think, I think basically fundamentally for us, right, be it whether we, regardless of whether RWKB, statespace, XLSTM, [00:35:00] etc, our general idea is that instead of that expanding state, that increase in computational cost, what if we have a fixed state size?[00:35:08] Eugene Cheah: And Information theory detects that that fixed state size will have a limit. Just how big of a limit is a question, like, we, like, RWKB is running at 40 megabytes for, for its state. Its future version might run into 400 megabytes. That is like millions of tokens in, if you're talking about mathematically, the maximum possibility.[00:35:29] Eugene Cheah: It's just that I guess we were all more inefficient about it, so maybe we hit 100, 000. And that's kind of like the work we are doing, trying to like push it and maximize it. And that's where the models will start differing, because it will choose to forget things, it will choose to remember things. And that's why I think that there might be some element of right, but it may not be the same right.[00:35:49] Eugene Cheah: It may be the model learn things, and it's like, hmm, I can't remember that, that article. Let me do a database search, to search. Just like us humans, when we can't remember the article in the company. We do a search on Notion. [00:36:00][00:36:00] Dan Fu: I think something that would be really interesting is if you could have facts that are, so right now, the one intuition about language models is that all those parameters are around just to store random facts about the world.[00:36:14] Dan Fu: And this intuition comes from the observation that if you take a really small language model, it can do things like talk to you, or kind of has like the The style of conversation, it can learn that, but where it will usually fall over compared to a much larger one is it'll just be a lot less factual about things that it knows or that it can do.[00:36:32] Dan Fu: But that points to all those weights that we're spending, all that SGD that we're spending to train these models are just being used to store facts. And we have things like databases that are pretty good at storing facts. So I think one thing that would be really interesting is if we could actually have some sort of outside data store that a language model can can look at that that maybe is you know, has has some sort of gradient descent in it, but but would be quite interesting.[00:36:58] Dan Fu: And then maybe you could edit it, delete [00:37:00] facts, you know, change who's president so that it doesn't, it doesn't get lost.[00:37:04] Vibhu: Can we open up Q& A and hot takes for the audience? I have a hot take Q& A. Do these scale? When, when 405B state space model, RAG exists, no one does long context, who's throwing in 2 million token questions, hot takes?[00:37:24] Dan Fu: The, the who's throwing in 2 million token question, I think, is, is a really good question. So I actually, I was going to offer that as a hot take. I mean, my hot take was going to be that long context doesn't matter. I know I just gave a whole talk about it, but you know, what, what's the point of doing research if you can't, you know, play both sides.[00:37:40] Dan Fu: But I think one of the, so I think for both of us, the reason that we first got into this was just from the first principled questions of there's this quadratic thing. Clearly intelligence doesn't need to be quadratic. What is going on? Can we understand it better? You know, since then it's kind of turned into a race, which has [00:38:00] been exciting to watch, like, how much context you can take in.[00:38:03] Dan Fu: But I think it's right. Nobody is actually putting in a two million context prompt into these models. And, and, you know, if they are, maybe we can go, go You know, design a better model to do that particular thing. Yeah, what do you think about that? So you've also been working on this. Do you think long context matters?[00:38:19] Eugene Cheah: So I'm going to burn a bit. How many of you remember the news of Google Gemini supporting 3 million contacts, right? Raise your hand.[00:38:28] Vibhu: Yeah, 2 million.[00:38:29] Eugene Cheah: Oh, it's 2 million.[00:38:31] Eugene Cheah: Yeah, how many of you actually tried that? See?[00:38:34] Vibhu: I use it a lot. You? You work for MindsTV. I use it a lot.[00:38:41] Eugene Cheah: So, for some people that has used, and I think, I think that's the, that's might be, like, this is where my opinion starts to differ, because I think the big labs may have a bigger role in this, because Like, even for RWKB, even when we train non contacts, the reason why I say VRAM is a problem is that because when we did the, we need to backprop [00:39:00] against the states, we actually need to maintain the state in between the tokens by the token length.[00:39:05] Eugene Cheah: So that means we need to actually roll out the whole 1 million contacts if we are actually training 1 million. Which is the same for transformers, actually, but it just means we don't magically reuse the VRAM consumption in the training time space. So that is one of the VRAM bottlenecks, and I'm neither OpenAI nor Google, so donate GPUs if you have too much of them.[00:39:27] Eugene Cheah: But then, putting it back to another paradigm, right, is that I think O1 style reasoning might be actually pushing that direction downwards. In my opinion, this is my partial hot take is that if, let's say you have a super big model, And let's say you have a 70B model that may take double the tokens, but gets the same result.[00:39:51] Eugene Cheah: Strictly speaking, a 70B, and this is even for transformer or non transformer, right? We we'll take less less resources than that 400 B [00:40:00] model, even if it did double the amount thinking. And if that's the case, and we are still all trying to figure this out, maybe the direction for us is really getting the sub 200 B to be as fast as efficient as possible.[00:40:11] Eugene Cheah: We a very efficient architecture that some folks happen to be working on to, to just reason it out over larger and larger context thing.[00:40:20] Question: Yeah. One thing I'm super interested in is. Models that can watch forever? Obviously you cannot train something on infinite context length. How are y'all thinking about that, where you run on a much longer context length than is possible to train on?[00:40:38] Dan Fu: Yeah, it's a, it's a great question. So I think when I think you guys probably had tweets along these lines, too. When we first started doing these things, because these are all recurrent models in theory you could just run it forever. You could just run it forever. And at the very least it won't, it won't like error out on your crash.[00:40:57] Dan Fu: There's another question of whether it can actually [00:41:00] use what it's seen in that infinite context. And I think there, so one place where probably the research and architectures ran faster Then another research is actually the benchmarks for long context. So you turn it on forever. You want to do everything or watch everything.[00:41:16] Dan Fu: What is it that you actually wanted to do? Can we actually build some benchmarks for that? Then measure what's happening. And then ask the question, can the models do it? Is there something else that they need? Yeah, I think that if I were to turn back the clock to 2022, that's probably one of the things I would have done differently, which would have been actually get some long context benchmarks out at the same time as we started pushing context length on all these models.[00:41:41] Eugene Cheah: I will also say the use case. So like, I think we both agree that there's no Infinite memory and the model needs to be able to learn and decide. I think what we have observed for, I think this also fits the state space model, is that one of the key advantages of this alternate attention mechanic that is not based on token position is that the model don't suddenly become crazy when you go past the [00:42:00] 8k training context tank, or a million context tank.[00:42:03] Eugene Cheah: It's actually still stable. It's still able to run, it's still able to rationalize. It just starts forgetting things. But some of these things are still there in latent memory. Some of these things are still somewhat there. That's the whole point of why reading twice works. Things like that. And one of the biggest pushes in this direction is that I think both Statespace and RWKB have Separate papers by other researchers where they use this architecture for time series data.[00:42:26] Eugene Cheah: Weather modeling. So, you are not asking what was the weather five days ago. You're asking what's the weather tomorrow based on the infinite length that we, as long as this Earth and the computer will keep running. So, so, and they found that it is like, better than existing, like, transformer or existing architecture in modeling this weather data.[00:42:47] Eugene Cheah: Control for the param size and stuff. I'm quite sure there are people with larger models. So, so there are things that, that in this case, right, there is future applications if your question is just what's next and not what's 10 years ago.[00:42:59] Dan Fu: Thanks so [00:43:00] much for having us. Get full access to Latent Space at www.latent.space/subscribe
Our 192nd episode with a summary and discussion of last week's* big AI news! *and sometimes last last week's Note: this one was recorded on 12/04 , so the news is a bit outdated... Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai Read out our text newsletter and comment on the podcast at https://lastweekin.ai/. Sponsors: The Generator - An interdisciplinary AI lab empowering innovators from all fields to bring visionary ideas to life by harnessing the capabilities of artificial intelligence. The AI safety book “Uncontrollable" which is not a doomer book, but instead lays out the reasonable case for AI safety and what we can do about it. Max TEGMARK said that “Uncontrollable” is a captivating, balanced, and remarkably up-to-date book on the most important issue of our time" - find it on Amazon today! In this episode: OpenAI launches a $200 ChatGPT Pro subscription with advanced capabilities, while Amazon unveils cost-effective Nova multimodal models at the re:Invent conference. Meta releases LLAMA 3.3 70B model, showing significant gains through post-training techniques, and Alibaba introduces QWQ, a reasoning model rivaling OpenAI's O1. Amazon collaborates with Anthropic on a massive AI supercomputer project, and Black Forest Labs eyes a $200 million funding round for growth in AI tools. New research from DeepMind's Genie 2 generates interactive 3D worlds from text and images, progressing AI's understanding of world models and interactive environments. If you would like to become a sponsor for the newsletter, podcast, or both, please fill out this form. Timestamps + Links: (00:00:00) Intro / Banter (00:02:34) Sponsor Break Tools & Apps (00:04:19) OpenAI confirms new $200 monthly subscription, which includes its o1 reasoning model (00:10:40) Amazon announces Nova, a new family of multimodal AI models (00:17:13) ElevenLabs launches GenFM to turn user content into AI-powered podcasts (00:20:21) Google's new generative AI video model is now available Applications & Business (00:23:56) Elon Musk files for injunction to halt OpenAI's transition to a for-profit (00:29:40) Amazon Is Building a Mega AI Supercomputer With Anthropic (00:34:15) It Sounds an Awful Lot Like OpenAI Is Adding Ads to ChatGPT (00:38:23) A16z in Talks to Lead $200 Million Round in Black Forest Labs, Startup Behind AI Images on Grok (00:41:10) Bezos Backs AI Chipmaker Vying With Nvidia at $2.6 Billion Value Projects & Open Source (00:45:25) Meta unveils a new, more efficient Llama model (00:50:00) Alibaba releases an ‘open' challenger to OpenAI's o1 reasoning model (00:55:21) DeMo: Decoupled Momentum Optimization (00:57:01) PRIME Intellect Releases INTELLECT-1 (Instruct + Base): The First 10B Parameter Language Model Collaboratively Trained Across the Globe (01:03:03) Tencent Launches HunyuanVideo, an Open-Source AI Video Model Research & Advancements (01:09:23) DeepMind's Genie 2 can generate interactive worlds that look like video games (01:16:43) Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding (01:20:40) Densing Law of LLMs (01:25:59) Monet: Mixture of Monosemantic Experts for Transformers Policy & Safety (01:30:56) Commerce Strengthens Export Controls to Restrict China's Capability to Produce Advanced Semiconductors for Military Applications (01:37:33) China retaliates against latest US chip restrictions (01:40:52) OpenAI Is Working With Anduril to Supply the US Military With AI (01:43:24) On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback (01:47:52) AI Safety Researcher Quits OpenAI, Saying Its Trajectory Alarms Her (01:51:52) Meta Claims AI Content Was Less than 1% of Election Misinformation (01:55:05) Outro
Is o1 Pro worth the cost? In Episode 33 of Mixture of Experts, host Tim Hwang is joined by Marina Danilevsky, Kate Soule and Vyoma Gajjar. First, the experts debrief the 12 Days of OpenAI. Next, we review some of the top papers in NeurIPS, how are the experts keeping up with all these research papers? Then, we are back with another benchmark, can ARC Prize make AGI more tractable? Finally, Meta announced the launch of Llama 3.3 70B with the promise of 405B performance, can we have our cake and eat it too? Find out more on today's Mixture of Experts!The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
20-year-old Harshandeep Singh was pushed down a flight of stairs, shot in the back, and killed just three days into a new job as a security guard. The suspect charged by Edmonton Police is a dangerous, repeat offender with a history of violent crimes. Amidst public outrage, people are questioning the role Mr. Singh's employer, the Government of Alberta, and Ottawa played in this tragedy. 4:10 | Mike Byrne, founder of Scope Safety & Security, exposes the shortcomings in training and accountability for security guards in Canada. 23:30 | Rich LaForge, Chair at ASIS Chapter 156, says under-trained, inexperienced security guards are being sent into dangerous situations across the country. He tells us who's responsible, and what needs to change. 39:00 | Ryan shares his thoughts on judicial reform in Canada, and shares comments from the Real Talk Live Chat powered by Park Power. TELL US WHAT YOU THINK: talk@ryanjespersen.com 44:30 | We ask Grande Prairie Mayor Jackie Clayton about that $70B data centre Kevin O'Leary's promising to build a half hour outside her city. Mayor Clayton claps back at critics saying it'd be tough to attract an international workforce to GP. DETAILS ON WONDER VALLEY: https://rtrj.info/121124Max 1:19:30 | Order Real Talk merch by December 16, and we'll have it at your door by Christmas! SHOP NOW: https://www.ryanjespersen.com/merch FOLLOW US ON TIKTOK, X, & INSTAGRAM: @realtalkrj JOIN US ON FACEBOOK & LINKEDIN: @ryanjespersen REAL TALK MERCH: https://ryanjespersen.com/merch RECEIVE EXCLUSIVE PERKS - BECOME A REAL TALK PATRON: patreon.com/ryanjespersen THANK YOU FOR SUPPORTING OUR SPONSORS! https://ryanjespersen.com/sponsors The views and opinions expressed in this show are those of the host and guests and do not necessarily reflect the position of Relay Communications Group Inc. or any affiliates.
Kevin O'Leary - the TV star otherwise known as "Mr. Wonderful" - says he's bringing a $70 billion data centre to northern Alberta. Premier Danielle Smith is over the moon, celebrating what would be the largest real estate investment in the province's history. But critics say there's no chance the Wonder Valley project will ever materialize. 1:55 | Max Fawcett, lead columnist at Canada's National Observer, tells us why he's not sold on Kevin O'Leary's $70B plan. READ MAX'S COLUMN: https://www.nationalobserver.com/2024/12/10/opinion/kevin-oleary-playing-alberta-70-billion-fool 33:45 | Max gives us his take Luigi Mangione (allegedly) murdering UnitedHealth CEO Brian Thompson, and what it says about the healthcare conversation in the United States. 51:09 | Jasper's chilliest season often delivers the warmest memories! We highlight holiday favourites waiting for you and your loved ones in Jasper National Park in #MyJasper Memories presented by Tourism Jasper. CHECK OUT THE JASPER PONY EXPRESS: https://www.jasper.travel/ponyexpress/ 54:00 | How serious is President-Elect Trump about Canada as the "51st state"? 1:08:45 | We ask Max what he makes of Prime Minister Justin Trudeau's GST holiday, and the ensuing division between the PMO and Minister Chrystia Freeland. 1:14:30 | Does the fall of Bashar al-Assad mean peace for Syria? Experts say not necessarily. Max tells us what he's keeping an eye on. TELL US WHAT YOU THINK: talk@ryanjespersen.com FOLLOW US ON TIKTOK, TWITTER, & INSTAGRAM: @realtalkrj JOIN US ON FACEBOOK & LINKEDIN: @ryanjespersen REAL TALK MERCH: https://ryanjespersen.com/merch RECEIVE EXCLUSIVE PERKS - BECOME A REAL TALK PATRON: patreon.com/ryanjespersen THANK YOU FOR SUPPORTING OUR SPONSORS! https://ryanjespersen.com/sponsors The views and opinions expressed in this show are those of the host and guests and do not necessarily reflect the position of Relay Communications Group Inc. or any affiliates.
Gen Z and Millennials are addicted to the unwind. More travelers are saying that they enjoy a chill day on vacation with no plans- aka more sleep. So how did this Sleep Tourism trend become a $70B industry and how can you cash in? We got ideas. Plus: Macy's pares down its shoe offerings and Tesla is back to $1T. Join our hosts Jon Weigell and Cyan Zhong as they take you through our most interesting stories of the day. Follow us on social media: TikTok: https://www.tiktok.com/@thehustle.co Instagram: https://www.instagram.com/thehustledaily/ Thank You For Listening to The Hustle Daily Show. Don't forget to hit Subscribe or Follow us on Apple Podcasts so you never miss an episode! If you want this news delivered to your inbox, join millions of others and sign up for The Hustle Daily newsletter, here: https://thehustle.co/email/ Plus! Your engagement matters to us. If you are a fan of the show, be sure to leave us a 5-Star Review on Apple Podcasts https://podcasts.apple.com/us/podcast/the-hustle-daily-show/id1606449047 (and share your favorite episodes with your friends, clients, and colleagues).
Frank Slootman turns the 'founder mode vs. manager mode' debate on its head. Frank's track record in B2B land is iconic: He took Data Domain from pre-revenues to a $2.5B acquisition by EMC. He led the IPO at ServiceNow, and when he left the company, it was worth $34B. Frank then took Snowflake public, and the company was worth over $70B when he retired earlier this year. After three successful CEO stints, Frank isn't buying Silicon Valley's fairytales about founders. His leadership style combines a manager's prowess with a founder's passion. Frank epitomizes what some might call “owner mode!” (00:07) Frank's thoughts on 'founder mode' vs. 'manager mode' (00:47) The role of non-founder managers and CEOs (09:59) How to manage effectively without micro-managing (17:11) The importance of intellectual honesty (18:32) Frank's thoughts on being 'in the arena' (21:04) What it really takes to build a viable business (28:34) Contrasting ServiceNow and Snowflake (33:40) The impact of AI on business (39:01) The future of app ecosystems (44:50) Becoming a student of leadership (46:31) Managing investor relationships (48:04) Why Frank doesn't think about his legacy (50:17) Closing Thoughts
Chcesz uruchomić modele LLM (np. Llama, Mistral czy Bielika) na własnych warunkach? W tym odcinku dowiesz się o sprzęcie, oprogramowaniu i trikach, które to ułatwią. Konkretna i praktyczna wiedza, która Ci się przyda. Oglądaj na YouTube: https://youtu.be/_OKLzmaSmg0
Can Apple Intelligence compete with the AI market offerings? In Episode 20 of Mixture of Experts, host Tim Hwang is joined by Marina Danilevsky, Kate Soule and Maya Murad. Today, the experts chat Apple Intelligence, the performance of Reflection's 70B, and a new paper released on LLMs generating novel research ideas. Additionally, IBM soft launched the Bee Agent Framework to help build agentic workflows with leading open-source and proprietary models. Tune-in to hear our expert panel break down this week's AI news.The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
Series 9, Episode 9 - Homework and Sex News How do people actually use ChatGPT? How do people use ChatGPT? We analyzed real AI chatbot conversations - The Washington Post Make AI tools to reduce teacher workloads, tech companies urged https://www.theguardian.com/education/article/2024/aug/28/make-ai-tools-to-reduce-teacher-workloads-tech-companies-urged New AI in Teams - and it's all free https://www.linkedin.com/posts/miketholfsen_ai-edtech-microsoftteams-activity-7236817645625335809-08OC https://aka.ms/TeamsEDUAIQuickGuide Research Papers Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course https://arxiv.org/abs/2407.05216 Supporting Self-Reflection at Scale with Large Language Models: Insights from Randomized Field Experiments in Classrooms https://arxiv.org/abs/2406.07571 Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam https://arxiv.org/abs/2406.09671 Generative AI Can Harm Learning https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4895486 ChatGPT, Copilot, Gemini, SciSpace and Wolfram versus higher education assessments: an updated multi-institutional study of the academic integrity impacts of Generative Artificial Intelligence (GenAI) on assessment, teaching and learning in engineering https://www.tandfonline.com/doi/epdf/10.1080/22054952.2024.2372154 How critically can an AI think? A framework for evaluating the quality of thinking of generative artificial intelligence https://arxiv.org/abs/2406.14769 Analyzing Large Language Models for Classroom Discussion Assessment https://arxiv.org/abs/2406.08680 Student Perspectives on Using a Large Language Model (LLM) for an Assignment on Professional Ethics https://arxiv.org/abs/2406.11858 ChatGPT as Research Scientist: Probing GPT's Capabilities as a Research Librarian, Research Ethicist, Data Generator and Data Predictor https://arxiv.org/abs/2406.14765 70B-parameter large language models in Japanese medical question-answering https://arxiv.org/abs/2406.14882 I don't trust you (anymore)! -- The effect of students' LLM use on Lecturer-Student-Trust in Higher Education https://arxiv.org/abs/2406.14871 Large Language Models in Student Assessment: Comparing ChatGPT and Human Graders https://arxiv.org/abs/2406.16510
AI Engineering is expanding! Join the first
Send us a text00:19 | FigureAI (humanoid robots)- AI and humanoid robots drive efficiency and will drive cost of goods/services to $0, drive unlimited GDP- $675m raise $2.6b valuation; OpenAI, Microsoft, Nvidia- 10b humanoid robots by 2040- Musk is projecting 16b to 32b humanoid robots, Telsa Optimus is a humanoid robot- humanoid robots fit easily into a human world to easily replace humans11:49 | Stripe- These companies are so big they're doing share buybacks!- Planning new tender offer to repurchase shares from employees- Entire offer financed with Stripe's own cash, a shift from external funding- Generated $615M in free cash flow in June quarter vs. $500M cash burn in 2022- Valuation at $70B (secondary), up from $50B in 2022 and $65B in last tender- Up to 8,000 employees can sell up to $50,000 of vested shares at $27.51/share- Expanding beyond core payments into billing software; segment on track for $500M annual revenue25:39 | xAI- xAI differentiators are becoming clear; real time data, most accurate answers, Musk effect (i.e. unlimited capital)- AI large language model platform business- Released Grok-2 and Grok-2 Mini beta LLMs on X platform- Enterprise API arriving later this month- Top-four position on LMSYS chatbot leaderboard- Grok-2 Mini: efficient, ideal for speed/resource-critical scenarios- Focus on expanding multimodal understanding- Available to Premium/Premium+ subscribers on X at $8/month- Secondary market valuation: $25.7B (+6.9% vs May 2024 round)
US equities were lower in Wednesday trading, though ended off session lows. Today's weakness was fairly broad-based, but there was nothing seemingly definitive behind the move with the market still in waiting mode ahead of Nvidia earnings, which were reported after the close. The treasury supply was again in focus with today's poorly-received $70B 5-year note auction, which tailed ~0.3 bp, following yesterday's solid $75B 2Y sale.
Disclaimer: We recorded this episode ~1.5 months ago, timing for the FastHTML release. It then got bottlenecked by Llama3.1, Winds of AI Winter, and SAM2 episodes, so we're a little late. Since then FastHTML was released, swyx is building an app in it for AINews, and Anthropic has also released their prompt caching API. Remember when Dylan Patel of SemiAnalysis coined the GPU Rich vs GPU Poor war? (if not, see our pod with him). The idea was that if you're GPU poor you shouldn't waste your time trying to solve GPU rich problems (i.e. pre-training large models) and are better off working on fine-tuning, optimized inference, etc. Jeremy Howard (see our “End of Finetuning” episode to catchup on his background) and Eric Ries founded Answer.AI to do exactly that: “Practical AI R&D”, which is very in-line with the GPU poor needs. For example, one of their first releases was a system based on FSDP + QLoRA that let anyone train a 70B model on two NVIDIA 4090s. Since then, they have come out with a long list of super useful projects (in no particular order, and non-exhaustive):* FSDP QDoRA: this is just as memory efficient and scalable as FSDP/QLoRA, and critically is also as accurate for continued pre-training as full weight training.* Cold Compress: a KV cache compression toolkit that lets you scale sequence length without impacting speed.* colbert-small: state of the art retriever at only 33M params* JaColBERTv2.5: a new state-of-the-art retrievers on all Japanese benchmarks.* gpu.cpp: portable GPU compute for C++ with WebGPU.* Claudette: a better Anthropic API SDK. They also recently released FastHTML, a new way to create modern interactive web apps. Jeremy recently released a 1 hour “Getting started” tutorial on YouTube; while this isn't AI related per se, but it's close to home for any AI Engineer who are looking to iterate quickly on new products: In this episode we broke down 1) how they recruit 2) how they organize what to research 3) and how the community comes together. At the end, Jeremy gave us a sneak peek at something new that he's working on that he calls dialogue engineering: So I've created a new approach. It's not called prompt engineering. I'm creating a system for doing dialogue engineering. It's currently called AI magic. I'm doing most of my work in this system and it's making me much more productive than I was before I used it.He explains it a bit more ~44:53 in the pod, but we'll just have to wait for the public release to figure out exactly what he means.Timestamps* [00:00:00] Intro by Suno AI* [00:03:02] Continuous Pre-Training is Here* [00:06:07] Schedule-Free Optimizers and Learning Rate Schedules* [00:07:08] Governance and Structural Issues within OpenAI and Other AI Labs* [00:13:01] How Answer.ai works* [00:23:40] How to Recruit Productive Researchers* [00:27:45] Building a new BERT* [00:31:57] FSDP, QLoRA, and QDoRA: Innovations in Fine-Tuning Large Models* [00:36:36] Research and Development on Model Inference Optimization* [00:39:49] FastHTML for Web Application Development* [00:46:53] AI Magic & Dialogue Engineering* [00:52:19] AI wishlist & predictionsShow Notes* Jeremy Howard* Previously on Latent Space: The End of Finetuning, NeurIPS Startups* Answer.ai* Fast.ai* FastHTML* answerai-colbert-small-v1* gpu.cpp* Eric Ries* Aaron DeFazio* Yi Tai* Less Wright* Benjamin Warner* Benjamin Clavié* Jono Whitaker* Austin Huang* Eric Gilliam* Tim Dettmers* Colin Raffel* Sebastian Raschka* Carson Gross* Simon Willison* Sepp Hochreiter* Llama3.1 episode* Snowflake Arctic* Ranger Optimizer* Gemma.cpp* HTMX* UL2* BERT* DeBERTa* Efficient finetuning of Llama 3 with FSDP QDoRA* xLSTMTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:14]: And today we're back with Jeremy Howard, I think your third appearance on Latent Space. Welcome.Jeremy [00:00:19]: Wait, third? Second?Swyx [00:00:21]: Well, I grabbed you at NeurIPS.Jeremy [00:00:23]: I see.Swyx [00:00:24]: Very fun, standing outside street episode.Jeremy [00:00:27]: I never heard that, by the way. You've got to send me a link. I've got to hear what it sounded like.Swyx [00:00:30]: Yeah. Yeah, it's a NeurIPS podcast.Alessio [00:00:32]: I think the two episodes are six hours, so there's plenty to listen, we'll make sure to send it over.Swyx [00:00:37]: Yeah, we're trying this thing where at the major ML conferences, we, you know, do a little audio tour of, give people a sense of what it's like. But the last time you were on, you declared the end of fine tuning. I hope that I sort of editorialized the title a little bit, and I know you were slightly uncomfortable with it, but you just own it anyway. I think you're very good at the hot takes. And we were just discussing in our pre-show that it's really happening, that the continued pre-training is really happening.Jeremy [00:01:02]: Yeah, absolutely. I think people are starting to understand that treating the three ULM FIT steps of like pre-training, you know, and then the kind of like what people now call instruction tuning, and then, I don't know if we've got a general term for this, DPO, RLHFE step, you know, or the task training, they're not actually as separate as we originally suggested they were in our paper, and when you treat it more as a continuum, and that you make sure that you have, you know, more of kind of the original data set incorporated into the later stages, and that, you know, we've also seen with LLAMA3, this idea that those later stages can be done for a lot longer. These are all of the things I was kind of trying to describe there. It wasn't the end of fine tuning, but more that we should treat it as a continuum, and we should have much higher expectations of how much you can do with an already trained model. You can really add a lot of behavior to it, you can change its behavior, you can do a lot. So a lot of our research has been around trying to figure out how to modify the model by a larger amount rather than starting from random weights, because I get very offended at the idea of starting from random weights.Swyx [00:02:14]: Yeah, I saw that in ICLR in Vienna, there was an outstanding paper about starting transformers from data-driven piers. I don't know if you saw that one, they called it sort of never trained from scratch, and I think it was kind of rebelling against like the sort of random initialization.Jeremy [00:02:28]: Yeah, I've, you know, that's been our kind of continuous message since we started Fast AI, is if you're training for random weights, you better have a really good reason, you know, because it seems so unlikely to me that nobody has ever trained on data that has any similarity whatsoever to the general class of data you're working with, and that's the only situation in which I think starting from random weights makes sense.Swyx [00:02:51]: The other trends since our last pod that I would point people to is I'm seeing a rise in multi-phase pre-training. So Snowflake released a large model called Snowflake Arctic, where they detailed three phases of training where they had like a different mixture of like, there was like 75% web in the first instance, and then they reduced the percentage of the web text by 10% each time and increased the amount of code in each phase. And I feel like multi-phase is being called out in papers more. I feel like it's always been a thing, like changing data mix is not something new, but calling it a distinct phase is new, and I wonder if there's something that you're seeingJeremy [00:03:32]: on your end. Well, so they're getting there, right? So the point at which they're doing proper continued pre-training is the point at which that becomes a continuum rather than a phase. So the only difference with what I was describing last time is to say like, oh, there's a function or whatever, which is happening every batch. It's not a huge difference. You know, I always used to get offended when people had learning rates that like jumped. And so one of the things I started doing early on in Fast.ai was to say to people like, no, you should actually have your learning rate schedule should be a function, not a list of numbers. So now I'm trying to give the same idea about training mix.Swyx [00:04:07]: There's been pretty public work from Meta on schedule-free optimizers. I don't know if you've been following Aaron DeFazio and what he's doing, just because you mentioned learning rate schedules, you know, what if you didn't have a schedule?Jeremy [00:04:18]: I don't care very much, honestly. I don't think that schedule-free optimizer is that exciting. It's fine. We've had non-scheduled optimizers for ages, like Less Wright, who's now at Meta, who was part of the Fast.ai community there, created something called the Ranger optimizer. I actually like having more hyperparameters. You know, as soon as you say schedule-free, then like, well, now I don't get to choose. And there isn't really a mathematically correct way of, like, I actually try to schedule more parameters rather than less. So like, I like scheduling my epsilon in my atom, for example. I schedule all the things. But then the other thing we always did with the Fast.ai library was make it so you don't have to set any schedules. So Fast.ai always supported, like, you didn't even have to pass a learning rate. Like, it would always just try to have good defaults and do the right thing. But to me, I like to have more parameters I can play with if I want to, but you don't have to.Alessio [00:05:08]: And then the more less technical side, I guess, of your issue, I guess, with the market was some of the large research labs taking all this innovation kind of behind closed doors and whether or not that's good, which it isn't. And now we could maybe make it more available to people. And then a month after we released the episode, there was the whole Sam Altman drama and like all the OpenAI governance issues. And maybe people started to think more, okay, what happens if some of these kind of labs, you know, start to break from within, so to speak? And the alignment of the humans is probably going to fall before the alignment of the models. So I'm curious, like, if you have any new thoughts and maybe we can also tie in some of the way that we've been building Answer as like a public benefit corp and some of those aspects.Jeremy [00:05:51]: Sure. So, yeah, I mean, it was kind of uncomfortable because two days before Altman got fired, I did a small public video interview in which I said, I'm quite sure that OpenAI's current governance structure can't continue and that it was definitely going to fall apart. And then it fell apart two days later and a bunch of people were like, what did you know, Jeremy?Alessio [00:06:13]: What did Jeremy see?Jeremy [00:06:15]: I didn't see anything. It's just obviously true. Yeah. So my friend Eric Ries and I spoke a lot before that about, you know, Eric's, I think probably most people would agree, the top expert in the world on startup and AI governance. And you know, we could both clearly see that this didn't make sense to have like a so-called non-profit where then there are people working at a company, a commercial company that's owned by or controlled nominally by the non-profit, where the people in the company are being given the equivalent of stock options, like everybody there was working there with expecting to make money largely from their equity. So the idea that then a board could exercise control by saying like, oh, we're worried about safety issues and so we're going to do something that decreases the profit of the company, when every stakeholder in the company, their remuneration pretty much is tied to their profit, it obviously couldn't work. So I mean, that was a huge oversight there by someone. I guess part of the problem is that the kind of people who work at non-profits and in this case the board, you know, who are kind of academics and, you know, people who are kind of true believers. I think it's hard for them to realize that 99.999% of the world is driven very heavily by money, especially huge amounts of money. So yeah, Eric and I had been talking for a long time before that about what could be done differently, because also companies are sociopathic by design and so the alignment problem as it relates to companies has not been solved. Like, companies become huge, they devour their founders, they devour their communities and they do things where even the CEOs, you know, often of big companies tell me like, I wish our company didn't do that thing. You know, I know that if I didn't do it, then I would just get fired and the board would put in somebody else and the board knows if they don't do it, then their shareholders can sue them because they're not maximizing profitability or whatever. So what Eric's spent a lot of time doing is trying to think about how do we make companies less sociopathic, you know, how to, or more, you know, maybe a better way to think of it is like, how do we make it so that the founders of companies can ensure that their companies continue to actually do the things they want them to do? You know, when we started a company, hey, we very explicitly decided we got to start a company, not a academic lab, not a nonprofit, you know, we created a Delaware Seacorp, you know, the most company kind of company. But when we did so, we told everybody, you know, including our first investors, which was you Alessio. They sound great. We are going to run this company on the basis of maximizing long-term value. And in fact, so when we did our second round, which was an angel round, we had everybody invest through a long-term SPV, which we set up where everybody had to agree to vote in line with long-term value principles. So like never enough just to say to people, okay, we're trying to create long-term value here for society as well as for ourselves and everybody's like, oh, yeah, yeah, I totally agree with that. But when it comes to like, okay, well, here's a specific decision we have to make, which will not maximize short-term value, people suddenly change their mind. So you know, it has to be written into the legal documents of everybody so that no question that that's the way the company has to be managed. So then you mentioned the PBC aspect, Public Benefit Corporation, which I never quite understood previously. And turns out it's incredibly simple, like it took, you know, like one paragraph added to our corporate documents to become a PBC. It was cheap, it was easy, but it's got this huge benefit, which is if you're not a public benefit corporation, then somebody can come along and offer to buy you with a stated description of like turning your company into the thing you most hate, right? And if they offer you more than the market value of your company and you don't accept it, then you are not necessarily meeting the kind of your fiduciary responsibilities. So the way like Eric always described it to me is like, if Philip Morris came along and said that you've got great technology for marketing cigarettes to children, so we're going to pivot your company to do that entirely, and we're going to pay you 50% more than the market value, you're going to have to say yes. If you have a PBC, then you are more than welcome to say no, if that offer is not in line with your stated public benefit. So our stated public benefit is to maximize the benefit to society through using AI. So given that more children smoking doesn't do that, then we can say like, no, we're not selling to you.Alessio [00:11:01]: I was looking back at some of our emails. You sent me an email on November 13th about talking and then on the 14th, I sent you an email working together to free AI was the subject line. And then that was kind of the start of the C round. And then two days later, someone got fired. So you know, you were having these thoughts even before we had like a public example of like why some of the current structures didn't work. So yeah, you were very ahead of the curve, so to speak. You know, people can read your awesome introduction blog and answer and the idea of having a R&D lab versus our lab and then a D lab somewhere else. I think to me, the most interesting thing has been hiring and some of the awesome people that you've been bringing on that maybe don't fit the central casting of Silicon Valley, so to speak. Like sometimes I got it like playing baseball cards, you know, people are like, oh, what teams was this person on, where did they work versus focusing on ability. So I would love for you to give a shout out to some of the awesome folks that you have on the team.Jeremy [00:11:58]: So, you know, there's like a graphic going around describing like the people at XAI, you know, Elon Musk thing. And like they are all connected to like multiple of Stanford, Meta, DeepMind, OpenAI, Berkeley, Oxford. Look, these are all great institutions and they have good people. And I'm definitely not at all against that, but damn, there's so many other people. And one of the things I found really interesting is almost any time I see something which I think like this is really high quality work and it's something I don't think would have been built if that person hadn't built the thing right now, I nearly always reach out to them and ask to chat. And I tend to dig in to find out like, okay, you know, why did you do that thing? Everybody else has done this other thing, your thing's much better, but it's not what other people are working on. And like 80% of the time, I find out the person has a really unusual background. So like often they'll have like, either they like came from poverty and didn't get an opportunity to go to a good school or had dyslexia and, you know, got kicked out of school in year 11, or they had a health issue that meant they couldn't go to university or something happened in their past and they ended up out of the mainstream. And then they kind of succeeded anyway. Those are the people that throughout my career, I've tended to kind of accidentally hire more of, but it's not exactly accidentally. It's like when I see somebody who's done, two people who have done extremely well, one of them did extremely well in exactly the normal way from the background entirely pointing in that direction and they achieved all the hurdles to get there. And like, okay, that's quite impressive, you know, but another person who did just as well, despite lots of constraints and doing things in really unusual ways and came up with different approaches. That's normally the person I'm likely to find useful to work with because they're often like risk-takers, they're often creative, they're often extremely tenacious, they're often very open-minded. So that's the kind of folks I tend to find myself hiring. So now at Answer.ai, it's a group of people that are strong enough that nearly every one of them has independently come to me in the past few weeks and told me that they have imposter syndrome and they're not convinced that they're good enough to be here. And I kind of heard it at the point where I was like, okay, I don't think it's possible that all of you are so far behind your peers that you shouldn't get to be here. But I think part of the problem is as an R&D lab, the great developers look at the great researchers and they're like, wow, these big-brained, crazy research people with all their math and s**t, they're too cool for me, oh my God. And then the researchers look at the developers and they're like, oh, they're killing it, making all this stuff with all these people using it and talking on Twitter about how great it is. I think they're both a bit intimidated by each other, you know. And so I have to kind of remind them like, okay, there are lots of things in this world where you suck compared to lots of other people in this company, but also vice versa, you know, for all things. And the reason you came here is because you wanted to learn about those other things from those other people and have an opportunity to like bring them all together into a single unit. You know, it's not reasonable to expect you're going to be better at everything than everybody else. I guess the other part of it is for nearly all of the people in the company, to be honest, they have nearly always been better than everybody else at nearly everything they're doing nearly everywhere they've been. So it's kind of weird to be in this situation now where it's like, gee, I can clearly see that I suck at this thing that I'm meant to be able to do compared to these other people where I'm like the worst in the company at this thing for some things. So I think that's a healthy place to be, you know, as long as you keep reminding each other about that's actually why we're here. And like, it's all a bit of an experiment, like we don't have any managers. We don't have any hierarchy from that point of view. So for example, I'm not a manager, which means I don't get to tell people what to do or how to do it or when to do it. Yeah, it's been a bit of an experiment to see how that would work out. And it's been great. So for instance, Ben Clavier, who you might have come across, he's the author of Ragatouille, he's the author of Rerankers, super strong information retrieval guy. And a few weeks ago, you know, this additional channel appeared on Discord, on our private Discord called Bert24. And these people started appearing, as in our collab sections, we have a collab section for like collaborating with outsiders. And these people started appearing, there are all these names that I recognize, like Bert24, and they're all talking about like the next generation of Bert. And I start following along, it's like, okay, Ben decided that I think, quite rightly, we need a new Bert. Because everybody, like so many people are still using Bert, and it's still the best at so many things, but it actually doesn't take advantage of lots of best practices. And so he just went out and found basically everybody who's created better Berts in the last four or five years, brought them all together, suddenly there's this huge collaboration going on. So yeah, I didn't tell him to do that. He didn't ask my permission to do that. And then, like, Benjamin Warner dived in, and he's like, oh, I created a whole transformers from scratch implementation designed to be maximally hackable. He originally did it largely as a teaching exercise to show other people, but he was like, I could, you know, use that to create a really hackable BERT implementation. In fact, he didn't say that. He said, I just did do that, you know, and I created a repo, and then everybody's like starts using it. They're like, oh my god, this is amazing. I can now implement all these other BERT things. And it's not just answer AI guys there, you know, there's lots of folks, you know, who have like contributed new data set mixes and blah, blah, blah. So, I mean, I can help in the same way that other people can help. So like, then Ben Clavier reached out to me at one point and said, can you help me, like, what have you learned over time about how to manage intimidatingly capable and large groups of people who you're nominally meant to be leading? And so, you know, I like to try to help, but I don't direct. Another great example was Kerem, who, after our FSTP QLORA work, decided quite correctly that it didn't really make sense to use LoRa in today's world. You want to use the normalized version, which is called Dora. Like two or three weeks after we did FSTP QLORA, he just popped up and said, okay, I've just converted the whole thing to Dora, and I've also created these VLLM extensions, and I've got all these benchmarks, and, you know, now I've got training of quantized models with adapters that are as fast as LoRa, and as actually better than, weirdly, fine tuning. Just like, okay, that's great, you know. And yeah, so the things we've done to try to help make these things happen as well is we don't have any required meetings, you know, but we do have a meeting for each pair of major time zones that everybody's invited to, and, you know, people see their colleagues doing stuff that looks really cool and say, like, oh, how can I help, you know, or how can I learn or whatever. So another example is Austin, who, you know, amazing background. He ran AI at Fidelity, he ran AI at Pfizer, he ran browsing and retrieval for Google's DeepMind stuff, created Jemma.cpp, and he's been working on a new system to make it easier to do web GPU programming, because, again, he quite correctly identified, yeah, so I said to him, like, okay, I want to learn about that. Not an area that I have much expertise in, so, you know, he's going to show me what he's working on and teach me a bit about it, and hopefully I can help contribute. I think one of the key things that's happened in all of these is everybody understands what Eric Gilliam, who wrote the second blog post in our series, the R&D historian, describes as a large yard with narrow fences. Everybody has total flexibility to do what they want. We all understand kind of roughly why we're here, you know, we agree with the premises around, like, everything's too expensive, everything's too complicated, people are building too many vanity foundation models rather than taking better advantage of fine-tuning, like, there's this kind of general, like, sense of we're all on the same wavelength about, you know, all the ways in which current research is fucked up, and, you know, all the ways in which we're worried about centralization. We all care a lot about not just research for the point of citations, but research that actually wouldn't have happened otherwise, and actually is going to lead to real-world outcomes. And so, yeah, with this kind of, like, shared vision, people understand, like, you know, so when I say, like, oh, well, you know, tell me, Ben, about BERT 24, what's that about? And he's like, you know, like, oh, well, you know, you can see from an accessibility point of view, or you can see from a kind of a actual practical impact point of view, there's far too much focus on decoder-only models, and, you know, like, BERT's used in all of these different places and industry, and so I can see, like, in terms of our basic principles, what we're trying to achieve, this seems like something important. And so I think that's, like, a really helpful that we have that kind of shared perspective, you know?Alessio [00:21:14]: Yeah. And before we maybe talk about some of the specific research, when you're, like, reaching out to people, interviewing them, what are some of the traits, like, how do these things come out, you know, usually? Is it working on side projects that you, you know, you're already familiar with? Is there anything, like, in the interview process that, like, helps you screen for people that are less pragmatic and more research-driven versus some of these folks that are just gonna do it, you know? They're not waiting for, like, the perfect process.Jeremy [00:21:40]: Everybody who comes through the recruiting is interviewed by everybody in the company. You know, our goal is 12 people, so it's not an unreasonable amount. So the other thing to say is everybody so far who's come into the recruiting pipeline, everybody bar one, has been hired. So which is to say our original curation has been good. And that's actually pretty easy, because nearly everybody who's come in through the recruiting pipeline are people I know pretty well. So Jono Whitaker and I, you know, he worked on the stable diffusion course we did. He's outrageously creative and talented, and he's super, like, enthusiastic tinkerer, just likes making things. Benjamin was one of the strongest parts of the fast.ai community, which is now the alumni. It's, like, hundreds of thousands of people. And you know, again, like, they're not people who a normal interview process would pick up, right? So Benjamin doesn't have any qualifications in math or computer science. Jono was living in Zimbabwe, you know, he was working on, like, helping some African startups, you know, but not FAANG kind of credentials. But yeah, I mean, when you actually see people doing real work and they stand out above, you know, we've got lots of Stanford graduates and open AI people and whatever in our alumni community as well. You know, when you stand out above all of those people anyway, obviously you've got something going for you. You know, Austin, him and I worked together on the masks study we did in the proceeding at the National Academy of Science. You know, we had worked together, and again, that was a group of, like, basically the 18 or 19 top experts in the world on public health and epidemiology and research design and so forth. And Austin, you know, one of the strongest people in that collaboration. So yeah, you know, like, I've been lucky enough to have had opportunities to work with some people who are great and, you know, I'm a very open-minded person, so I kind of am always happy to try working with pretty much anybody and some people stand out. You know, there have been some exceptions, people I haven't previously known, like Ben Clavier, actually, I didn't know before. But you know, with him, you just read his code, and I'm like, oh, that's really well-written code. And like, it's not written exactly the same way as everybody else's code, and it's not written to do exactly the same thing as everybody else's code. So yeah, and then when I chatted to him, it's just like, I don't know, I felt like we'd known each other for years, like we just were on the same wavelength, but I could pretty much tell that was going to happen just by reading his code. I think you express a lot in the code you choose to write and how you choose to write it, I guess. You know, or another example, a guy named Vic, who was previously the CEO of DataQuest, and like, in that case, you know, he's created a really successful startup. He won the first, basically, Kaggle NLP competition, which was automatic essay grading. He's got the current state-of-the-art OCR system, Surya. Again, he's just a guy who obviously just builds stuff, you know, he doesn't ask for permission, he doesn't need any, like, external resources. Actually, Karim's another great example of this, I mean, I already knew Karim very well because he was my best ever master's student, but it wasn't a surprise to me then when he then went off to create the world's state-of-the-art language model in Turkish on his own, in his spare time, with no budget, from scratch. This is not fine-tuning or whatever, he, like, went back to Common Crawl and did everything. Yeah, it's kind of, I don't know what I'd describe that process as, but it's not at all based on credentials.Swyx [00:25:17]: Assemble based on talent, yeah. We wanted to dive in a little bit more on, you know, turning from the people side of things into the technical bets that you're making. Just a little bit more on Bert. I was actually, we just did an interview with Yi Tay from Reka, I don't know if you're familiar with his work, but also another encoder-decoder bet, and one of his arguments was actually people kind of over-index on the decoder-only GPT-3 type paradigm. I wonder if you have thoughts there that is maybe non-consensus as well. Yeah, no, absolutely.Jeremy [00:25:45]: So I think it's a great example. So one of the people we're collaborating with a little bit with BERT24 is Colin Raffle, who is the guy behind, yeah, most of that stuff, you know, between that and UL2, there's a lot of really interesting work. And so one of the things I've been encouraging the BERT group to do, Colin has as well, is to consider using a T5 pre-trained encoder backbone as a thing you fine-tune, which I think would be really cool. You know, Colin was also saying actually just use encoder-decoder as your Bert, you know, why don't you like use that as a baseline, which I also think is a good idea. Yeah, look.Swyx [00:26:25]: What technical arguments are people under-weighting?Jeremy [00:26:27]: I mean, Colin would be able to describe this much better than I can, but I'll give my slightly non-expert attempt. Look, I mean, think about like diffusion models, right? Like in stable diffusion, like we use things like UNet. You have this kind of downward path and then in the upward path you have the cross connections, which it's not a tension, but it's like a similar idea, right? You're inputting the original encoding path into your decoding path. It's critical to make it work, right? Because otherwise in the decoding part, the model has to do so much kind of from scratch. So like if you're doing translation, like that's a classic kind of encoder-decoder example. If it's decoder only, you never get the opportunity to find the right, you know, feature engineering, the right feature encoding for the original sentence. And it kind of means then on every token that you generate, you have to recreate the whole thing, you know? So if you have an encoder, it's basically saying like, okay, this is your opportunity model to create a really useful feature representation for your input information. So I think there's really strong arguments for encoder-decoder models anywhere that there is this kind of like context or source thing. And then why encoder only? Well, because so much of the time what we actually care about is a classification, you know? It's like an output. It's like generating an arbitrary length sequence of tokens. So anytime you're not generating an arbitrary length sequence of tokens, decoder models don't seem to make much sense. Now the interesting thing is, you see on like Kaggle competitions, that decoder models still are at least competitive with things like Deberta v3. They have to be way bigger to be competitive with things like Deberta v3. And the only reason they are competitive is because people have put a lot more time and money and effort into training the decoder only ones, you know? There isn't a recent Deberta. There isn't a recent Bert. Yeah, it's a whole part of the world that people have slept on a little bit. And this is just what happens. This is how trends happen rather than like, to me, everybody should be like, oh, let's look at the thing that has shown signs of being useful in the past, but nobody really followed up with properly. That's the more interesting path, you know, where people tend to be like, oh, I need to get citations. So what's everybody else doing? Can I make it 0.1% better, you know, or 0.1% faster? That's what everybody tends to do. Yeah. So I think it's like, Itay's work commercially now is interesting because here's like a whole, here's a whole model that's been trained in a different way. So there's probably a whole lot of tasks it's probably better at than GPT and Gemini and Claude. So that should be a good commercial opportunity for them if they can figure out what those tasks are.Swyx [00:29:07]: Well, if rumors are to be believed, and he didn't comment on this, but, you know, Snowflake may figure out the commercialization for them. So we'll see.Jeremy [00:29:14]: Good.Alessio [00:29:16]: Let's talk about FSDP, Qlora, Qdora, and all of that awesome stuff. One of the things we talked about last time, some of these models are meant to run on systems that nobody can really own, no single person. And then you were like, well, what if you could fine tune a 70B model on like a 4090? And I was like, no, that sounds great, Jeremy, but like, can we actually do it? And then obviously you all figured it out. Can you maybe tell us some of the worst stories behind that, like the idea behind FSDP, which is kind of taking sharded data, parallel computation, and then Qlora, which is do not touch all the weights, just go quantize some of the model, and then within the quantized model only do certain layers instead of doing everything.Jeremy [00:29:57]: Well, do the adapters. Yeah.Alessio [00:29:59]: Yeah. Yeah. Do the adapters. Yeah. I will leave the floor to you. I think before you published it, nobody thought this was like a short term thing that we're just going to have. And now it's like, oh, obviously you can do it, but it's not that easy.Jeremy [00:30:12]: Yeah. I mean, to be honest, it was extremely unpleasant work to do. It's like not at all enjoyable. I kind of did version 0.1 of it myself before we had launched the company, or at least the kind of like the pieces. They're all pieces that are difficult to work with, right? So for the quantization, you know, I chatted to Tim Detmers quite a bit and, you know, he very much encouraged me by saying like, yeah, it's possible. He actually thought it'd be easy. It probably would be easy for him, but I'm not Tim Detmers. And, you know, so he wrote bits and bytes, which is his quantization library. You know, he wrote that for a paper. He didn't write that to be production like code. It's now like everybody's using it, at least the CUDA bits. So like, it's not particularly well structured. There's lots of code paths that never get used. There's multiple versions of the same thing. You have to try to figure it out. So trying to get my head around that was hard. And you know, because the interesting bits are all written in CUDA, it's hard to like to step through it and see what's happening. And then, you know, FSTP is this very complicated library and PyTorch, which not particularly well documented. So the only really, really way to understand it properly is again, just read the code and step through the code. And then like bits and bytes doesn't really work in practice unless it's used with PEF, the HuggingFace library and PEF doesn't really work in practice unless you use it with other things. And there's a lot of coupling in the HuggingFace ecosystem where like none of it works separately. You have to use it all together, which I don't love. So yeah, trying to just get a minimal example that I can play with was really hard. And so I ended up having to rewrite a lot of it myself to kind of create this like minimal script. One thing that helped a lot was Medec had this LlamaRecipes repo that came out just a little bit before I started working on that. And like they had a kind of role model example of like, here's how to train FSTP, LoRa, didn't work with QLoRa on Llama. A lot of the stuff I discovered, the interesting stuff would be put together by Les Wright, who's, he was actually the guy in the Fast.ai community I mentioned who created the Ranger Optimizer. So he's doing a lot of great stuff at Meta now. So yeah, I kind of, that helped get some minimum stuff going and then it was great once Benjamin and Jono joined full time. And so we basically hacked at that together and then Kerim joined like a month later or something. And it was like, gee, it was just a lot of like fiddly detailed engineering on like barely documented bits of obscure internals. So my focus was to see if it kind of could work and I kind of got a bit of a proof of concept working and then the rest of the guys actually did all the work to make it work properly. And, you know, every time we thought we had something, you know, we needed to have good benchmarks, right? So we'd like, it's very easy to convince yourself you've done the work when you haven't, you know, so then we'd actually try lots of things and be like, oh, and these like really important cases, the memory use is higher, you know, or it's actually slower. And we'd go in and we just find like all these things that were nothing to do with our library that just didn't work properly. And nobody had noticed they hadn't worked properly because nobody had really benchmarked it properly. So we ended up, you know, trying to fix a whole lot of different things. And even as we did so, new regressions were appearing in like transformers and stuff that Benjamin then had to go away and figure out like, oh, how come flash attention doesn't work in this version of transformers anymore with this set of models and like, oh, it turns out they accidentally changed this thing, so it doesn't work. You know, there's just, there's not a lot of really good performance type evals going on in the open source ecosystem. So there's an extraordinary amount of like things where people say like, oh, we built this thing and it has this result. And when you actually check it, so yeah, there's a shitload of war stories from getting that thing to work. And it did require a particularly like tenacious group of people and a group of people who don't mind doing a whole lot of kind of like really janitorial work, to be honest, to get the details right, to check them. Yeah.Alessio [00:34:09]: We had a trade out on the podcast and we talked about how a lot of it is like systems work to make some of these things work. It's not just like beautiful, pure math that you do on a blackboard. It's like, how do you get into the nitty gritty?Jeremy [00:34:22]: I mean, flash attention is a great example of that. Like it's, it basically is just like, oh, let's just take the attention and just do the tiled version of it, which sounds simple enough, you know, but then implementing that is challenging at lots of levels.Alessio [00:34:36]: Yeah. What about inference? You know, obviously you've done all this amazing work on fine tuning. Do you have any research you've been doing on the inference side, how to make local inference really fast on these models too?Jeremy [00:34:47]: We're doing quite a bit on that at the moment. We haven't released too much there yet. But one of the things I've been trying to do is also just to help other people. And one of the nice things that's happened is that a couple of folks at Meta, including Mark Seraphim, have done a nice job of creating this CUDA mode community of people working on like CUDA kernels or learning about that. And I tried to help get that going well as well and did some lessons to help people get into it. So there's a lot going on in both inference and fine tuning performance. And a lot of it's actually happening kind of related to that. So PyTorch team have created this Torch AO project on quantization. And so there's a big overlap now between kind of the FastAI and AnswerAI and CUDA mode communities of people working on stuff for both inference and fine tuning. But we're getting close now. You know, our goal is that nobody should be merging models, nobody should be downloading merged models, everybody should be using basically quantized plus adapters for almost everything and just downloading the adapters. And that should be much faster. So that's kind of the place we're trying to get to. It's difficult, you know, because like Karim's been doing a lot of work with VLM, for example. These inference engines are pretty complex bits of code. They have a whole lot of custom kernel stuff going on as well, as do the quantization libraries. So we've been working on, we're also quite a bit of collaborating with the folks who do HQQ, which is a really great quantization library and works super well. So yeah, there's a lot of other people outside AnswerAI that we're working with a lot who are really helping on all this performance optimization stuff, open source.Swyx [00:36:27]: Just to follow up on merging models, I picked up there that you said nobody should be merging models. That's interesting because obviously a lot of people are experimenting with this and finding interesting results. I would say in defense of merging models, you can do it without data. That's probably the only thing that's going for it.Jeremy [00:36:45]: To explain, it's not that you shouldn't merge models. You shouldn't be distributing a merged model. You should distribute a merged adapter 99% of the time. And actually often one of the best things happening in the model merging world is actually that often merging adapters works better anyway. The point is, Sean, that once you've got your new model, if you distribute it as an adapter that sits on top of a quantized model that somebody's already downloaded, then it's a much smaller download for them. And also the inference should be much faster because you're not having to transfer FB16 weights from HPM memory at all or ever load them off disk. You know, all the main weights are quantized and the only floating point weights are in the adapters. So that should make both inference and fine tuning faster. Okay, perfect.Swyx [00:37:33]: We're moving on a little bit to the rest of the fast universe. I would have thought that, you know, once you started Answer.ai, that the sort of fast universe would be kind of on hold. And then today you just dropped Fastlight and it looks like, you know, there's more activity going on in sort of Fastland.Jeremy [00:37:49]: Yeah. So Fastland and Answerland are not really distinct things. Answerland is kind of like the Fastland grown up and funded. They both have the same mission, which is to maximize the societal benefit of AI broadly. We want to create thousands of commercially successful products at Answer.ai. And we want to do that with like 12 people. So that means we need a pretty efficient stack, you know, like quite a few orders of magnitude more efficient, not just for creation, but for deployment and maintenance than anything that currently exists. People often forget about the D part of our R&D firm. So we've got to be extremely good at creating, deploying and maintaining applications, not just models. Much to my horror, the story around creating web applications is much worse now than it was 10 or 15 years ago in terms of, if I say to a data scientist, here's how to create and deploy a web application, you know, either you have to learn JavaScript or TypeScript and about all the complex libraries like React and stuff, and all the complex like details around security and web protocol stuff around how you then talk to a backend and then all the details about creating the backend. You know, if that's your job and, you know, you have specialists who work in just one of those areas, it is possible for that to all work. But compared to like, oh, write a PHP script and put it in the home directory that you get when you sign up to this shell provider, which is what it was like in the nineties, you know, here are those 25 lines of code and you're done and now you can pass that URL around to all your friends, or put this, you know, .pl file inside the CGI bin directory that you got when you signed up to this web host. So yeah, the thing I've been mainly working on the last few weeks is fixing all that. And I think I fixed it. I don't know if this is an announcement, but I tell you guys, so yeah, there's this thing called fastHTML, which basically lets you create a complete web application in a single Python file. Unlike excellent projects like Streamlit and Gradio, you're not working on top of a highly abstracted thing. That's got nothing to do with web foundations. You're working with web foundations directly, but you're able to do it by using pure Python. There's no template, there's no ginger, there's no separate like CSS and JavaScript files. It looks and behaves like a modern SPA web application. And you can create components for like daisy UI, or bootstrap, or shoelace, or whatever fancy JavaScript and or CSS tailwind etc library you like, but you can write it all in Python. You can pip install somebody else's set of components and use them entirely from Python. You can develop and prototype it all in a Jupyter notebook if you want to. It all displays correctly, so you can like interactively do that. And then you mentioned Fastlight, so specifically now if you're using SQLite in particular, it's like ridiculously easy to have that persistence, and all of your handlers will be passed database ready objects automatically, that you can just call dot delete dot update dot insert on. Yeah, you get session, you get security, you get all that. So again, like with most everything I do, it's very little code. It's mainly tying together really cool stuff that other people have written. You don't have to use it, but a lot of the best stuff comes from its incorporation of HTMX, which to me is basically the thing that changes your browser to make it work the way it always should have. So it just does four small things, but those four small things are the things that are basically unnecessary constraints that HTML should never have had, so it removes the constraints. It sits on top of Starlet, which is a very nice kind of lower level platform for building these kind of web applications. The actual interface matches as closely as possible to FastAPI, which is a really nice system for creating the kind of classic JavaScript type applications. And Sebastian, who wrote FastAPI, has been kind enough to help me think through some of these design decisions, and so forth. I mean, everybody involved has been super helpful. Actually, I chatted to Carson, who created HTMX, you know, so about it. Some of the folks involved in Django, like everybody in the community I've spoken to definitely realizes there's a big gap to be filled around, like, highly scalable, web foundation-based, pure Python framework with a minimum of fuss. So yeah, I'm getting a lot of support and trying to make sure that FastHTML works well for people.Swyx [00:42:38]: I would say, when I heard about this, I texted Alexio. I think this is going to be pretty huge. People consider Streamlit and Gradio to be the state of the art, but I think there's so much to improve, and having what you call web foundations and web fundamentals at the core of it, I think, would be really helpful.Jeremy [00:42:54]: I mean, it's based on 25 years of thinking and work for me. So like, FastML was built on a system much like this one, but that was of hell. And so I spent, you know, 10 years working on that. We had millions of people using that every day, really pushing it hard. And I really always enjoyed working in that. Yeah. So, you know, and obviously lots of other people have done like great stuff, and particularly HTMX. So I've been thinking about like, yeah, how do I pull together the best of the web framework I created for FastML with HTMX? There's also things like PicoCSS, which is the CSS system, which by default, FastHTML comes with. Although, as I say, you can pip install anything you want to, but it makes it like super easy to, you know, so we try to make it so that just out of the box, you don't have any choices to make. Yeah. You can make choices, but for most people, you just, you know, it's like the PHP in your home directory thing. You just start typing and just by default, you'll get something which looks and feels, you know, pretty okay. And if you want to then write a version of Gradio or Streamlit on top of that, you totally can. And then the nice thing is if you then write it in kind of the Gradio equivalent, which will be, you know, I imagine we'll create some kind of pip installable thing for that. Once you've outgrown, or if you outgrow that, it's not like, okay, throw that all away and start again. And this like whole separate language that it's like this kind of smooth, gentle path that you can take step-by-step because it's all just standard web foundations all the way, you know.Swyx [00:44:29]: Just to wrap up the sort of open source work that you're doing, you're aiming to create thousands of projects with a very, very small team. I haven't heard you mention once AI agents or AI developer tooling or AI code maintenance. I know you're very productive, but you know, what is the role of AI in your own work?Jeremy [00:44:47]: So I'm making something. I'm not sure how much I want to say just yet.Swyx [00:44:52]: Give us a nibble.Jeremy [00:44:53]: All right. I'll give you the key thing. So I've created a new approach. It's not called prompt engineering. It's called dialogue engineering. But I'm creating a system for doing dialogue engineering. It's currently called AI magic. I'm doing most of my work in this system and it's making me much more productive than I was before I used it. So I always just build stuff for myself and hope that it'll be useful for somebody else. Think about chat GPT with code interpreter, right? The basic UX is the same as a 1970s teletype, right? So if you wrote APL on a teletype in the 1970s, you typed onto a thing, your words appeared at the bottom of a sheet of paper and you'd like hit enter and it would scroll up. And then the answer from APL would be printed out, scroll up, and then you would type the next thing. And like, which is also the way, for example, a shell works like bash or ZSH or whatever. It's not terrible, you know, like we all get a lot done in these like very, very basic teletype style REPL environments, but I've never felt like it's optimal and everybody else has just copied chat GPT. So it's also the way BART and Gemini work. It's also the way the Claude web app works. And then you add code interpreter. And the most you can do is to like plead with chat GPT to write the kind of code I want. It's pretty good for very, very, very beginner users who like can't code at all, like by default now the code's even hidden away, so you never even have to see it ever happened. But for somebody who's like wanting to learn to code or who already knows a bit of code or whatever, it's, it seems really not ideal. So okay, that's one end of the spectrum. The other end of the spectrum, which is where Sean's work comes in, is, oh, you want to do more than chat GPT? No worries. Here is Visual Studio Code. I run it. There's an empty screen with a flashing cursor. Okay, start coding, you know, and it's like, okay, you can use systems like Sean's or like cursor or whatever to be like, okay, Apple K in cursors, like a creative form that blah, blah, blah. But in the end, it's like a convenience over the top of this incredibly complicated system that full-time sophisticated software engineers have designed over the past few decades in a totally different environment as a way to build software, you know. And so we're trying to like shoehorn in AI into that. And it's not easy to do. And I think there are like much better ways of thinking about the craft of software development in a language model world to be much more interactive, you know. So the thing that I'm building is neither of those things. It's something between the two. And it's built around this idea of crafting a dialogue, you know, where the outcome of the dialogue is the artifacts that you want, whether it be a piece of analysis or whether it be a Python library or whether it be a technical blog post or whatever. So as part of building that, I've created something called Claudette, which is a library for Claude. I've created something called Cosette, which is a library for OpenAI. They're libraries which are designed to make those APIs much more usable, much easier to use, much more concise. And then I've written AI magic on top of those. And that's been an interesting exercise because I did Claudette first, and I was looking at what Simon Willison did with his fantastic LLM library. And his library is designed around like, let's make something that supports all the LLM inference engines and commercial providers. I thought, okay, what if I did something different, which is like make something that's as Claude friendly as possible and forget everything else. So that's what Claudette was. So for example, one of the really nice things in Claude is prefill. So by telling the assistant that this is what your response started with, there's a lot of powerful things you can take advantage of. So yeah, I created Claudette to be as Claude friendly as possible. And then after I did that, and then particularly with GPT 4.0 coming out, I kind of thought, okay, now let's create something that's as OpenAI friendly as possible. And then I tried to look to see, well, where are the similarities and where are the differences? And now can I make them compatible in places where it makes sense for them to be compatible without losing out on the things that make each one special for what they are. So yeah, those are some of the things I've been working on in that space. And I'm thinking we might launch AI magic via a course called how to solve it with code. The name is based on the classic Polya book, if you know how to solve it, which is, you know, one of the classic math books of all time, where we're basically going to try to show people how to solve challenging problems that they didn't think they could solve without doing a full computer science course, by taking advantage of a bit of AI and a bit of like practical skills, as particularly for this like whole generation of people who are learning to code with and because of ChatGPT. Like I love it, I know a lot of people who didn't really know how to code, but they've created things because they use ChatGPT, but they don't really know how to maintain them or fix them or add things to them that ChatGPT can't do, because they don't really know how to code. And so this course will be designed to show you how you can like either become a developer who can like supercharge their capabilities by using language models, or become a language model first developer who can supercharge their capabilities by understanding a bit about process and fundamentals.Alessio [00:50:19]: Nice. That's a great spoiler. You know, I guess the fourth time you're going to be on learning space, we're going to talk about AI magic. Jeremy, before we wrap, this was just a great run through everything. What are the things that when you next come on the podcast in nine, 12 months, we're going to be like, man, Jeremy was like really ahead of it. Like, is there anything that you see in the space that maybe people are not talking enough? You know, what's the next company that's going to fall, like have drama internally, anything in your mind?Jeremy [00:50:47]: You know, hopefully we'll be talking a lot about fast HTML and hopefully the international community that at that point has come up around that. And also about AI magic and about dialogue engineering. Hopefully dialogue engineering catches on because I think it's the right way to think about a lot of this stuff. What else? Just trying to think about all on the research side. Yeah. I think, you know, I mean, we've talked about a lot of it. Like I think encoder decoder architectures, encoder only architectures, hopefully we'll be talking about like the whole re-interest in BERT that BERT 24 stimulated.Swyx [00:51:17]: There's a safe space model that came out today that might be interesting for this general discussion. One thing that stood out to me with Cartesia's blog posts was that they were talking about real time ingestion, billions and trillions of tokens, and keeping that context, obviously in the state space that they have.Jeremy [00:51:34]: Yeah.Swyx [00:51:35]: I'm wondering what your thoughts are because you've been entirely transformers the whole time.Jeremy [00:51:38]: Yeah. No. So obviously my background is RNNs and LSTMs. Of course. And I'm still a believer in the idea that state is something you can update, you know? So obviously Sepp Hochreiter came up, came out with xLSTM recently. Oh my God. Okay. Another whole thing we haven't talked about, just somewhat related. I've been going crazy for like a long time about like, why can I not pay anybody to save my KV cash? I just ingested the Great Gatsby or the documentation for Starlet or whatever, you know, I'm sending it as my prompt context. Why are you redoing it every time? So Gemini is about to finally come out with KV caching, and this is something that Austin actually in Gemma.cpp had had on his roadmap for years, well not years, months, long time. The idea that the KV cache is like a thing that, it's a third thing, right? So there's RAG, you know, there's in-context learning, you know, and prompt engineering, and there's KV cache creation. I think it creates like a whole new class almost of applications or as techniques where, you know, for me, for example, I very often work with really new libraries or I've created my own library that I'm now writing with rather than on. So I want all the docs in my new library to be there all the time. So I want to upload them once, and then we have a whole discussion about building this application using FastHTML. Well nobody's got FastHTML in their language model yet, I don't want to send all the FastHTML docs across every time. So one of the things I'm looking at doing in AI Magic actually is taking advantage of some of these ideas so that you can have the documentation of the libraries you're working on be kind of always available. Something over the next 12 months people will be spending time thinking about is how to like, where to use RAG, where to use fine-tuning, where to use KV cache storage, you know. And how to use state, because in state models and XLSTM, again, state is something you update. So how do we combine the best of all of these worlds?Alessio [00:53:46]: And Jeremy, I know before you talked about how some of the autoregressive models are not maybe a great fit for agents. Any other thoughts on like JEPA, diffusion for text, any interesting thing that you've seen pop up?Jeremy [00:53:58]: In the same way that we probably ought to have state that you can update, i.e. XLSTM and state models, in the same way that a lot of things probably should have an encoder, JEPA and diffusion both seem like the right conceptual mapping for a lot of things we probably want to do. So the idea of like, there should be a piece of the generative pipeline, which is like thinking about the answer and coming up with a sketch of what the answer looks like before you start outputting tokens. That's where it kind of feels like diffusion ought to fit, you know. And diffusion is, because it's not autoregressive, it's like, let's try to like gradually de-blur the picture of how to solve this. So this is also where dialogue engineering fits in, by the way. So with dialogue engineering, one of the reasons it's working so well for me is I use it to kind of like craft the thought process before I generate the code, you know. So yeah, there's a lot of different pieces here and I don't know how they'll all kind of exactly fit together. I don't know if JEPA is going to actually end up working in the text world. I don't know if diffusion will end up working in the text world, but they seem to be like trying to solve a class of problem which is currently unsolved.Alessio [00:55:13]: Awesome, Jeremy. This was great, as usual. Thanks again for coming back on the pod and thank you all for listening. Yeah, that was fantastic. Get full access to Latent Space at www.latent.space/subscribe
July was a month of momentum for the agbiosciences, including the release of new research that revealed Indiana agbioscience contributes nearly $70B to the state's economy. We are here to recap the month: Accelerate 2050 – a new study that highlights three priority opportunities for the agbiosciences including Food is Health, Farmer-Focused Innovation and Farmer-Focused Innovation. We also hear RTI's Jim Redden's comments on what these areas of focus mean for Indiana – and around the world. Heartland BioWorks – recently announced a nearly $51 million implementation grant made possible by the Regional Technology and Innovation Hubs program created by the CHIPS and Science Act of 2022. Vital Farms – expanding its supply chain to Seymour, Indiana with plans to break ground in 2025 and be fully operational in 2027. This will add 150 new jobs to the area. IBJ 250 – celebrating agbioscience representation on Indiana's list of most prominent leaders! We are hitting the road for more! West Lafayette and Bedford – join us! Click here to register for these free events: https://agrinovusindiana.com/events/accelerate-2050-driving-growth-in-the-future/ QUADRANT is also back in September. Register here: https://agrinovusindiana.com/quadrant/ Other episodes of Agbioscience mentioned: Dr. Kofi Essel, Elevance Health - https://podcast.agrinovusindiana.com/public/215/Agbioscience-8a9a4381/6d11d81a Dave Roberts + Andrew Kossack, Applied Research Institute: https://podcast.agrinovusindiana.com/public/215/Agbioscience-8a9a4381/22551e39
July was a month of momentum for the agbiosciences, including the release of new research that revealed Indiana agbioscience contributes nearly $70B to the state's economy. We are here to recap the month: Accelerate 2050 – a new study that highlights three priority opportunities for the agbiosciences including Food is Health, Farmer-Focused Innovation and Farmer-Focused Innovation. We also hear RTI's Jim Redden's comments on what these areas of focus mean for Indiana – and around the world. Heartland BioWorks – recently announced a nearly $51 million implementation grant made possible by the Regional Technology and Innovation Hubs program created by the CHIPS and Science Act of 2022. Vital Farms – expanding its supply chain to Seymour, Indiana with plans to break ground in 2025 and be fully operational in 2027. This will add 150 new jobs to the area. IBJ 250 – celebrating agbioscience representation on Indiana's list of most prominent leaders! We are hitting the road for more! West Lafayette and Bedford – join us! Click here to register for these free events: https://agrinovusindiana.com/events/accelerate-2050-driving-growth-in-the-future/ QUADRANT is also back in September. Register here: https://agrinovusindiana.com/quadrant/ Other episodes of Agbioscience mentioned: Dr. Kofi Essel, Elevance Health - https://podcast.agrinovusindiana.com/public/215/Agbioscience-8a9a4381/6d11d81a Dave Roberts + Andrew Kossack, Applied Research Institute: https://podcast.agrinovusindiana.com/public/215/Agbioscience-8a9a4381/22551e39
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI #74: GPT-4o Mini Me and Llama 3, published by Zvi on July 26, 2024 on LessWrong. We got two big model releases this week. GPT-4o Mini is covered here. Llama 3.1-405B (and 70B and 8B) is mostly covered in yesterday's post, this has some follow up. Table of Contents 1. Introduction. 2. Table of Contents. 3. Language Models Offer Mundane Utility. All your coding are belong to us. 4. Language Models Don't Offer Mundane Utility. Math is hard. Can be expensive. 5. GPT-4o Mini Me. You complete me at lower than usual cost. 6. Additional Llama-3.1 Notes. Pricing information, and more rhetoric. 7. Fun With Image Generation. If you're confused why artists are so upset. 8. Deepfaketown and Botpocalypse Soon. Not surprises. 9. They Took Our Jobs. Layoffs at Activision and across gaming. 10. In Other AI News. New benchmarks, new chip variants, and more. 11. The Art of the Jailbreak. Pliny remains undefeated. 12. Quiet Speculations. Where will the utility be coming from? 13. The Quest for Sane Regulations. Public opinion continues to be consistent. 14. Openly Evil AI. Some Senators have good questions. 15. The Week in Audio. Dwarkesh in reverse, and lots of other stuff. Odd Lots too. 16. Rhetorical Innovation. What are corporations exactly? 17. Aligning a Smarter Than Human Intelligence is Difficult. So are evals. 18. People Are Worried About AI Killing Everyone. Roon warns you to beware. 19. The Sacred Timeline. Hype? 20. Other People Are Not As Worried About AI Killing Everyone. Older Joe Rogan. 21. The Lighter Side. It's on. Language Models Offer Mundane Utility Coding is seriously much faster now, and this is the slowest it will ever be. Roon: pov: you are ten months from working for claude sonnet the new technical founder. Garry Tan: Underrated trend. It's happening. Sully: 50% of our code base was written entirely by LLMs expect this to be ~80% by next year With sonnet we're shipping so fast, it feels like we tripled headcount overnight Not using Claude 3.5 to code? Expect to be crushed by teams who do (us). Not only coding, either. Jimmy (QTing Tan): It can also do hardware related things quite well too, and legal, and logistics (planning) and compliance even. I've been able to put off hiring for months. When I run out of sonnet usage I patch in gpt-4o, it's obviously and notably worse which I why I rarely use it as a primary anymore. Claude 3.5 Sonnet becomes the first AI to crush the Lem Test to 'write an impossible poem.' Laugh all you want, this is actually great. Kache: dude hahahahahah i used so many tokens today on just formatting json logs near: the just stop oil people are gonna come and spray paint you now Compared to how much carbon a human coder would have used? Huge improvement. Language Models Don't Offer Mundane Utility IMO problems are still mostly too hard. The linked one, which GPT-4, GPT-4o and Claude 3.5 Sonnet failed on, seems unusually easy? Although a math Olympiad solver does, predictably given the contests we've seen. [EDIT: I didn't read this properly, but a reader points out this is the floor symbol, which means what I thought was an obvious proof doesn't actually answer the question, although it happens to get the right answer. Reader says the answers provided would actually also get 0/7, order has been restored]. Figure out what song Aella was talking about here. Found the obvious wrong answer. Grok offers to tell you 'more about this account.' I haven't seen the button yet, probably it is still experimental. Our price cheap. Llama 3.1-405B was a steal in terms of compute costs. Seconds: "AI is expensive" its not even half the cost of a middling marvel movie. Teortaxes: Pretty insane that the cost of producing llama-3-405B, this behemoth, is like 40% of *Ant-Man and the Wasp: Quantumania* movie at most If I were Zuck, I'd have open sourced a $...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Llama Llama-3-405B?, published by Zvi on July 25, 2024 on LessWrong. It's here. The horse has left the barn. Llama-3.1-405B, and also Llama-3.1-70B and Llama-3.1-8B, have been released, and are now open weights. Early indications are that these are very good models. They were likely the best open weight models of their respective sizes at time of release. Zuckerberg claims that open weights models are now competitive with closed models. Yann LeCun says 'performance is on par with the best closed models.' This is closer to true than in the past, and as corporate hype I will essentially allow it, but it looks like this is not yet fully true. Llama-3.1-405B not as good as GPT-4o or Claude Sonnet. Certainly Llama-3.1-70B is not as good as the similarly sized Claude Sonnet. If you are going to straight up use an API or chat interface, there seems to be little reason to use Llama. That is a preliminary result. It is still early, and there has been relatively little feedback. But what feedback I have seen is consistent on this. Prediction markets are modestly more optimistic. This market still has it 29% to be the #1 model on Arena, which seems unlikely given Meta's own results. Another market has it 74% to beat GPT-4-Turbo-2024-04-09, which currently is in 5th position. That is a big chance for it to land in a narrow window between 1257 and 1287. This market affirms that directly on tiny volume. Such open models like Llama-3.1-405B are of course still useful even if a chatbot user would have better options. There are cost advantages, privacy advantages and freedom of action advantages to not going through OpenAI or Anthropic or Google. In particular, if you want to distill or fine-tune a new model, and especially if you want to fully own the results, Llama-3-405B is here to help you, and Llama-3-70B and 8B are here as potential jumping off points. I expect this to be the main practical effect this time around. If you want to do other things that you can't do with the closed options? Well, technically you can't do most of them under Meta's conditions either, but there is no reason to expect that will stop people, especially those overseas including in China. For some of these uses that's a good thing. Others, not as good. Zuckerberg also used the moment to offer a standard issue open source manifesto, in which he abandons any sense of balance and goes all-in, which he affirmed in a softball interview with Rowan Cheung. On the safety front, while I do not think they did their safety testing in a way that would have caught issues if there had been issues, my assumption is there was nothing to catch. The capabilities are not that dangerous at this time. Thus I do not predict anything especially bad will happen here. I expect the direct impact of Llama-3.1-405B to be positive, with the downsides remaining mundane and relatively minor. The only exception would be the extent to which this enables the development of future models. I worry that this differentially accelerates and enables our rivals and enemies and hurts our national security, and indeed that this will be its largest impact. And I worry more that this kind of action and rhetoric will lead us down the path where if things get dangerous in the future, it will become increasingly hard not to get ourselves into deep trouble, both in terms of models being irrevocably opened up when they shouldn't be and increasing pressure on everyone else to proceed even when things are not safe, up to and including loss of control and other existential risks. If Zuckerberg had affirmed a reasonable policy going forward but thought the line could be drawn farther down the line, I would have said this was all net good. Instead, I am dismayed. I do get into the arguments about open weights at the end of this post, because it felt obligato...
Try Llama 3.1 Models on SimTheory: https://simtheory.aiJoin our community: https://thisdayinai.comShow notes: https://thisdayinai.com/bookmarks/64-ep71------CHAPTERS:------00:00 - Llama 3.1 8B, 70B and 405B News & Initial Thoughts27:44 - Discussion on Context Input Optimization, RAG and context focuses including "memory stack"38:53 -Best model right now? GPT4-Mini daily driving & is Claude Sonnet 3.5 getting dumber?42:17 - Official Llama 3.1 BOOM FACTOR scores47:08 - GPT4o-Mini Fine Tuning is Now Available53:19 - Chris's Apology for Ruining Online Poker with AI to the Poker Community ------Thanks for listening and all of your support!
If you see this in time, join our emergency LLM paper club on the Llama 3 paper!For everyone else, join our special AI in Action club on the Latent Space Discord for a special feature with the Cursor cofounders on Composer, their newest coding agent!Today, Meta is officially releasing the largest and most capable open model to date, Llama3-405B, a dense transformer trained on 15T tokens that beats GPT-4 on all major benchmarks:The 8B and 70B models from the April Llama 3 release have also received serious spec bumps, warranting the new label of Llama 3.1.If you are curious about the infra / hardware side, go check out our episode with Soumith Chintala, one of the AI infra leads at Meta. Today we have Thomas Scialom, who led Llama2 and now Llama3 post-training, so we spent most of our time on pre-training (synthetic data, data pipelines, scaling laws, etc) and post-training (RLHF vs instruction tuning, evals, tool calling).Synthetic data is all you needLlama3 was trained on 15T tokens, 7x more than Llama2 and with 4 times as much code and 30 different languages represented. But as Thomas beautifully put it:“My intuition is that the web is full of s**t in terms of text, and training on those tokens is a waste of compute.” “Llama 3 post-training doesn't have any human written answers there basically… It's just leveraging pure synthetic data from Llama 2.”While it is well speculated that the 8B and 70B were "offline distillations" of the 405B, there are a good deal more synthetic data elements to Llama 3.1 than the expected. The paper explicitly calls out:* SFT for Code: 3 approaches for synthetic data for the 405B bootstrapping itself with code execution feedback, programming language translation, and docs backtranslation.* SFT for Math: The Llama 3 paper credits the Let's Verify Step By Step authors, who we interviewed at ICLR:* SFT for Multilinguality: "To collect higher quality human annotations in non-English languages, we train a multilingual expert by branching off the pre-training run and continuing to pre-train on a data mix that consists of 90% multilingualtokens."* SFT for Long Context: "It is largely impractical to get humans to annotate such examples due to the tedious and time-consuming nature of reading lengthy contexts, so we predominantly rely on synthetic data to fill this gap. We use earlier versions of Llama 3 to generate synthetic data based on the key long-context use-cases: (possibly multi-turn) question-answering, summarization for long documents, and reasoning over code repositories, and describe them in greater detail below"* SFT for Tool Use: trained for Brave Search, Wolfram Alpha, and a Python Interpreter (a special new ipython role) for single, nested, parallel, and multiturn function calling.* RLHF: DPO preference data was used extensively on Llama 2 generations. This is something we partially covered in RLHF 201: humans are often better at judging between two options (i.e. which of two poems they prefer) than creating one (writing one from scratch). Similarly, models might not be great at creating text but they can be good at classifying their quality.Last but not least, Llama 3.1 received a license update explicitly allowing its use for synthetic data generation.Llama2 was also used as a classifier for all pre-training data that went into the model. It both labelled it by quality so that bad tokens were removed, but also used type (i.e. science, law, politics) to achieve a balanced data mix. Tokenizer size mattersThe tokens vocab of a model is the collection of all tokens that the model uses. Llama2 had a 34,000 tokens vocab, GPT-4 has 100,000, and 4o went up to 200,000. Llama3 went up 4x to 128,000 tokens. You can find the GPT-4 vocab list on Github.This is something that people gloss over, but there are many reason why a large vocab matters:* More tokens allow it to represent more concepts, and then be better at understanding the nuances.* The larger the tokenizer, the less tokens you need for the same amount of text, extending the perceived context size. In Llama3's case, that's ~30% more text due to the tokenizer upgrade. * With the same amount of compute you can train more knowledge into the model as you need fewer steps.The smaller the model, the larger the impact that the tokenizer size will have on it. You can listen at 55:24 for a deeper explanation.Dense models = 1 Expert MoEsMany people on X asked “why not MoE?”, and Thomas' answer was pretty clever: dense models are just MoEs with 1 expert :)[00:28:06]: I heard that question a lot, different aspects there. Why not MoE in the future? The other thing is, I think a dense model is just one specific variation of the model for an hyperparameter for an MOE with basically one expert. So it's just an hyperparameter we haven't optimized a lot yet, but we have some stuff ongoing and that's an hyperparameter we'll explore in the future.Basically… wait and see!Llama4Meta already started training Llama4 in June, and it sounds like one of the big focuses will be around agents. Thomas was one of the authors behind GAIA (listen to our interview with Thomas in our ICLR recap) and has been working on agent tooling for a while with things like Toolformer. Current models have “a gap of intelligence” when it comes to agentic workflows, as they are unable to plan without the user relying on prompting techniques and loops like ReAct, Chain of Thought, or frameworks like Autogen and Crew. That may be fixed soon?
US equities ended mostly higher in fairly quiet Wednesday trading, with the Dow Jones, S&P500, and Nasdaq closing up 4bps, 16bps, and 49bps respectively. May new home sales missed. Architecture Billings Index declined in May to its lowest level since August 2020. Today's $70B auction of 5-year notes saw slight stop through in latest well-received sale. Micron trading lower after hours as revenue guidance was shy.
It's return guest season here at Latent Space! We last talked to Kanjun in October and Jonathan in May (and December post Databricks acquisition): Imbue and Databricks are back for a rare treat: a double-header interview talking about DBRX from Databricks and Imbue 70B, a new internal LLM that “outperforms GPT-4o” zero-shot on a range of reasoning and coding-related benchmarks and datasets, while using 7x less data than Llama 3 70B.While Imbue, being an agents company rather than a model provider, are not releasing their models today, they are releasing almost everything else: * Cleaned-up and extended versions of 11 of the most popular NLP reasoning benchmarks* An entirely new code-focused reasoning benchmark* A fine-tuned 70B model, built with Meta Llama 3, to identify ambiguity* A new dataset of 450,000 human judgments about ambiguity* Infrastructure scripts for bringing a cluster from bare metal to robust, high performance training* Our cost-aware hyperparameter optimizer, CARBS, which automatically and systematically fine-tunes all hyperparameters to derive optimum performance for models of any sizeAs well as EXTREMELY detailed posts on the infrastructure needs, hyperparameter search, and clean versions of the sorry state of industry standard benchmarks. This means for the FIRST TIME (perhaps since Meta's OPT-175B in 2022?) you have this level of educational detail into the hardware and ML nitty gritty of training extremely large LLMs, and if you are in fact training LLMs of this scale you now have evals, optimizers, scripts, and human data/benchmarks you can use to move the industry forward together with Imbue.We are busy running the sold-out AI Engineer World's Fair today, and so are unable to do our usual quality writeup, however, please enjoy our show notes and the excellent conversation! Thanks also to Kanjun, Ashley, Tom and the rest of team Imbue for setting up this interview behind the scenes.Video podTimestamps* [00:00:00] Introduction and catch up with guests* [00:01:55] Databricks' text to image model release* [00:03:46] Details about the DBRX model* [00:05:26] Imbue's infrastructure, evaluation, and hyperparameter optimizer releases* [00:09:18] Challenges of training foundation models and getting infrastructure to work* [00:12:03] Details of Imbue's cluster setup* [00:18:53] Process of bringing machines online and common failures* [00:22:52] Health checks and monitoring for the cluster* [00:25:06] Typical timelines and team composition for setting up a cluster* [00:27:24] Monitoring GPU utilization and performance* [00:29:39] Open source tools and libraries used* [00:32:33] Reproducibility and portability of cluster setup* [00:35:57] Infrastructure changes needed for different model architectures* [00:40:49] Imbue's focus on text-only models for coding and reasoning* [00:42:26] CARBS hyperparameter tuner and cost-aware optimization* [00:51:01] Emergence and CARBS* [00:53:18] Evaluation datasets and reproducing them with high quality* [00:58:40] Challenges of evaluating on more realistic tasks* [01:06:01] Abstract reasoning benchmarks like ARC* [01:10:13] Long context evaluation and needle-in-a-haystack tasks* [01:13:50] Function calling and tool use evaluation* [01:19:19] Imbue's future plans for coding and reasoning applications* [01:20:14] Databricks' future plans for useful applications and upcoming blog postsTranscriptSWYX [00:00:00]: Welcome to the Latent Space Podcast, another super special edition. Today, we have sort of like a two-header. John Frankel from Mosaic Databricks, or Databricks Mosaic, and Josh Albrecht from MBU. Welcome.JOSH [00:00:12]: Hey, glad to be here.SWYX [00:00:14]: Thank you for having us. Hey, so both of you are kind of past guests. Jonathan, you were actually one of the most popular episodes from last year talking about MPT7B. Remember the days when we trained large models and there was 7B?JONATHAN [00:00:30]: Yeah, back when reproducing LLAMA1-7B was considered a huge accomplishment for the field. Those are the good old days. I miss that.SWYX [00:00:38]: As the things have accelerated a lot. Actually, let's do a quick catch up and Josh, you can chime on in as well. So Databricks got acquired. I talked to you at New York.JONATHAN [00:00:45]: Mosaic got acquired, although sometimes it feels like Mosaic acquired Databricks because, you know, we're having a lot of fun being here. But, you know, yeah.SWYX [00:00:52]: Yeah. I mean, you are chief scientist now of Databricks.JONATHAN [00:00:55]: Chief AI scientist. Careful with the title. As much as I would love to understand how Spark works, I'm going to have to defer that to much smarter people than me.SWYX [00:01:03]: Got it. And I don't know about like what you would highlight so far as a post-acquisition, but the most recent news is that you guys released DBRX. Is that the thing that most people should be aware of?JONATHAN [00:01:13]: Actually, that's no longer the most recent news. Honestly, the most recent news, we announced this, but it was at our Data and AI Summit last week. So it was announced among like 100,000 other things, is that we finally released our text to image model, which has been a year in the making through a collaboration directly with Shutterstock. There was a lot of work put into finding a dataset that we were comfortable with working on and trying to build a model that honestly, I felt like I could trust and that others might be able to trust to put out in the world. So that model was released last week. It's unfortunately just available via API due to the fact that the data is quite sensitive and quite valuable. It's Shutterstock's entire business in a lot of ways, but I'm still really excited that there's now a model that is trained on a dataset where the provenance of every single image is known, and it's a damn good model. So I'm really proud of the team on that.SWYX [00:01:55]: Yeah, amazing. Josh, do you have any thoughts on image model questions?JOSH [00:01:59]: That is not my area of expertise, but I was excited to see the release of it last week as well, and very happy that you guys did a nice job on the data side of everything there. So that was cool to see.SWYX [00:02:09]: I think what's unusual is like, I think Shutterstock's doing multiple deals in multiple labs. So what is the Shutterstock model? Like, I guess, is this the house model for Shutterstock? Is this Databricks' version of the Shutterstock model? Like, what is this?JONATHAN [00:02:22]: The way that I would think about it is that Shutterstock is doing an amazing business in AI across the board. Their dataset is kind of widely known to be the best stock photos dataset in the world, the most comprehensive, the biggest. When you think about like, what dataset am I going to train a multimodal model on? You call Shutterstock. And I, at least I've heard in the news, like OpenAI, Google, Meta, Apple have all called Shutterstock and made those deals. So a lot of models have had Shutterstock data incorporated into them. But this is the only model I know of so far where it was, you know, exclusively and specifically trained just on the vanilla Shutterstock data. There was nothing else mixed in. We didn't go and scrape the web and find other data or combined datasets or anything like that. And so this is, in some sense, the house blend. But the other piece is that it's just a dataset where the provenance of every image is known in public. Where did the data come from? It is the Shutterstock collection. That's it. You know, nothing less, nothing more. And certainly being at Databricks, if I've learned one thing, I've learned about enterprise customers and what they want out of AI. And one of the things they ask for most is just, what can you tell me about the data the model was trained on? And here, especially for text to image models, where images are just tricky subject matter, there's been a lot of kind of legal conversation about images, especially. It's nice to just have something where I can point to it and say, you know, if you want to know where the images came from, these are what they are and this is how they got there.SWYX [00:03:36]: I will talk a little bit about Databricks because it's relevant to the rest of today's episode. So Databricks, sorry, I keep misspeaking. It's DBRX.JONATHAN [00:03:46]: DBRX, actually, there's been a pronunciation update. It is now D-B-Rex. So we have decided to add a dinosaur mascot because what model doesn't like a mascot? So literally, I wish I could pull it up. There is a little plush dinosaur that we had made. It's like the world's cutest dinosaur, but it is the official mascot of D-B-Rex. And there's a little dinosaur logo that, you know, you'll probably see around a little bit more because DBRX is a mouthful, but D-B-Rex, like, you know, it's just kind of...SWYX [00:04:13]: Rolls off the tongue. I love mascots. Like every company should have a mascot. And I think Hugging Face got it right. You need an emoji mascot because that's the minimal viable image.JONATHAN [00:04:21]: I probably shouldn't talk at all about, you know, Velociraptor, but, you know, that's a, maybe that's something we can talk about later in the summer. I'll just leave it at that.SWYX [00:04:28]: Okay. That's a hint to names. I feel like your names leak a lot of alpha. So just to quickly cover the headline details, DBRX, as Make Sure Experts model, that's fairly big, 132 billion total parameters, so 36 billion active on any input, pre-trained on 12 trillion tokens of text and code, and did really well on evals to the point where you had to dye your hair blue. That's my high level conclusion.JONATHAN [00:04:53]: Never make a bet with your team two weeks out from model launch, even when, you know, human eval is looking quite bad. Because if you set some bar, even if it's arbitrary and you think there's no way in hell they're going to hit it, apparently money doesn't motivate people anymore. Humiliating their boss motivates people. So Josh, you should really take a hint from this. You know, you cannot pay someone enough money to make up for you dyeing your hair blue.JOSH [00:05:15]: I'll keep that in mind for our next model.SWYX [00:05:17]: It works. So speaking of Imbue's next model, perhaps Josh, you want to actually just say hi to the general sort of latent space audience and talk about what we're releasing today. Yeah.JOSH [00:05:26]: I'm Josh, CTO of Imbue, and we're not releasing the model. We're not releasing the weights, but we are releasing a bunch of different things that should make it easier for other people to make their own models. So I think right now, training foundation models from scratch is like a very difficult, time-consuming, expensive, kind of risky endeavor, especially for smaller companies. And the things that we're releasing hopefully make that at least a little bit easier. So the things that we're releasing fall into kind of three different buckets. One is infrastructure and scripts for dealing with the kind of hardware and hardware failures and understanding how well is the actually lowest level of thing actually working so that you can actually do your training at all and at a reasonable speed without having to constantly restart, etc. So infrastructure and training scripts. A second set of things is around the evaluation. So after you've trained it, like how well is this actually working and how do you know how well it's working? We're releasing a whole bunch of different data there, a new benchmark about code, reasoning, understanding, as well as our own private versions of 11 different open source benchmarks. So things like pool queue or ANLI, where we've gone through and kind of cleaned up the data as much as possible by looking at all the ones that models get wrong or that are flagged for ambiguity and also our own kind of private reproductions of those where we've done like a kind of clean room black box, like, okay, this is what the data set is supposed to be. Here are some examples. Let's make our own version of this to make sure that there is no data contamination, etc. To make sure that we're actually, you know, not testing on train. And then I think a final thing that we're releasing there is around 450,000 human judgments about ambiguity and question quality, which we used in the process of cleaning these evaluations and we also hope will be helpful for other people training kind of similar models. And then the third thing is CARBS, our hyperparameter, our cost-aware hyperparameter optimizer, which was especially helpful for being able to experiment at much smaller scales and then scale those experiments up to the much larger scale kind of on the first try without having to retry it. You don't want to be training, you know, 10, 20 different 70B models. You really want to get these larger modelsSWYX [00:07:30]: right on the first try.JOSH [00:07:30]: And so the ability to kind of tune things very precisely and learn scaling laws, not just for, you know, the like data and flops, but also for learning rate and all the other hyperparameters and see like how should you scale these things up was extremely valuable to us as we were training the larger models. Yeah, that's a lot of stuff.SWYX [00:07:49]: Yeah, exactly. So there's a bunch of stuffJOSH [00:07:50]: we'll have to go through all of it.JONATHAN [00:07:52]: Yeah, I just want to throw in how excited I am about this. This is the stuff that nobody ever talks about. That is the difference between success and failure in this stuff. Like, can you get your cluster to run? Can you get software on your cluster? Can you figure out what broke? Because fault tolerance is still not really built into any of the fundamental primitives of training models. And so if something breaks, you have to go figure out what broke, your job stops, you have to restart your job. It is a nightmare just to get to the point where anything can train on the cluster. A basic MPI hello world that has the GPUs talk to each other is hard enough, let alone actually training a model, let alone getting good performance out of the GPUs, let alone actually getting a model that converges to anything interesting. There's so many levels of things you have to accomplish. This is the kind of stuff that matters. I think to a point that Josh made earlier, before we got on here, there are plenty of weights out there. Nobody's released this.JOSH [00:08:46]: Yeah, that was part of the motivation actually is that there are lots of other things that are complimentary, but I have not seen nearly as much discussion about some of these other things that we think are pretty important. I mean, in some sense,SWYX [00:08:56]: I'm very excited to have Jonathan on because this is a little bit, you're a bread and butter with Mosaic. And I think you've released some part with Composer. And I think it's just really interesting to see like a different take, basically a full stack take that's kind of open source today.JONATHAN [00:09:18]: Yeah, it's really kind of, it's been an ordeal to figure this out. And every time something changes, whether it's a new GPU or even a new driver update, you get new creative errors and new things go wrong. And, you know, we've dealt with the weirdest things from, you know, our InfiniBand cables getting stolen from the data center twice, like in boxes before they arrived at the data center. Like, you know, Porch Pirate basically had stolen our InfiniBand cables back when those were hard to come by. To like, you know, weird recalls of switches to like the strangest stuff has happened. I have my favorite GPU failures I've seen, like ones where the GPU doesn't fail, it has a correctable memory issue and the memory correction causes the GPU to become a straggler and hold up the whole job. Like weird stuff happens and figuring out how to not just identify all of that, but then eventually productize it, is in some sense, the entire story of Mosaic and now Databricks in terms of our ML offering. Really, the thing we offer is we have gone through this suffering and figured out how to even productize that. It has been a pain in the butt.SWYX [00:10:20]: Yeah, it's a lot of work.JOSH [00:10:20]: I think my favorite failure was GPU is just giving wrong math. Like if they give errors, great, because you can see the errors, but if they just give you the wrong math back, not so fun.SWYX [00:10:30]: When did they give you wrong math?JOSH [00:10:32]: Like literally you could just, you know, add two things. For example, the numbers come back. They're not the numbers that they're supposed to be.JONATHAN [00:10:40]: I think it's important to say at this stage, just because like it, I think it goes without saying for Josh and I, but it's worth saying here, this isn't to say that like anything is wrong with us. It's not like NVIDIA did a bad job or, you know, Mellanox did a bad job or the like the server builder, the data center operator, the cloud provider, like the million other parties that are involved in building this. We are running these insane chips that are huge and complicated and built on tiny transistors at insane frequencies with insane heat in data centers that for the most part, were not built remotely for this kind of power or heat and have been retrofitted for this. Like failures happen on a good day with normal CPUs. And this is not a good day and not a normal CPU for the most part. It's fun to joke about all the weird things we see. This is not to say anybody's done anything wrong. This is just kind of part and parcel of working on a massive cluster running at multiple megawatts of power at a time.SWYX [00:11:32]: It's crazy. Yeah.JONATHAN [00:11:33]: So optical cables, like all sorts, like everything.SWYX [00:11:37]: I'll take the opportunity to start going to the sort of infra piece. There's just like a description of the infra just to give people a sense of what we talk about when we talk about massive clusters. So I'm just going to read off the blog post here. This post is about one cluster that has 4,092 H100 GPUs spread across 511 computers. They use unified fabric manager nodes, which manage the infinite band network. And you talk a little bit about your networking. Is there anything unusual about this setup that you'll call out to people?JOSH [00:12:03]: Yeah, actually this particular cluster is a little bit non-standard. The normal, like vanilla setup for these large clusters as vanilla as it can be is what's normally like a 127 node cluster. So closer to like 1024 GPUs instead of 4,000. Here we have a larger cluster. As you start to get into the larger clusters, the networking becomes a little bit more custom. It's a little bit more, it's a little bit trickier. It's a little bit more difficult to get these things to all be able to talk to each other at the same speed. And so this has, in this particular case, this is a three tier network architecture instead of two tiers, kind of the normal one. So most of the clusters are a little bit smaller. As you get to even larger scales, then this becomes even much more complicated,SWYX [00:12:43]: much more expensive.JOSH [00:12:43]: So we chose this particular scale, kind of knowing our own workloads and kind of what we wanted to do. This was kind of the right size for us. But yeah, I think it's not exactly vanilla already. It's already getting into kind of the custom territory.SWYX [00:12:54]: So my understanding is that there, and is there any part of this that comes with the Voltage Park deal that you guys had? Is that part of the hardware that you got from the deal with them?JOSH [00:13:04]: Yeah, so we worked really closely with Voltage Park to set up all their clusters and infrastructure and everything and kind of decide even like what to order, how should the networking work? Like we were very involved in kind of the construction and bring up of this. And that's what this post is about, is about that process of like bringing up all these, there's like different clusters in different places of different scales. So in this particular post, we're talking about this one 4096 GPU, but there are other clusters that they have as well. And we were very closely involved with figuring out the exact architecture and kind of the trade-offs that go along with picking, you know, those exact components. You really don't want to like place the wrong order because it takes months to get it and it's very expensive. So yeah, we were happy to help out with that.JONATHAN [00:13:43]: And then your bit of good cables get stolen.SWYX [00:13:44]: Yeah, yeah, exactly.JOSH [00:13:47]: We wanted to make sure that we ended up with compute that would work for us and that would also work for their other customers. And so we kind of helped design something so that we would get exactly what we were looking for. We knew that these kinds of details would be super important and that getting down to the level of the hardware and like having these good scripts and everything was going to be a core part of like actually getting this to work. I'm very glad that we did that. I don't think that most companies kind of take that full stack approach, but for us, it certainly paid off.SWYX [00:14:12]: Yeah, it's basically sort of built to spec. It's interesting that relationship because you usually, for the rest of us who don't operate at your scale, we take whatever we can get from cloud providers, but you are basically co-designing from the single machine up. And you described that a little bit. Do you want to take us through the process that you described here?JOSH [00:14:27]: Yeah, so for the actual, like the blog post and kind of bringing these machines online.SWYX [00:14:32]: Yeah.JOSH [00:14:32]: So yeah, I think the process, as we have it broken down in the blog post, there's kind of a few different layers. First is like getting the individual machines to work at all and then getting the machines to actually be able to talk to each other. So getting the InfiniBand networking to work and then getting to a point where, you know, not just the machines are working and they can talk to each other, but everything is actually working correctly. There's a big gap between like it's working at all to it's working perfectly correctly. And then after you have all this stuff working perfectly correctly, nice and healthy, then now you get into kind of the software data, like training issues. And then after that, you're still not done. Like now, even once you're training at full speed, things are going to fail over time. Things are going to change. There's going to be new, you know, firmware updates. Like how do you kind of deal with this change and flux over time without going crazySWYX [00:15:16]: and pulling your hair out,JOSH [00:15:16]: trying to like reproduce things or understand why there were regressions. And so there's a lot of work to kind of automate the infrastructure tooling as well. And kind of the first step, like bringing these things online in the first place, you know, you have hundreds of machines at this point. So you don't necessarily want to be like walking around with like a CD-ROM or a USB drive, like plugging it in with your keyboard, like hitting next, next, next on the OS install. That's not how this works. You do that for one machine. And then you use, we use this thing called Metal as a Service to bring up all the other machines. So it's a kind of server that can kind of install the operating system on these other machines. So most like when you're talking about these machines, like each machine is, you know, on the order of hundreds of thousands of dollars. So they usually come with a kind of out-of-band management interface as well. So they don't, they have their InfiniBand networking. They have their normal 100 gigabit per second Ethernet networking. These are like dual, redundant, et cetera. And then you also have this extra out-of-band management network. So you can log in and you can see like the boot screen or you can see the blue screen of death. You can like get in there and actually see what was wrong, which is pretty fun. And it makes it like possible to automate a lot of this work. So the beginning of that, and the blog post goes into much more detail about like exactly how we set these up and kind of the other errors that we ran into. When you're bringing these online, you'll definitely have failures. Even if they all worked in the factory, they get shipped, some parts come loose, something fails, something goes wrong. So when you're bringing them online, there'll be some that don't quite work for all sorts of reasons. As you start to be working with machines at this scale, like if something happens one in a thousand times, you're like pretty likely to see it. And so you can get pretty rare, weird things, especially since we had fairly early builds and fairly early versions of this hardware. Like these are some of the like first machines that were ever produced, some of the first GPUs. So you've got some extra special things there. We definitely worked with Dell, for example, on making fixes in the firmware level to be like, okay, like this thing is wrong. Like we need to update this at the firmware to like actually fix this particular thing. So we worked pretty closely with Dell and Nvidia. Yeah, that's what I'm saying. Like this stuff gets complicated. And the thing is like, you know, taking a step back, the whole reason we're doing this, right, is that we knew that this was going to be complicated. There would be these kinds of failures. And if we're just using, you know, AWS or some other cloud provider, these errors are still gonna be there and you're gonna have no way to know and no way to debug this and no way to diagnose what's going wrong. And so we would much rather be able to like call up Dell and say, hey, this isn't working. And they're like, yep, okay, cool. Let's debug it together. Oh, I see. Yeah, cool. We'll ship a firmware update and actually fix this for you. That was a much better experience than like, great, just magically fails. I guess we restart and hope that that machine goes away. Like that's not a very good place to be. So yeah, that's kind of the first place is getting to a place where like GPU training is working on your single node machines. You can observe stuff. We have tons of tooling around like, you know, Prometheus and all sorts of other tools for understanding what's going on in these machines because you don't want to be like logging into each one and looking at the temperature or something you really need to have tooling to collect all these metrics, et cetera. Unfortunately, all of the scripts that we have for this are like for this entire cluster and for all this infrastructure are a little bit like special purpose for our particular thing. So it's not that every script that we have, it's not that you can just like take this and plug this in. Even if we did open source all the tooling that we have, you'd still have to do like a lot of work to open source it. What we are releasing is as many of the things that we can that are going to be useful for other people. You're still going to have to have some way of kind of managing these things, making your own like logging aggregators, et cetera, et cetera. So that's kind of bringing them up to the like, you know, the single nodes that are working. From there, it goes into, I'm happy to keep going if you want. Well, I just want to leave the opportunity for JohnSWYX [00:18:53]: to comment if there's anything that's different from how he runs things.JONATHAN [00:18:57]: Oh, I mean, all I'll say is I'll endorse this and say this s**t is hard. Like this is really, really hard. And, you know, I have a special props to, you know, the folks in Vue because they were building this from the ground up. You know, at Databricks and at Mosaic, we typically work with cloud providers because some of this stuff is just, there's too much to handle. It's complicated. There's a lot to deal with. And this doesn't even get into things like physical security, you know, securing power if you're the data center operator. Like this gets infinitely complicated and you have to abstract somewhere. Like, you know, and then you get to the folks who are literally building their own custom chips and like, good God.SWYX [00:19:36]: Like, oh my God, that's, you know,JONATHAN [00:19:38]: if you're one of those folks, you're having, you know, pour one out for the infra people at some of the AI chip startups who are having a really, really interesting time right now. But this stuff is really hard. And I don't think we talk about it much because there's so many other things that are hard. But the other hard things, I think everybody's becoming pretty familiar with at this point. This is something that I don't think there's ever really been a comprehensive discussion of, at least not that I've seen.SWYX [00:20:00]: Yeah, so my impression is that you guys, Mosaic, have your own software for sort of spinning up and down machines, just like Imbue had to build. But Imbue probably, it sounds like Imbue, you guys went fuller stack. I don't know how to describe it. Like Mosaic is not working with Dell on like their firmware.JONATHAN [00:20:21]: No, no, we're typically working with like, you know, pick your cloud provider on their Dell firmware or what have you. Like, it's kind of, I think one of the things, I don't know, Josh, you can correct me on this. It's kind of impossible if you're doing training to not go all the way through the entire stack, regardless of what happens. Like somehow I'm still chatting with cloud providers about power contracts, even though the whole point of dealing with the cloud provider is not to have to think about power contracts. Somehow I'm still asking them about which InfiniBand provider they used this time to see if this is part of the bad batch of cables I encountered on that cloud provider or what have you. Or like, we're still talking about a firmware update from pick your provider. You can't not do this. It's convenient that they have data center staff who are worrying about what to send back to which provider when, and they have people who can go and wait for the InfiniBand cables so they don't get stolen outside. But, you know, it's kind of, it's impossible not to really go full stack if you're thinking about the infrastructure at all. I don't know, Josh, correct me. No, I think that's right.JOSH [00:21:17]: That's what we expected from the beginning as well, is that we would inevitably have to get into the details here. And I'm glad that we kind of just planned for it. I think it made it a lot easier from our perspective to have direct control over this. Instead of having to go to the cloud provider that goes to the data center, that goes to the supplier, we could just go direct to NVIDIA or DellSWYX [00:21:37]: or the data center,JOSH [00:21:37]: whoever was responsible and be like, hey, this thing needs to change. And they're like, oh, okay. Yeah, that is our responsibility. Great, we can fix that. So it was just a lot easier for us to fix these bugs than if we had to go through an extra layer of email.SWYX [00:21:48]: Something we discussed in the pre-show was that you had a rule of thumb for your cluster of reliability. You say here in the post, by and large, you expect around 3% of your machines to break every week. So you're basically going to turn through all your machines in a year.JOSH [00:22:04]: As it says in the post. So that would be true if it was a uniform failure like that. But as it says in the post, it's usually these kind of problematic nodes. And to be clear, that is the number that we've heard from other people is like they're having about 3%. I don't think we're experiencing failure rates that are that high. I think ours is actually quite a bit lower than that, probably because we've taken the time to like dig into a large, maybe larger number than we should have of these failures and get to the root cause of it and be like, oh, okay, like that's exactly what's going wrong.SWYX [00:22:33]: How do we fix this?JOSH [00:22:33]: How do we prevent this from happening? How do we make automated checks for this so that if it does happen, it just goes back to whoever owns that particular part of the process and they can fix it immediately.SWYX [00:22:43]: And that's part of what you're also open sourcing, which is the health checks, right? You got the NIC health checks, GPU health check, this space health check, Docker D message. I don't know what that is.JOSH [00:22:52]: That one is just a lot of stuff.SWYX [00:22:54]: Yeah.JOSH [00:22:55]: That one is one where we realized that actually like when these machines boot, sometimes they wouldn't actually boot cleanly all the way. Or when they rebooted, they had problems that they didn't have when they were working before, which was kind of frustrating. Like usually if you restart your computer,SWYX [00:23:08]: it gets better.JOSH [00:23:08]: Here you restart. It did not get better.SWYX [00:23:10]: It got worse.JOSH [00:23:10]: That was very frustrating. So this health check looks at every particular line we've ever seen from the boot, like in D message, like every single log line that your computer emitsSWYX [00:23:21]: and says like,JOSH [00:23:21]: have we ever seen this before?SWYX [00:23:23]: Is this expected?JOSH [00:23:23]: Is this in the right order? Or is there something out of place? If there's anything out of place, let me say, okay, great. Like now it goes into this, like longer, more triage list of like, all right, great. Like, is this acceptable?SWYX [00:23:33]: Should we flag this?JOSH [00:23:33]: Like, should someone take a look at this? So we're looking down at a very, very granular detail level, what's happening on these computers to make sure that nothing is out of place. And that's critical because without that, if you're running your training, as Jonathan said, and this thing is slow, like what are you supposed to do? Right?SWYX [00:23:49]: Like you really,JOSH [00:23:49]: you really want to be very certain that like all 4,000 of these GPUs are working like they're supposed to.SWYX [00:23:54]: We know that.JOSH [00:23:54]: And so if it's slow, it's because like we messed up the config or something else and not because of this earlier thing that's like really hard to detect in software later.JONATHAN [00:24:01]: Yeah. I think the, I'm just curious to ask,SWYX [00:24:03]: like, you know,JONATHAN [00:24:03]: suppose you were to set up another, let's say another H100 cluster and it were at a different data center. And instead of the vendor being Dell, it was super micro or what have you. How much of this would be repeatable? And how much of this would you have to redo? I, you know, I genuinely don't know.SWYX [00:24:18]: A decent amount.JOSH [00:24:19]: I think it would go a lot faster the second time. I think there's lots of learnings that we had. And also the blog post,SWYX [00:24:24]: you know, yes,JOSH [00:24:24]: we are releasing the health checks, releasing some scripts, but a lot of the valuable stuff is also in the blog post itself, in the details and kind of the, you know, the learnings that we've had and the sort of errors that we run into. We tried to as much as possible surface those to other peopleSWYX [00:24:36]: could learn from thoseJOSH [00:24:36]: and avoid the same mistakes or failures as well. But I think it would go a lot faster.SWYX [00:24:41]: Although, yes,JOSH [00:24:41]: there would certainly be some things that'd be a little bit different. I mean, there'd probably be different CPUsSWYX [00:24:46]: or whatever,JOSH [00:24:46]: but I think a lot of that stuff is less,SWYX [00:24:49]: it's less,JOSH [00:24:49]: that's the like, that's less variable. I think most of it would apply the second time around. Although I'm sure next timeSWYX [00:24:56]: we're building one,JOSH [00:24:56]: it'll probably be, you know, at a scale that's 10x as big with a different chip or something like this.SWYX [00:25:00]: And then who knows?JOSH [00:25:01]: Yeah, with Kinect X8,JONATHAN [00:25:02]: that will have its own fun behavior and all that good stuff. Yeah.SWYX [00:25:06]: Perhaps there's something that people don't discuss about, and you don't even talk about this in the blog, but I always wonder is what is the timeline that's like kind of reasonable for this amount of work, at least the initial stages? And also what does the team composition look like for setting up a cluster, right? Like what are the mix of skills that you typically would require to get all this going?JOSH [00:25:27]: I'm, I can't really speak to typical. One thing I am very proud of is how much we accomplished with such a ridiculously small team. Like our infrastructure team is like, you know, fluctuates from week to week, depending on like how many things are on fire and how much we need to build. But it's like between like three and six people, like it's small. It's not like some huge team of like tons and tons of engineers. But those people are very, very good at what they do. And so that has allowed us to get a lot of mileage out of out of these things. I think it's not that we're building everything, right? It's not that three to six people build this whole thing. I definitely want to like, you know, say thanks very much to Dell and H5 and NVIDIA and the other people that have done a lot of the work, like to bring up this cluster, you know, with 4000 GPUs and three tier networking, networking architecture, you have 12,000 cables. So that's 24,000 things that need to be plugged in. Like that's just a lot of stuff to plug in, right? And you don't want to mess it up. Like each one needs to be done correctly. Like it's a little bit loose. Like it doesn't really work.SWYX [00:26:23]: If you break it,JOSH [00:26:23]: you need to replace it. Like there's a lot of workSWYX [00:26:26]: that goes into this.JOSH [00:26:27]: Yeah.SWYX [00:26:28]: And then, you know,JOSH [00:26:28]: that's just like that's it. That's if you were to do everything right the first time.SWYX [00:26:32]: And if you didn'tJOSH [00:26:32]: have to fix anything. But inevitably, you know, you will have to replace something, which means like taking all the wires out, pulling the thing out, taking all the GPUs out, going and fixing some cable, putting it all back correctly, putting it back in, doing this every time. So there were a lot of people at Dell, NVIDIA and at H5 that all helped a ton with this stuff. I don't know the exact size of the Dell team. It also fluctuated over time.SWYX [00:26:55]: Yeah, excellent. And then, you know, you so you have all the hardware set up and now you're firing it up for a single node. There's a long description that you guys have about just like monitoring the MFU, right? And what each situation might look might be indicative of. One of the most interesting things to me that I saw from here is like, you know, if training immediately starts off at 60 to 80% MFU, something's wrong.SWYX [00:27:24]: But like, you know, like what what are like, you know, some anecdotes or, you know, notable scenarios here that you might you might call out as maybe counterintuitive or super interesting.JOSH [00:27:36]: There's just so many of them. I mean, one of them, which I think is probably pretty common, like common knowledge by this point. But like we did have a sort of likeSWYX [00:27:46]: which one was this exactly?JOSH [00:27:47]: I think for the MFU, like gradually getting worse over time. I think that one, when we saw that the first time we were like, what the heck is going on? Like, why does it get just like a little bit worse? This is so strange. Like, what is it getting lazy or tired or something? Like, is it heat? Like what's going on? And in this particular case, it was memory fragmentation. Because you have hundreds of machines, they're doing garbage collection slightly different times. And then they get slightly further apart and slightly more and more jittered until eventually they're all happening kind of at random times. And just like really messing up each one of your steps. So you just turn off garbage collection and call it a day, basically,SWYX [00:28:20]: to be honest.JOSH [00:28:20]: There's other things you can do if you want to be a little bit more sophisticated about it. But you can also just manuallyJONATHAN [00:28:25]: have it all garbage collect on some interval. Like that's what we've done. We just have a garbage collection callback that just runs. But I've seen the exact same thing.JOSH [00:28:33]: Yeah, yeah, exactly. So I thought that one was kind of funny. And we did trace that one down and look and we did find the actual call. Like, again, this goes to like having good tools. So we had really good tools where we could look at a bunch of like actual traces in C and be like, OK, cool. This is the thing that's taking a lot of time. Or like, you know, this is the thing that doesn't quite line up here. Like, oh, I guess it's garbage collection. OK, cool.SWYX [00:28:52]: Interesting.JOSH [00:28:52]: Yeah, let's just try taking it off.SWYX [00:28:54]: OK, great.JOSH [00:28:54]: That's what it was. Now we can fix it. So for each of them, like basically bugs are not hard if you have good tools. But if you don't have good tools, bugs can be very, very hard. So similarly for like heat, another thing that we saw was like, oh, you know, the CPU is getting throttled. OK, well, it's easy to see if you're monitoring the CPU throttling or monitoring the heat. If you're not monitoring that, it's really hard to know why it's just suddenly one of them is going slower. I noticed also in the pieceSWYX [00:29:17]: that you mentioned FSDP with 0.3. Actually, we met, I went to iClear and Guanhua from the DSP team was there presenting 0++. I was wondering if you want to make any call outs to, you know, particular open source or open library or open whatever implementation teams that were super helpful in your process. I think we ended up actuallyJOSH [00:29:39]: pulling from a whole bunch of different ones to pull things in into our own particular pipeline. So we use things from NVIDIA's, you know, Megatron stuff. We use stuff from probably DeepSpeed. I think we pulled in a bunch of different pieces from a bunch of different places. So it was really nice to see all these working open source like examples. I think I really appreciate all the effort that has gone into actually tuning these things because you can tune them, but it's a lot of work to like tune this stuff and do all this stuff from scratch. It's really nice to have like a working example. I think those are probably the two biggest ones, DeepSpeed and Megatron alone, but there are probably other ones as well.SWYX [00:30:13]: Is there a particular thing in the ecosystem where you would call out as like, you know, there should be something here that is open source, but like it's not really, it's like everyone kind of builds it on their own. I want to say something with the file system because everyone talks about the file system eventually.JOSH [00:30:28]: The file system actually was,SWYX [00:30:30]: I mean, we did somethingJOSH [00:30:31]: kind of dumb there. Like we have our own sort of local mirror so that we can, you know, like a crappy version of S3SWYX [00:30:38]: that's local,JOSH [00:30:38]: but it's just a pretty simple script, right?SWYX [00:30:41]: Like I think we run likeJOSH [00:30:41]: a little web server that just like serves files and then, you know, it can upload themSWYX [00:30:45]: and download them.JOSH [00:30:45]: Okay, great. And part of the reason we did that is that our internet connectionSWYX [00:30:50]: in the beginningJOSH [00:30:50]: was not the like full speedSWYX [00:30:52]: one that we wouldJOSH [00:30:52]: eventually have. And so we are a little bit more kind of bottlenecked in terms of internet bandwidth. And so we had this. I think we looked at a bunch of services out there like Minio and some other ones, but a lot of these like come with a lot of extra overhead and maintenance. And since we already have so much infrastructureSWYX [00:31:09]: to deal with,JOSH [00:31:09]: we kind of didn't want to, you know, bring in a whole other like cloud provider, virtualize something, something.SWYX [00:31:14]: We just wanted something simple.JOSH [00:31:14]: So we went with that, which has been quite helpful. Like our toolsSWYX [00:31:19]: are usually quite simple.JOSH [00:31:19]: It's like Bash and Python and SSH and Docker. Like we'd like to keep things simple so that's easier to debug, like less layers of infrastructure, less layers of abstraction, make it a lot easier to work with. Like we don't use Kubernetes,SWYX [00:31:30]: for example,JOSH [00:31:30]: and we just directly launch these things. And it's just been much easier to debug this way. One tool actually that does come into mind that I will call out is Kraken from Uber. That was great. We love that tool. We were a little bit skeptical. What is it?SWYX [00:31:44]: I'm sorry. Yeah.JOSH [00:31:45]: So Kraken is this, yeah, it's a distributed like Docker registry, basically, that uses BitTorrent to like transfer things between the machines in a sort of nice optimal way. Like in the very beginning, the naive way is like you have this one Docker registry, which was outside of the cluster. So every time we change an image, you know, there's many gigabytes that each of the 500 machines needs to download.SWYX [00:32:07]: So that just takesJOSH [00:32:07]: a really long time. So what this thing does is like just one of them downloads it and then like they all sort of broadcast all the pieces to each other. And it was just like a really nice, fast way of getting these images down. And it was very robust.SWYX [00:32:19]: Like there's a lotJOSH [00:32:19]: going on under the hood, but I think it's a pretty cool tool that we haven't really had any bugs with it at all. Amazing.SWYX [00:32:26]: Yeah. I mean, that's all my questions, I guess, for the info piece. I don't know if, John, you had something that you were sort of burning to ask or.JONATHAN [00:32:33]: No, all I can say is just sameSWYX [00:32:36]: in a lot of places, like, you know, and they're done thatJONATHAN [00:32:38]: seeing this plus one. I think the one big difference, you know, perhaps in philosophies is we've tried to basically standardize on as much commodity stuff as possible, just because, you know, I think the reason I asked about trying to do thisSWYX [00:32:50]: on multiple differentJONATHAN [00:32:50]: pieces of infrastructure is like, I think we're running on like six or seven different clouds right now. And everybody has done something slightly different. And my gosh, the little differences add up as you know, you've seen. And so, you know,SWYX [00:33:04]: our philosophy has been like, whatever the hellJONATHAN [00:33:05]: we can standardize, please let's standardize it. Like vanilla off the shelf FSDB.SWYX [00:33:10]: And like, you know,JONATHAN [00:33:10]: we wrote our own data loader, but we've tried to make that as much of a standard as we can across our infrastructure and in Databricks, because things just start getting really complicatedSWYX [00:33:18]: or like we useJONATHAN [00:33:18]: Kubernetes extensively because it at least gives us a uniform set of APIs. Like that's our hardware abstraction layer to a certain extent for everything else. So it's just, you know, a difference in philosophy there. But otherwise, like, yeah, this stuff is really, really hard. And I feel like we take for granted how much of this, you know, is done for us when you go and you just query chat GPT, for example. Like, oh my God, everything going on underneath that, you know, it's kind of a miracle that the machines boot up, let alone that you can like query a giant language model that's probably doing inference across multiple machines and was trained across thousands of machines. Like, you know, minor miracle.SWYX [00:33:54]: Yeah, it is an awesome amount of power that we invoke with a single API call that we take for granted these days. It's absurd. Yeah, I mean, like Kubernetes, like that point about Kubernetes, I will say as a former AWS employee, like it seems like it would be ideal for imbue to at some point make it more abstracted or agnostic because you're going to want to, you know, replicate your setup. We do have our ownJOSH [00:34:19]: sort of replacement. It's just a much simpler version of Kubernetes. Kubernetes is really designed for running services, not for running experiments. Like that's not its like main architecture. And so for us, like we have everything that's like, cool, you're going to run an experiment. So you want it to run to completion, right?SWYX [00:34:34]: OK, great.JOSH [00:34:34]: Like the primitives are sort of built around a slightly different style. And that makes it a lot easier, like just a lot simpler to fit that the nature of like these machines are going to disappear. They will need to be rebooted for infrastructure upgrades. They will like something will happen to the GPUs. Failure is like baked into this as like a core part of our infrastructure. So it's not that we don't have an abstraction. It's that it's a sort of simpler, more tailored abstraction for the particular work that we're doing.JONATHAN [00:34:58]: Yeah, I think it all depends on what your goals are. And like, I think the challenge in a lot of the deep learning stuff right now is that people are trying to like, people often build things that are more complicated than necessary to get the job done. And the complication is the enemy of everything. You know, don't use a fancier parallelism strategy than you have to. Don't use a fancier set of libraries than you have to.SWYX [00:35:18]: Don't do anythingJONATHAN [00:35:18]: that you don't have to do because it's hard enough as it is. Like, don't overcomplicateSWYX [00:35:23]: your own life.JONATHAN [00:35:23]: Don't try to bring in more tools or more fancy architecture tweaks if you absolutely don't have to.SWYX [00:35:29]: Like getting to the minimumJONATHAN [00:35:30]: necessary to get the job done. And it's really tempting to want to try to use everything. So like, I totally understand that one.SWYX [00:35:37]: I think the last piece I'll maybe call out is that I'm just going to weave this in just because I see the opportunity to do it. Are there any infrastructure shifts that need to be, that need to rise because of changing architecture? So I think, for example,SWYX [00:35:57]: you're announcing a dense model, a 70B dense model, whereas John just worked on DBRX and the image-to-text model, which presumably has different bottlenecks.JONATHAN [00:36:10]: That's correct for us. You know, we train both dense and mixture of expert models. The one we happened to, you know, kind of get permission to open source was a mixture of expert model. And those models are very demanding when it comes to network bandwidth, at least if you're training them in kind of FSTP 03 style, where there's just a lot of parameters getting shuffled back and forth. And your ratio of kind of compute to amount of data that you have to shuffle back and forth becomes a lot worse because you're now, you know, you're only using a fraction of the parameters for every token instead of all the parameters. And so we had to really push the envelope on getting all the stuff to the right places on time. And so actually the networking part of DBRX was the single hardest thing, I think, of the entire process. Just get MOE training, working at scale across a big cluster. We still managed to, I think, do it all with commodity parts, which was very exciting. You know, we were using FSTP and we eventually used HSTP so that we could have HSTP as a version of FSTP where you have multiple smaller replicas and you're doing data parallel within those replicas. And that helped a lot with network latency issues that we were running into just because we were transmitting so much data, you know, for every single part of the process. I think it actually, like, it was instructive for how Google designs their hardware and software together personally. Their training, as far as I understand, using kind of a 03 style of training and have been for a while. They also train mixture of expert models. TPUs have a very different network bandwidth to compute ratio. They have a lot more bandwidth just objectively. And TPUs per chip tend to be a little bit less compute intensive and have a little bit less memory. You know, it's just a different design choice. So the ratio of flops to bandwidth is very different. And that means that it's much easier for Google to be able to pull offSWYX [00:37:54]: some of this stuff.JONATHAN [00:37:54]: They also have interesting, you know, Torus style network architecture or Torus style, like, literal network architectureSWYX [00:38:00]: is not like the model,JONATHAN [00:38:00]: but the network.SWYX [00:38:02]: Is this the sort of block attention? I forgot what you call it. So this is just more or the,JONATHAN [00:38:07]: yeah, this is more, not the ring attention, but these are the ring all reduces. Like you have three different dimensions of rings because they kind of put you in these three dimensional Toruses from what I understand. And so like, you know, Google's infrastructure in some sense is kind of, I wouldn't say built for this, but maybe the way that Google trains models is built for a slightly different bit of infrastructure they have. And it's kind of neat to think about that. You know, as one thing that I think NVIDIA announced for, you know, for, for both the GH200 and the GB200 is this hybrid networking where you'll have blocks of NVLink network chips. I think for the GB200, I think it's like groups of 72 GPUs will all have NVLink to each other. So higher bandwidth, then you'll have normal networking of some kind, InfiniBand or Rocky or what have you between these blocks. And that's kind of a, you know, it's a change due to the fact that, you know, it's hard to build really high bandwidth networks over very large groups, but it is now a blocked networking. And you have to think about how you architect your model and your parallelism differently. You also have to think about fault tolerance differently because it now matters where you lose a GPU, whereas it didn't before. So, you know, it's, it's, it's just all really interesting and really fun speaking personally, but it's going to mean new nightmares when we all move to that generation and have to think about, you know, new versions of these problems.JOSH [00:39:20]: As you go up to larger scales, it gets quite different. Like right now, you know, if you're experiencing, let's say, for example, you experience a GPU failure every day, that's fine.SWYX [00:39:31]: Just restart.JOSH [00:39:31]: If you make your thing 24 times as big, now it's once an hour. Now it stops being quite as easy to just restart, right? So now you have to kind of break, like bake in this sort of redundancy that you didn't have before. So I think as you go up in scale, you end up running into like a lot of really interesting problems that also inform the, the actual like design. Yeah, I mean, as an orchestration guy,SWYX [00:39:52]: this is why I always emphasize like very cheap storage or very fast storage. So you can checkpoint more, but I don't think that's probably not the best solution to for fast, you know, training.JONATHAN [00:40:05]: Which works fine when you're doing language and then you move to vision or video. And then, you know, you have multi petabyte datasetsSWYX [00:40:12]: and getting, you know,JONATHAN [00:40:13]: cheap, fast multi petabyte storage starts to bite. Like I've certainly encountered issues where the literal data center where my GPUs were did not have enough, you know, object store to fit the datasets that people wanted to bring into that data center from whichever users were, were trying to bring them in. And then you get to a wholeSWYX [00:40:31]: different world of hurtJONATHAN [00:40:31]: where you have to keep your data in a different region because the region is just out of storage. So things get fun really fast.SWYX [00:40:39]: Speaking of vision, Josh, actually, you know, Embu is an agents company, but you're only, you're announcing a text-only model. What, where does, where does the vision side come in?JOSH [00:40:49]: I think we've actually done a lot of work in the past and people can see kind of our blog posts about sort of self-supervised learning and some other kind of vision-related stuff in the past as well. So we're very familiar with, with that stuff. But I think our main focus right now is on kind of, as we say, coding and reasoning. And there, there's certainly a visual component to some problems. But, you know, it's not necessarily required for all problems. And actually we found that for most of the kind of like code writing and, and reasoning problems that we care about, the visual part isn't really a huge important part of it. Sometimes if you really need to, you can maybe describeSWYX [00:41:24]: the thing.JOSH [00:41:24]: There are other like, you know, multimodal models that you can use off the shelf to sort of plug in for those particular piecesSWYX [00:41:30]: that you need, right?JOSH [00:41:30]: Like if something is driving a browser or whatever, like you can sometimes get away with not having to have that baked into the original model. So our folk were, you know, in a sense, we kind of do a lot across the stack. We're working on our own infrastructure and pre-training and RL and fine tuning and products and everything. But in another sense, we're very narrowly focused on the application side. So all of the stuff across the stack is kind of going toward a very particular purpose. And so that particular purpose right now doesn't really need vision. So we think that people are going to make all sorts of really cool image modelsSWYX [00:42:00]: like Jonathan, right?JOSH [00:42:00]: And all sorts of interesting multimodal models into the future. We'll let them go do that. That's great. We'll take advantage of that, partner with those people in the future. And right now we're really focused on kind of the core reasoning and coding capabilities and aspects of the model.SWYX [00:42:14]: I wanted to go into carbs since that's kind of the next layer of the stack. We talked about carbs in the first episode with Kanjin because you've actually had a blog post about it like a couple of years ago. Maybe let's introduce it.JONATHAN [00:42:26]: Has that been a couple of years now?JOSH [00:42:28]: No, it must have been at least one year. Hopefully it's not multiple years.SWYX [00:42:32]: Sorry, I'm counting AI time. Yeah, yeah. Yeah, I was going to sayJONATHAN [00:42:35]: you're making me feel really old right now.SWYX [00:42:39]: I count everything before the generally intelligent rename as like, you know, prehistory. Yeah. And now sort of modernity, right? So I actually thought carbs was more about hyperparameter optimization in a sense of like sort of parameters, hyperparameter search. Whereas, you know, when you introduced it, especially in this blog post, it's more about scaling laws and predictability of like, are we sort of in the right ballpark before we scale things up? Maybe sort of recount the history of carbs.JOSH [00:43:10]: Yeah, so it really is a little bit of both. So carbs is, it's maybe a backronym, but it's for cost aware Pareto region Bayesian search. So this is about technically how it works, but carbs is like, you know, we like pastries and stuff.SWYX [00:43:26]: So great, why not? But the point is thatJOSH [00:43:29]: it's a cost aware hyperparameter tuner. So most hyperparameter tuners, you kind of say, OK, here's this objective function. I want you to make this number as big as possible or as small as possible, whichever direction you want to go. So yeah, just go make this number, you know, as small as possible. OK, so it'll try a bunch of differentSWYX [00:43:46]: hyperparameters,JOSH [00:43:46]: a bunch of different configurationsSWYX [00:43:48]: to figure out, like,JOSH [00:43:48]: how do I tweak your network and architecture, et cetera, to get the kind of best performance I possibly can. That's usually saying, like, you know, almost all of these hyperparameter configurations are, let's say they're all going to use the same number of GPUs or the same number of nodes.SWYX [00:44:01]: So it's going to runJOSH [00:44:01]: for the same amount of time.SWYX [00:44:03]: So you can do that.JOSH [00:44:03]: You can get a number out and that's great. But what carbs does is it says,SWYX [00:44:07]: OK, actually,JOSH [00:44:07]: what if we relax that constraint? What if we say each of these different points, we're going to model how expensive it will be to sample this configuration. So if what if we train with just one one hundredth of the data? Like, how well can we do?SWYX [00:44:19]: What if we trainJOSH [00:44:19]: with one tenth of the data? What if we train with all the data? That way you can understand, like, as we get more and more data, as we spend more and more compute,SWYX [00:44:26]: as we make a biggerJOSH [00:44:26]: and bigger network, how does performance change with these things that change? Like how expensive it is to even explore this data point. So by doing that, we can see the scaling laws for not just, you know,SWYX [00:44:36]: the scaling lawsJOSH [00:44:36]: from like the, you know, Chantilla paper, the scaling laws for all parameters. We can see how does how does the number of layers change with this? How does the, you know, the learning rate change? How do the like, you know, various types of regularization change? So you can see these nice scaling laws. And as you're going across costs, like how should this be changing as you're scaling up your model? So that, coupled with the kind of metric that we chose, which is a very precise way of measuring performance, allowed us to really like hone in on parameters that worked really wellSWYX [00:45:05]: and understand, like,JOSH [00:45:05]: how do we want to scale those up, especially as we're changingSWYX [00:45:08]: things about the network?JOSH [00:45:08]: Like one of the things that we did is we used a custom tokenizer. As we change this tokenizer, changes a bunch of other things about the model. So how should we scale up this entirely new tokenizer? Like no one has ever made a model this large with this tokenizer before. And so how do we want toSWYX [00:45:22]: change all these things?JOSH [00:45:22]: Harps kind of shows you, like, look, as you change these parameters, like these other ones are kind of dependent on this.SWYX [00:45:28]: Like this is the, these areJOSH [00:45:28]: the relationships between them. So you can better understand, like, OK, if I'm going to scale this up 10x or 100x, like, where do I want to be? I can only go so far. And so, you know, we did run, like, I think maybe it was like a 14b one or somethingSWYX [00:45:40]: like that to check.JOSH [00:45:41]: But and so we had a bunch of like 1b or 14b and then at 70b. I don't think we had a, I think we just did like one at 14b. So you can, we get to check that like, oh, is this on the curve? Like, is this where we expect? It was like right there. So then great, go on to the next one. Yeah, I mean, that makes a lot of sense.SWYX [00:45:56]: I wonder if, so one of the key questions, and correct me if I'm wrong, but like usually people do search or do their evals just based on loss. But you actually evaluate based on, you know, the sort of end state evals that people might expect, like HellaSwag and Lombata, whatever. What is the norm here? Is there a norm?JOSH [00:46:20]: Yeah, I don't know if there's a hundred percent.SWYX [00:46:21]: I don't know. I only see loss on most people's reports.JOSH [00:46:25]: I think it's easy to, like, loss is very nice because it's very precise. It will tell you, like, very fine grained differences between like really small changes in your hyperparameters or network architecture. Whereas, especially at the smaller scales, if you're looking at like accuracy, it's very noisy. Like it might be zero or a hundred or like, you know, fluctuating by like 10 or 20 percentage points, which makes it really hard to tell, like, did that change actually mean anything? So our loss is sort of a combination of these two. Instead of saying, like, let's just look at perplexity, we say, let's look at perplexity on the tasks that we care about for multiple choice questions effectively.SWYX [00:47:00]: So we're saying like, yes,JOSH [00:47:00]: this is formulated as a multiple choice question, and we're going to look at the, like, you know, the loss of perplexity for this particular answer token. And that ends up being something that's like both targeted to what you actually care about and also very precise. The nice thing about this though is that it's independent of the data that you train on. One thing that's annoying about perplexity or about loss is that as you change your data set, this is really obnoxious because now it fundamentally changes your loss, right? And so you can't tell, like, how do I tweak my data set? But because we have this held out evaluation dat
Welcome to episode 257 of the Cloud Pod podcast – where the forecast is always cloudy! This week your hosts Justin, Matthew, Ryan, and Jonathan are in the barnyard bringing you the latest news, which this week is really just Meta's release of Llama 3. Seriously. That's every announcement this week. Don’t say we didn't warn you. Titles we almost went with this week: Meta Llama says no Drama No Meta Prob-llama Keep Calm and Llama on Redis did not embrace the Llama MK The bedrock of good AI is built on Llamas The CloudPod announces support for Llama3 since everyone else was doing it Llama3, better know as Llama Llama Llama The Cloud Pod now known as the LLMPod Cloud Pod is considering changing its name to LlamaPod Unlike WinAMP nothing whips the llamas ass A big thanks to this week's sponsor: Check out Sonrai Securities‘ new Cloud Permission Firewall. Just for our listeners, enjoy a 14 day trial at www.sonrai.co/cloudpod Follow Up 01:27 Valkey is Rapidly Overtaking Redis Valkey has continued to rack up support from AWS, Ericsson, Google, Oracle and Verizon initially, to now being joined by Alibaba, Aiven, Heroku and Percona backing Valkey as well. Numerous blog posts have come out touting Valkey adoption. I'm not sure this whole thing is working out as well as Redis CEO Rowan Trollope had hoped. AI Is Going Great – Or How AI Makes All It's Money 03:26 Introducing Meta Llama 3: The most capable openly available LLM to date Meta has launched Llama 3, the next generation of their state-of-the-art open source large language model. Llama 3 will be available on AWS, Databricks, GCP, Hugging Face, Kaggle, IBM WatsonX, Microsoft Azure, Nvidia NIM, and Snowflake with support from hardware platforms offered by AMD, AWS, Dell, Intel, Nvidia and Qualcomm Includes new trust and safety tools such as Llama Guard 2, Code Shield and Cybersec eval 2 They plan to introduce new capabilities, including longer context windows, additional model sizes and enhanced performance. The first two models from Meta Lama3 are the 8B and 70B parameter variants that can support a broad range of use cases. Meta shared some benchmarks against Gemma 7B and Mistral 7B vs the Lama 3 8B models and showed improvements across all major benchmarks. Including Math with Gemma 7b doing 12.2 vs 30 with Llama 3 It had highly comparable performance with the 70B model against Gemini Pro 1.5 and Claude 3 Sonnet scoring within a few points of most of the other scores. Jonathan recommends using LM Studio to get start playing around with LLMS, which you can find at https://lmstudio.ai/ 04:42 Jonathan – “Isn’t it funny how you go from an 8 billion parameter model to a 70 billion parameter model but nothing in between? Like you would have thought there would be some kind of like, some middle ground maybe? But, uh, but… No. But, um,
Welcome to the Car Dealership Guy Podcast. In this episode, I'm speaking with Royce Neubauer, CEO of Auto Hauler Exchange where we discuss: Building the "Amazon" of car shipping, Dirty secrets of the $70B vehicle logistics industry, How he's saving dealers millions with faster car transportation, and much more. This episode of the Car Dealership Guy Podcast is brought to you by: Experian - Experian Automotive is one of the most trusted sources of data and insights in the automobile industry and they're a one-stop-shop: Dealers can use automotive, lifestyle, and predictive data to find in-market car shoppers faster and at lower cost. By leveraging targeted audiences, dealers can communicate with car shoppers based on make, model, off-lease, off-loan, in-equity, fuel type, or vehicle servicing needs. Learn more @ https://bit.ly/3W1vrgY Auto Hauler Exchange - The Auto Hauler Exchange has revolutionized the way vehicle haulers and vehicle shippers collaborate. The Exchange is the first digital marketplace of its kind in the vehicle logistics industry. Learn more @ https://www.autohaulerexchange.com/ Car Dealership Guy Industry Job Board - Connecting world-class talent with top-notch companies in Automotive. Find your next role—or start hiring today—at CDGJobs.com. Interested in advertising with Car Dealership Guy? Drop us a line here Interested in being considered as a guest on the podcast? Add your name here Topics: (00:00:00) - Intro (00:04:42) - Royce's career and background (00:08:36) - How does the state of automotive logistics compare today to other industries? (00:22:47) - Do OEMs markup transportation to dealers? (0:28:41) - What are you seeing in the industry in terms of savings? (00:30:30) - Building predictability into the shipping process (00:34:00) - What rates are you seeing in the market? (00:36:28) - Why do carriers prefer to work with you? (00:38:47) - Do you think shipping to consumers is a tailwind? (00:41:39) - How is EV shipping impacting the industry? (00:43:51) - What is next for Auto Hauler and what do you predict for the industry? Check out Auto Hauler Exchange here. Check out the website for more (https://dealershipguy.com) and follow me on X @GuyDealership! (https://x.com/guydealership) This podcast is for informational purposes only and should not be relied upon as a basis for investment decisions.
US equities finished mixed in Wednesday trading, little changed after a strong start to week, with the Dow Jones down 11 bps, while the S&P500 and Nasdaq finished up 2bps and 10bps respectively. March headline and core durable goods orders beat though February revised down across the board. Today's $70B auction of five-year notes tailed after yesterday's strong two-year note auction. Visa was helped by accelerating US volumes.
The AI Breakdown: Daily Artificial Intelligence News and Discussions
Meta has released the first of its Llama 3 models, enhancing the landscape of open-source large language models. These models, including the initial 8B and 70B versions, promise advanced capabilities and integration into Meta's consumer products, solidifying Meta's commitment to leading in AI innovation. The release indicates Meta's strategic push to compete in the open-source domain and extend these high-level AI functionalities through Meta AI across its platforms, setting a new standard for accessibility and performance in the AI community. ** CHECK OUT THE JUST-LAUNCHED SUPERINTELLIGENT PLATFORM - 300+ AI video tutorials https://besuper.ai/ Consensus 2024 is happening May 29-31 in Austin, Texas. This year marks the tenth annual Consensus, making it the largest and longest-running event dedicated to all sides of crypto, blockchain and Web3. Use code AIBREAKDOWN to get 15% off your pass at https://go.coindesk.com/43SWugo ** ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/
AlabamaAL House has passed the CHOOSE Act, now sending over to state senateSen. Tuberville responds to bizarre claims from MSNBC Joy Reid re: IVF & babiesGovernor Ivey grants $150M to 5 Wiregrass counties for high speed internetA bill is filed this week to prohibit sex offenders from becoming a first responderA media competitor rips into 1819 news for asking about a pulled articleNational22 AGs report that HHS as lost 85 thousand illegal migrants in the systemIllegals at border say Biden is better than Trump, for letting them inHunter Biden giving deposition today to House Judiciary & Oversight committeesTX AG sues porn industry for not complying with state law on age verificationGoogle loses $70B in market shares after disastrous rollout of AI "Gemini"Dr. Harvey Risch talks Covid, CIA, coverup and bioweapons industry
The AI Breakdown: Daily Artificial Intelligence News and Discussions
While everyone has been watching the Gemini controversy (which wiped about $70B off of Alphabet/Google's marketcap yesterday), Bing's Sydney has quietly resurfaced with some disturbing comments. NLW explores the latest. ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/