Podcasts about imagenet

132 PODCASTS
225 EPISODES
43m AVG DURATION
1 MONTHLY NEW EPISODE
Apr 8, 2026 LATEST

POPULARITY (chart by year, 2019 to 2026)

Best podcasts about imagenet

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

10 episodes with imagenet

This Week in Machine Learning & Artificial Intelligence (AI) Podcast

7 episodes with imagenet

The Nonlinear Library

15 episodes with imagenet

Papers Read on AI

20 episodes with imagenet

a16z

5 episodes with imagenet

AI with AI

7 episodes with imagenet

Machine Learning Street Talk

3 episodes with imagenet

The Lunar Society

3 episodes with imagenet

WIRED Business – Spoken Edition

4 episodes with imagenet

Yannic Kilcher Videos (Audio Only)

8 episodes with imagenet

Eye On A.I.

2 episodes with imagenet

The Daktronics Experience

2 episodes with imagenet

London Futurists

2 episodes with imagenet

The Robot Brains Podcast

2 episodes with imagenet

Stanford MLSys Seminar

2 episodes with imagenet

Short And Sweet AI

2 episodes with imagenet

PaperPlayer biorxiv neuroscience

4 episodes with imagenet

The Nonlinear Library: LessWrong

6 episodes with imagenet

Latest podcast episodes about imagenet

Building Human-Centered AI

SHIFT

Apr 8, 2026 17:41

In the latest installment of our oral history project, we meet Dr. Fei-Fei Li, the "Godmother of AI," best known as the creator of ImageNet. She's a Chinese-American computer scientist, and ImageNet is the dataset that made rapid advances possible in computer vision, the field of AI that helps computers take meaningful information from things like photos and videos.
We Meet: Stanford University's Fei-Fei Li, author of "The Worlds I See: Curiosity, Exploration, and Discovery at the Dawn of AI" and the founder of World Labs
Credits: This episode of SHIFT was produced by Jennifer Strong with help from Emma Cillekens. It was mixed by Garret Lang, with original music from him and Jacob Gorski. Art by Meg Marco.

ai art shift discovery exploration chinese americans godmothers human centered fei fei li imagenet

Teknik - What if AI no longer needed your data? - Because... it's episode 0x735!

PolySécure Podcast

Apr 2, 2026 61:16

Because... it's episode 0x735!
Shameless plug
April 14 to 17, 2026 - Botconf 2026
April 20 to 22, 2026 - ITSec (15% discount code: Seqcure15)
April 28 and 29, 2026 - Cybereco Cyberconférence 2026
May 9 to 17, 2026 - NorthSec 2026
June 3 to 5, 2026 - SSTIC 2026
September 19, 2026 - Bsides Montréal
December 1 to 3, 2026 - Forum INCYBER - Canada 2026
February 24 and 25, 2027 - SéQCure 2027
Description
Back to technical roots
In this episode, the host catches up with Frédéric Grelot, an artificial intelligence expert he had not crossed paths with in a while. Frédéric left his executive role at Glimps, a company he had expanded into Canada, to return to his first loves: hands-on technical work and research. He joined AMIAD (Agence pour la maîtrise de l'IA de défense), an agency created in 2024 and attached to the French Ministry of the Armed Forces. It is a deliberate return to getting his "hands dirty," as he puts it himself. The thread running through the conversation is a provocative idea: can you do artificial intelligence without data? Frédéric warns from the outset that the phrasing is deliberately catchy and that the answer will be nuanced.
A history of AI in a few key milestones
To understand where things are heading, Frédéric looks back at the major breakthroughs that have shaped AI over roughly forty years.
1989 - The beginnings of convolutional networks. The LeNet-5 network, designed to read handwritten digits on checks, is one of the first concrete examples of a convolutional neural network. These networks work by stacking layers of analysis: the first layers detect simple shapes (points, lines, corners), and later ones detect more complex structures (wheels, side mirrors, then an entire car). This paradigm dominated the field for about twenty years.
2012 - The double revolution. Two simultaneous events set off an explosion in the field. On one hand, Nvidia democratized the use of GPUs for scientific computing through its CUDA API, making massively parallel matrix computation accessible. On the other, the ImageNet dataset was released as open access, with one million images spread across 1,000 categories, giving the community a common base for training and evaluating models. Together, these two factors triggered considerable excitement, particularly in computer vision.
2017 - The advent of transformers. The publication of the famous paper Attention is all you need introduced a new architecture that would become the standard of modern AI. Unlike earlier sequential approaches, the transformer analyzes each word of a sentence by relating it to all the words that precede it, progressively enriching the meaning of each element layer after layer. This ability to capture the global context of a sequence underlies all current large language models. Its main drawback is a compute cost that is quadratic in sequence length: doubling the length of a text quadruples the amount of computation. Research over the past eight years has largely focused on this problem, with impressive results: some open source models now reach context windows of one to two million tokens.
November 2022 - ChatGPT and democratization. The release of ChatGPT marks less a technological break than a break in usage. By putting into the hands of the general public what had existed only in laboratories, OpenAI turned a technical evolution into a social revolution. Hallucinations, jailbreaks, and model alignment then emerged as major issues.
Foundation models: AI that learns to live before it specializes
This is where Frédéric introduces his central concept: foundation models. His analogy is telling: a child who grows up to age 20, observing shapes, faces, and butterflies and flying kites, develops a generic understanding of the world that makes them an "excellent foundation model." They can then learn a specific trade, such as geometry or reverse engineering code, starting from that solid base rather than from zero. A foundation model is trained on massive quantities of raw data, without necessarily annotating every example. Once this generalist phase is complete, only a very small volume of specialized, annotated data is needed to bring the model to an excellent level on a specific task. Where millions of labeled examples were once required, a few hundred now suffice. This paradigm upends the balance of power around data. Frédéric, who used to advise his clients to "carefully keep all their data," now tells them to keep a little, of good quality. The rest is no longer indispensable.
Toward inference without training
The last development discussed is perhaps the most spectacular: zero-shot learning. Thanks to the richness of foundation models, it is now possible to show a single image of an unknown object, such as a car never seen during training, and immediately recognize it in other photos. No additional training is needed: the model understands the object from a single example. This is the sense in which one can speak of AI "without data": not that there was never any data, but that the end user no longer needs to supply it to benefit from capabilities once reserved for experts with vast annotated datasets.
A field in constant ferment
The conversation also covers the competitive dynamic between proprietary American models and open source models, notably Chinese (DeepSeek) and French (Mistral). The compute constraints imposed on Chinese players have paradoxically stimulated innovation, through techniques such as distillation, pruning, and optimization of attention architectures. The episode closes on an open note: hallucinations are receding, models are learning to say "I don't know," and the field keeps evolving at a steady pace, all good reasons to meet again for a future episode.
Contributors
Nicolas-Loïc Fortin
Frédéric Grelot
Credits
Editing by Intrasecure inc
Virtual studio by Riverside.fm
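The quadratic attention cost mentioned in the description above (doubling the text length quadruples the computation) can be checked with a few lines of arithmetic. This is only a counting sketch, not a transformer implementation: the function names are hypothetical, and it counts token-to-token attention scores per layer and per head while ignoring every other cost.

```python
def full_attention_scores(seq_len: int) -> int:
    """Pairwise scores when every token attends to every token."""
    return seq_len * seq_len


def causal_attention_scores(seq_len: int) -> int:
    """Pairwise scores when each token attends to itself and earlier tokens."""
    return seq_len * (seq_len + 1) // 2


if __name__ == "__main__":
    # Each doubling of the sequence length multiplies the full count by 4.
    for n in (1_000, 2_000, 4_000):
        print(n, full_attention_scores(n), causal_attention_scores(n))
```

The causal variant roughly halves the count but still grows with the square of the length, which is why long context windows call for architectural changes and not just a constant-factor saving.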

canada chatgpt attention ces ia vers openai arm nvidia riverside parce shameless montages gpu besoin donn devos aucun contrairement teknik mistral locaux double r imagenet

Unified Latents (UL): How to train your latents (Teaser for Feb 28th Technical Update)

AI Unraveled: Latest AI News & Trends, Master GPT, Gemini, Generative AI, LLMs, Prompting, GPT Store

Feb 28, 2026 2:04

Listen to Full Audio at https://podcasts.apple.com/us/podcast/scientist-vs-storyteller-benchmarking-gpt-5-2-claude/id1684415169?i=1000752001078
For years, Latent Diffusion Models, the tech behind Stable Diffusion and DALL-E, have relied on a bit of an 'art form' called KL-regularization. Basically, researchers had to manually guess how much to compress an image before the AI started to lose the details. If you compressed too much, the image got blurry. Too little, and the model became too expensive to train.
Enter Unified Latents, or UL.
In a new paper out of DeepMind Amsterdam, researchers have introduced a framework that replaces that guesswork with a single, cohesive mathematical objective. Instead of training the compressor and the generator separately, UL trains the Encoder, the Prior, and the Decoder all at once.
The 'Secret Sauce' here is something called Fixed Gaussian Noise Encoding. By injecting a constant, specific amount of noise during the encoding process, DeepMind has created a 'Maximum Precision Link.' This forces the encoder to be incredibly efficient, focusing only on the most important structures of an image.
The results are staggering: UL achieved a state-of-the-art Video Distance score on the Kinetics-600 dataset and hit a competitive 1.4 FID on ImageNet, all while using significantly less computational power than traditional methods.
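For readers unfamiliar with the KL-regularization trade-off the description refers to, the sketch below shows the standard KL-regularized autoencoder loss, in which a hand-picked weight balances reconstruction quality against compression. It is not the Unified Latents objective, which the description does not spell out; the function names and the `beta` weight are assumptions made for illustration.

```python
import math


def gaussian_kl(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def regularized_loss(recon_error, mu, log_var, beta):
    """Reconstruction error plus a KL penalty scaled by a hand-tuned beta."""
    return recon_error + beta * gaussian_kl(mu, log_var)


if __name__ == "__main__":
    mu, log_var = [1.0, 0.0], [0.0, 0.0]
    # A larger beta pushes the latent toward the prior (more compression).
    for beta in (0.1, 1.0, 10.0):
        print(beta, regularized_loss(0.25, mu, log_var, beta))
```

Choosing `beta` is the manual guesswork the description mentions: a larger weight penalizes informative latents more heavily, a smaller one leaves more detail for the generator to model.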

canada ai reach train human soccer architects technical loop c suite unified deepmind ul fid google deepmind stable diffusion senior software engineer decoder media kit arxiv kinetics imagenet encoder enterprise architects

Episode #524: The 500-Year Prophecy: Why Buddhism and AI Are Colliding Right Now

Crazy Wisdom

Jan 19, 2026 60:49

In this episode of the Crazy Wisdom podcast, host Stewart Alsop sits down with Kelvin Lwin for their second conversation exploring the fascinating intersection of AI and Buddhist cosmology. Lwin brings his unique perspective as both a technologist with deep Silicon Valley experience and a serious meditation practitioner who's spent decades studying Buddhist philosophy. Together, they examine how AI development fits into ancient spiritual prophecies, discuss the dangerous allure of LLMs as potentially "asura weapons" that can mislead users, and explore verification methods for enlightenment claims in our modern digital age. The conversation ranges from technical discussions about the need for better AI compilers and world models to profound questions about humanity's role in what Lwin sees as an inevitable technological crucible that will determine our collective spiritual evolution. For more information about Kelvin's work on attention training and AI, visit his website at alin.ai. You can also join Kelvin for live meditation sessions twice daily on Clubhouse at clubhouse.com/house/neowise.
Timestamps
00:00 Exploring AI and Spirituality
05:56 The Quest for Enlightenment Verification
11:58 AI's Impact on Spirituality and Reality
17:51 The 500-Year Prophecy of Buddhism
23:36 The Future of AI and Business Innovation
32:15 Exploring Language and Communication
34:54 Programming Languages and Human Interaction
36:23 AI and the Crucible of Change
39:20 World Models and Physical AI
41:27 The Role of Ontologies in AI
44:25 The Asura and Deva: A Battle for Supremacy
48:15 The Future of Humanity and AI
51:08 Persuasion and the Power of LLMs
55:29 Navigating the New Age of Technology
Key Insights
1. The Rarity of Polymath AI-Spirituality Perspectives: Kelvin argues that very few people are approaching AI through spiritual frameworks because it requires being a polymath with deep knowledge across multiple domains. Most people specialize in one field, and combining AI expertise with Buddhist cosmology requires significant time, resources, and academic background that few possess.
2. Traditional Enlightenment Verification vs. Modern Claims: There are established methods for verifying enlightenment claims in Buddhist traditions, including adherence to the five precepts and overcoming hell rebirth through karmic resolution. Many modern Western practitioners claiming enlightenment fail these traditional tests, often changing the criteria when they can't meet the original requirements.
3. The 500-Year Buddhist Prophecy and Current Timing: We are approximately 60 years into a prophesied 500-year period where enlightenment becomes possible again. This "startup phase of Buddhism revival" coincides with technological developments like the internet and AI, which are seen as integral to this spiritual renaissance rather than obstacles to it.
4. LLMs as UI Solution, Not Reasoning Engine: While LLMs have solved the user interface problem of capturing human intent, they fundamentally cannot reason or make decisions due to their token-based architecture. The technology works well enough to create an illusion of capability, leading people down an asymptotic path away from true solutions.
5. The Need for New Programming Paradigms: Current AI development caters too much to human cognitive limitations through familiar programming structures. True advancement requires moving beyond human-readable code toward agent-generated languages that prioritize efficiency over human comprehension, similar to how compilers already translate high-level code.
6. AI as Asura Weapon in Spiritual Warfare: From a Buddhist cosmological perspective, AI represents an asura (demon-realm) tool that appears helpful but is fundamentally wasteful and disruptive to human consciousness. Humanity exists as the battleground between divine and demonic forces, with AI serving as a weapon that both sides employ in this cosmic conflict.
7. 2029 as Critical Convergence Point: Multiple technological and spiritual trends point toward 2029 as when various systems will reach breaking points, forcing humanity to either transcend current limitations or be consumed by them. This timing aligns with both technological development curves and spiritual prophecies about transformation periods.

amazon ai power future navigating christianity western impact robots spirituality chatgpt humanity quest silicon valley reddit civil war weapons titans sexuality clubhouse oracle decision making yahoo buddhist academia siri buddhism new age enlightenment nvidia scheduling persuasion hinduism orthodox crucible battleground sanskrit singularity tokens tower of babel hallucinations vipassana sensuality gpus greek mythology simulation theory sexual misconduct computer vision ontology scientific method polymath nonduality inference knowledge management confucianism exploring ai rarity colliding programming languages fourth turning asura theravada compiler yann lecun monasteries raw power samatha guilty conscience tang dynasty great conjunction functional programming theosis mind virus brain scans categorization apostolic succession imagenet object oriented programming water consumption power consumption peak experience crazy wisdom overfitting third noble truth assembly language year prophecy

Episode 153: 2025 Holiday Gift Guide

Teaching Python

Dec 14, 2025 40:12

Julian Sequeira from PyBites joins Sean and Kelly to share their top holiday gift picks for coders, makers, and educators. This episode features 15+ gift ideas ranging from budget-friendly maker tools to classroom robots, plus book recommendations, coding platforms, and a few surprises.
Show Notes
Wins of the Week
Julian: Staying focused on "the one thing" at PyBites, plus 3D printing a custom cappuccino stencil for his local café
Kelly: Surviving a muddy, clay-covered hill in North Carolina while on vacation
Sean: Designing and 3D printing a custom bracket for his screen door using Fusion 360
Holiday Gift Ideas
Julian's Picks
Hoverboard with Go-Kart Attachment (~$299 AUD) - Two-wheeled self-balancing boards that can convert to a go-kart with a third wheel attachment. Available at Hoveroo (https://hoveroo.com.au) in Australia.
Secret Coders Book Series (~$10-20 USD each) - A six-book graphic novel series that wraps coding puzzles and concepts into mystery stories. Recommended by Faye Shaw from the Boston PyLadies community. Great for ages 8-15.
3D Printer (~$200-300 USD) - Entry-level printers like the Bambu Lab A1 Mini or Elegoo Neptune 4 Pro have dropped significantly in price. Look for auto bed leveling as a key feature.
Duolingo Chess (~$13/month with subscription) - A new addition to Duolingo that teaches chess tactics, strategy, and formal terminology through structured lessons. Great for building problem-solving skills.
Classic Video Games (Zelda, Pokémon) - Story-driven games that build resilience and problem-solving skills, as an alternative to dopamine-heavy platforms like Roblox.
Kelly's Picks
Soccer Bot (~$59.99) - An indoor soccer training robot that challenges footwork skills. Works best on hard floors.
"The Worlds I See" by Dr. Fei-Fei Li - Memoir of the computer scientist behind ImageNet and modern image recognition, covering her immigrant journey and rise in AI. A must-read for anyone interested in AI.
LEGO Retro Radio Building Set (~$99) - A 1970s-style radio that you build, then insert your phone to play music. Features working dials that create authentic radio crackle sounds.
Spydroid Loco Hex Robot (classroom investment) - A large spider-shaped robot that can be programmed in Python and block programming. Features LIDAR and AI-based mapping. Seen at ISTE.
Reachy Mini from Hugging Face ($299-$449) - An adorable AI desktop companion robot with onboard models. Two versions: one that connects to your computer and one that's self-contained.
Sean's Picks
LED Pucks (LED 001 Kit) (~$6-13) - Small USB-powered LED discs perfect for 3D printed projects like planet lamps. Available from Bambu Lab or Amazon. RGB versions include remote controls.
Daily Desk Calendar (~$15-20) - A throwback gift that provides daily doses of humor, trivia, or inspiration. Suggestions include The Far Side, "They Can Talk," or "How to Win Friends and Influence People."
PyBites Coding Platform (subscription) - Bite-sized Python challenges for sharpening coding skills. Great for teachers, students, and professionals looking for practical coding practice.
Digital Calipers (~$40-50) - USB-rechargeable precision measuring tools essential for 3D printing and maker projects. Great for teaching geometry and measurement concepts.
Deburring Tool (~$10) - A small tool with a curved swiveling blade for cleaning up 3D prints. A quality-of-life improvement for any maker's toolkit.
Links Mentioned
PyBites (https://pybit.es) - Python coaching and coding challenges
Hoveroo (https://hoveroo.com.au) - Hoverboards (Australia)
Bambu Lab (https://bambulab.com) - 3D printers and LED pucks
Printables (https://www.printables.com) - 3D printing models
MakerWorld (https://makerworld.com) - 3D printing models
Hugging Face Reachy Mini (https://huggingface.co) - AI companion robot
Duolingo (https://duolingo.com) - Language learning app with chess
Secret Coders book series - Available on Amazon
"The Worlds I See" by Dr. Fei-Fei Li - Available at bookstores
Upcoming Events
PyCon US 2026 - Long Beach, California
Education Summit - Proposals open after the holidays, with a deadline around March/April. Submit proposals when the website opens!
Special Guest: Julian Sequeira.

What Comes After ChatGPT? The Mother of ImageNet Predicts The Future

a16z

Dec 5, 2025 61:56

Fei-Fei Li is a Stanford professor, co-director of Stanford Institute for Human-Centered Artificial Intelligence, and co-founder of World Labs. She created ImageNet, the dataset that sparked the deep learning revolution. Justin Johnson is her former PhD student, ex-professor at Michigan, ex-Meta researcher, and now co-founder of World Labs.
Together, they just launched Marble, the first model that generates explorable 3D worlds from text or images.
In this episode Fei-Fei and Justin explore why spatial intelligence is fundamentally different from language, what's missing from current world models (hint: physics), and the architectural insight that transformers are actually set models, not sequence models.
Resources:
Follow Fei-Fei on X: https://x.com/drfeifei
Follow Justin on X: https://x.com/jcjohnss
Follow Shawn on X: https://x.com/swyx
Follow Alessio on X: https://x.com/fanahova
Stay Updated:
If you enjoyed this episode, please be sure to like, subscribe, and share with your friends.
Follow a16z on X: https://x.com/a16z
Follow a16z on LinkedIn: https://www.linkedin.com/company/a16z
Follow the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Follow the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711
Follow our host: https://twitter.com/eriktorenberg
Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details, please see http://a16z.com/disclosures.
Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Nov 25, 2025 60:38

Fei-Fei Li and Justin Johnson are cofounders of World Labs, who have recently launched Marble (https://marble.worldlabs.ai/), a new kind of generative “world model” that can create editable 3D environments from text, images, and other spatial inputs. Marble lets creators generate persistent 3D worlds, precisely control cameras, and interactively edit scenes, making it a powerful tool for games, film, VR, robotics simulation, and more. In this episode, Fei-Fei and Justin share how their journey from ImageNet and Stanford research led to World Labs, why spatial intelligence is the next frontier after LLMs, and how world models could change how machines see, understand, and build in 3D.
We discuss:
* The massive compute scaling from AlexNet to today and why world models and spatial data are the most compelling way to “soak up” modern GPU clusters compared to language alone.
* What Marble actually is: a generative model of 3D worlds that turns text and images into editable scenes using Gaussian splats, supports precise camera control and recording, and runs interactively on phones, laptops, and VR headsets.
* Fei-Fei's essay (https://drfeifei.substack.com/p/from-words-to-worlds-spatial-intelligence) on spatial intelligence as a distinct form of intelligence from language: from picking up a mug to inferring the 3D structure of DNA, and why language is a lossy, low-bandwidth channel for describing the rich 3D/4D world we live in.
* Whether current models “understand” physics or just fit patterns: the gap between predicting orbits and discovering F=ma, and how attaching physical properties to splats and distilling physics engines into neural networks could lead to genuine causal reasoning.
* The changing role of academia in AI, why Fei-Fei worries more about under-resourced universities than “open vs closed,” and how initiatives like national AI compute clouds and open benchmarks can rebalance the ecosystem.
* Why transformers are fundamentally set models, not sequence models, and how that perspective opens up new architectures for world models, especially as hardware shifts from single GPUs to massive distributed clusters.
* Real use cases for Marble today: previsualization and VFX, game environments, virtual production, interior and architectural design (including kitchen remodels), and generating synthetic simulation worlds for training embodied agents and robots.
* How spatial intelligence and language intelligence will work together in multimodal systems, and why the goal isn't to throw away LLMs but to complement them with rich, embodied models of the world.
* Fei-Fei and Justin's long-term vision for spatial intelligence: from creative tools for artists and game devs to broader applications in science, medicine, and real-world decision-making.
Fei-Fei Li
* X: https://x.com/drfeifei
* LinkedIn: https://www.linkedin.com/in/fei-fei-li-4541247
Justin Johnson
* X: https://x.com/jcjohnss
* LinkedIn: https://www.linkedin.com/in/justin-johnson-41b43664
Where to find Latent Space
* X: https://x.com/latentspacepod
Timestamps
00:00:00 Introduction and the Fei-Fei Li & Justin Johnson Partnership
00:02:00 From ImageNet to World Models: The Evolution of Computer Vision
00:12:42 Dense Captioning and Early Vision-Language Work
00:19:57 Spatial Intelligence: Beyond Language Models
00:28:46 Introducing Marble: World Labs' First Spatial Intelligence Model
00:33:21 Gaussian Splats and the Technical Architecture of Marble
00:22:10 Physics, Dynamics, and the Future of World Models
00:41:09 Multimodality and the Interplay of Language and Space
00:37:37 Use Cases: From Creative Industries to Robotics and Embodied AI
00:56:58 Hiring, Research Directions, and the Future of World Labs
Get full access to Latent.Space at www.latent.space/subscribe

ai future real space dna language 3d hiring vr stanford intelligence models physics dynamics robotics labs gpu vfx spatial marble gpus interplay latent fei justin johnson gaussian fei fei li imagenet multimodality fei fei technical architecture

After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Nov 25, 2025

Fei-Fei Li and Justin Johnson are cofounders of World Labs, who have recently launched Marble (https://marble.worldlabs.ai/), a new kind of generative “world model” that can create editable 3D environments from text, images, and other spatial inputs. Marble lets creators generate persistent 3D worlds, precisely control cameras, and interactively edit scenes, making it a powerful tool for games, film, VR, robotics simulation, and more. In this episode, Fei-Fei and Justin share how their journey from ImageNet and Stanford research led to World Labs, why spatial intelligence is the next frontier after LLMs, and how world models could change how machines see, understand, and build in 3D. We discuss: The massive compute scaling from AlexNet to today and why world models and spatial data are the most compelling way to “soak up” modern GPU clusters compared to language alone. What Marble actually is: a generative model of 3D worlds that turns text and images into editable scenes using Gaussian splats, supports precise camera control and recording, and runs interactively on phones, laptops, and VR headsets. Fei-fei's essay (https://drfeifei.substack.com/p/from-words-to-worlds-spatial-intelligence) on spatial intelligence as a distinct form of intelligence from language: from picking up a mug to inferring the 3D structure of DNA, and why language is a lossy, low-bandwidth channel for describing the rich 3D/4D world we live in. Whether current models “understand” physics or just fit patterns: the gap between predicting orbits and discovering F=ma, and how attaching physical properties to splats and distilling physics engines into neural networks could lead to genuine causal reasoning. The changing role of academia in AI, why Fei-Fei worries more about under-resourced universities than “open vs closed,” and how initiatives like national AI compute clouds and open benchmarks can rebalance the ecosystem. 
Why transformers are fundamentally set models, not sequence models, and how that perspective opens up new architectures for world models, especially as hardware shifts from single GPUs to massive distributed clusters. Real use cases for Marble today: previsualization and VFX, game environments, virtual production, interior and architectural design (including kitchen remodels), and generating synthetic simulation worlds for training embodied agents and robots. How spatial intelligence and language intelligence will work together in multimodal systems, and why the goal isn't to throw away LLMs but to complement them with rich, embodied models of the world. Fei-Fei and Justin's long-term vision for spatial intelligence: from creative tools for artists and game devs to broader applications in science, medicine, and real-world decision-making. — Fei-Fei Li X: https://x.com/drfeifei LinkedIn: https://www.linkedin.com/in/fei-fei-li-4541247 Justin Johnson X: https://x.com/jcjohnss LinkedIn: https://www.linkedin.com/in/justin-johnson-41b43664 Where to find Latent Space X: https://x.com/latentspacepod Substack: https://www.latent.space/ Chapters 00:00:00 Introduction and the Fei-Fei Li & Justin Johnson Partnership 00:02:00 From ImageNet to World Models: The Evolution of Computer Vision 00:12:42 Dense Captioning and Early Vision-Language Work 00:19:57 Spatial Intelligence: Beyond Language Models 00:28:46 Introducing Marble: World Labs' First Spatial Intelligence Model 00:33:21 Gaussian Splats and the Technical Architecture of Marble 00:22:10 Physics, Dynamics, and the Future of World Models 00:41:09 Multimodality and the Interplay of Language and Space 00:37:37 Use Cases: From Creative Industries to Robotics and Embodied AI 00:56:58 Hiring, Research Directions, and the Future of World Labs

ai future real dna language 3d hiring vr stanford intelligence models physics dynamics robotics labs gpu vfx spatial marble gpus interplay fei justin johnson gaussian fei fei li imagenet multimodality fei fei technical architecture

The Godmother of AI on jobs, robots & why world models are next | Dr. Fei-Fei Li

Lenny's Podcast: Product | Growth | Career

Nov 16, 2025 79:34

Dr. Fei-Fei Li is known as the “godmother of AI.” She's been at the center of AI's biggest breakthroughs for over two decades. She spearheaded ImageNet, the dataset that sparked the deep-learning revolution we're living right now, served as Google Cloud's Chief AI Scientist, directed Stanford's Artificial Intelligence Lab, and co-founded Stanford's Institute for Human-Centered AI. In this conversation, Fei-Fei shares the rarely told history of how we got here, including the wild fact that just nine years ago, calling yourself an AI company was basically a death sentence.
We discuss:
1. How ImageNet helped spark the AI explosion we're living through
2. Why world models and spatial intelligence represent the next frontier in AI, beyond large language models
3. Why Fei-Fei believes AI won't replace humans but will require us to take responsibility for ourselves
4. The surprising applications of Marble, from movie production to psychological research
5. Why robotics faces unique challenges compared with language models and what's needed to overcome them
6. How to participate in AI regardless of your role
Brought to you by:
Figma Make: A prompt-to-code tool for making ideas real
Justworks: The all-in-one HR solution for managing your small business with confidence
Sinch: Build messaging, email, and calling into your product
Transcript: https://www.lennysnewsletter.com/p/the-godmother-of-ai
My biggest takeaways (for paid newsletter subscribers): https://www.lennysnewsletter.com/i/178223233/my-biggest-takeaways-from-this-conversation
Where to find Dr. Fei-Fei Li
• X: https://x.com/drfeifei
• LinkedIn: https://www.linkedin.com/in/fei-fei-li-4541247
• World Labs: https://www.worldlabs.ai
Where to find Lenny:
• Newsletter: https://www.lennysnewsletter.com
• X: https://twitter.com/lennysan
• LinkedIn: https://www.linkedin.com/in/lennyrachitsky/
In this episode, we cover:
(00:00) Introduction to Dr. Fei-Fei Li
(05:31) The evolution of AI
(09:37) The birth of ImageNet
(17:25) The rise of deep learning
(23:53) The future of AI and AGI
(29:51) Introduction to world models
(40:45) The bitter lesson in AI and robotics
(48:02) Introducing Marble, a revolutionary product
(51:00) Applications and use cases of Marble
(01:01:01) The founder's journey and insights
(01:10:05) Human-centered AI at Stanford
(01:14:24) The role of AI in various professions
(01:18:16) Conclusion and final thoughts
References: https://www.lennysnewsletter.com/p/the-godmother-of-ai
Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@lennyrachitsky.com.
Lenny may be an investor in the companies discussed. To hear more, visit www.lennysnewsletter.com

ai institute robots jobs human production stanford conclusion models applications references lenny agi google cloud marble godmothers fei fei li imagenet next dr fei fei artificial intelligence lab

The Frontier of Spatial Intelligence with Fei-Fei Li

a16z

Play Episode Listen Later Nov 13, 2025 44:11

Fei-Fei Li and Justin Johnson are pioneers in AI. While the world has only recently witnessed a surge in consumer AI, they have long been laying the groundwork for the innovations transforming industries today.With the recent launch of Marble, the first product from their company World Labs, we are revisiting this conversation to explore the ideas that started it all. World Labs is focused on spatial intelligence, building Large World Models that can perceive, generate, and interact with the 3D world. Marble brings that vision to life, allowing anyone, from individual creators to major platforms, to generate 3D scenes directly from text or image prompts and turn complex 3D creation into a simple, creative process.In this episode, a16z general partner Martin Casado talks with Fei-Fei and Justin about the journey from early AI winters to the rise of deep learning and multimodal AI. From foundational breakthroughs like ImageNet to the cutting-edge realm of spatial intelligence, they discuss the evolution of the field and what is next for innovation at World Labs. Timecode:0:00 – The Next Decade of AI2:45 – Origins: Backgrounds of the Founders6:50 – The Rise of Deep Learning & ImageNet8:00 – Algorithmic Unlocks: Compute, Data, and Supervised Learning12:00 – From Predictive to Generative AI16:20 – The Journey to Spatial Intelligence18:35 – Defining Spatial Intelligence21:15 – 3D Data, Computer Vision, and Breakthroughs23:15 – Reconstruction vs. Generation in Computer Vision24:45 – Spatial Intelligence vs. 
Language Models 29:00 – Applications: Virtual, Augmented, and Physical Worlds 39:55 – Building World Labs: Team and Vision 41:55 – The North Star: Measuring Success in Spatial Intelligence Resources: Learn more about World Labs: https://www.worldlabs.ai Learn more about Marble: https://Marble.WorldLabs.ai Find Fei-Fei on Twitter: https://x.com/drfeifei Find Justin on Twitter: https://x.com/jcjohnss Find Martin on Twitter: https://x.com/martin_casado Stay Updated: If you enjoyed this episode, be sure to like, subscribe, and share with your friends! Find a16z on X: https://x.com/a16z Find a16z on LinkedIn: https://www.linkedin.com/company/a16z Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711 Follow our host: https://x.com/eriktorenberg Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

spotify ai data 3d generation intelligence frontier reconstruction simplecast spatial marble augmented next decade computer vision justin johnson fei fei li imagenet fei fei martin casado

What's Your Problem: "Teaching Computers to See"

If/Then: Research findings to help us navigate complex issues in business, leadership, and society

Play Episode Listen Later Oct 29, 2025 27:03

This week on If/Then, we're sharing an episode of What's Your Problem?, a show from Pushkin Industries where entrepreneurs, engineers, and scientists talk about the future they're trying to build—and the problems they must solve to get there. Hosted by former Planet Money co-host Jacob Goldstein, each conversation explores the challenges and breakthroughs shaping the next wave of innovation.In this episode, Goldstein speaks with Fei-Fei Li, Stanford computer scientist, former Chief Scientist of AI and Machine Learning at Google, and one of the most influential figures in the field of computer vision. Li reflects on her pioneering work developing ImageNet, the massive dataset that helped spark the modern AI revolution, and the “north star” questions that have guided her research from neuroscience to machine learning.Together, they trace how a single insight about how humans see the world led to a paradigm shift in artificial intelligence—and how Li's vision continues to shape the way we teach machines to see, learn, and collaborate with us.More Resources: • Fei Fei Li • Stanford Institute for Human-Centered Artificial Intelligence (HAI) • ImageNet • What's Your Problem?If/Then is a podcast from Stanford Graduate School of Business that examines research findings that can help us navigate the complex issues we face in business, leadership, and society.Chapters: (00:00:00) Introducing “What's Your Problem?” Kevin Cool introduces the Pushkin Industries podcast hosted by Jacob Goldstein.00:00:45 — What Is Computer Vision? 
Jacob Goldstein and Fei-Fei Li explain how machines learn to see and interpret images.00:03:18 — Real-World Uses of AI Vision Li shares examples from healthcare, robotics, and environmental science.00:05:06 — Discovering the Science of SeeingHow human vision research inspired Li's lifelong “north star” in AI.00:09:56 — Creating ImageNet Li builds a massive image database that transforms computer vision research.00:13:29 — Defining 30,000 Visual Concepts How cognitive science helped shape ImageNet's massive scale.00:16:41 — Building the Dataset by HandLi's team uses global crowdsourcing to label millions of images.00:19:38 — The 2012 Breakthrough Jeff Hinton's neural network shatters records and sparks the deep learning era.00:22:19 — Data Meets Hardware Li reflects on how big data and GPUs converged to power modern AI.00:24:55 — Lightning Round with Fei-Fei Li Quick insights on resilience, mentorship, and the future of human-AI collaboration.See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

How to Benchmark Your Pricing Like AI Models with Steven Forth

Impact Pricing

Play Episode Listen Later Jul 21, 2025 35:20

Steven Forth is the Co-founder and Chief Value Officer at Ibbaka, a leading value and pricing consulting firm. With deep expertise in AI applications for pricing and value modeling, Steven is at the forefront of developing intelligent agents that help businesses understand and communicate value more effectively. His work focuses on the intersection of artificial intelligence, pricing strategy, and value creation, making him a pioneer in applying AI to solve complex pricing challenges. In this episode, Steven shares his insights on how benchmarking is revolutionizing both AI development and pricing strategy. Drawing parallels between how AI models are improved through benchmarking and how pricing models should be evaluated, he introduces a framework for measuring pricing effectiveness that could transform how we approach pricing decisions. Together with Mark, they explore the challenges of establishing "truth" in pricing, the role of synthetic data, and the future of AI-powered pricing tools. Why you have to check out today's podcast: Discover how AI benchmarking principles can revolutionize pricing model evaluation. Understand how to evaluate pricing models from both buyer and seller perspectives. Explore the future of AI-powered pricing tools and what it means for pricing professionals. "We don't start with the truth. We have to work our way towards truth through multiple iterations and applications." 
– Steven Forth Topics Covered: 02:15 – How Intercom's FinAI agent uses daily benchmarking to improve ticket resolution performance 05:30 – Why AI's success is built on benchmarking and how it emerged from the ImageNet competition 08:45 – The critical problem: pricing lacks standardized benchmarking like AI models have 11:20 – Michael Mansard's 12-factor pricing model assessment and its potential as an industry standard 14:10 – Why pricing models must be evaluated from both buyer and seller perspectives 17:25 – How market segmentation and use cases complicate pricing model benchmarking 20:40 – The role of synthetic data in pricing research and model validation 24:15 – Why "vibe coding" could disrupt traditional pricing consulting within 3 years 27:30 – The search for truth in pricing: hedonic pricing models and market assumptions 31:45 – Introduction to ValueIQ: Ibbaka's new AI agent for value-based selling Key Takeaways: "Anyone who says that they're data centric or data driven is actually before that they have to be model driven because they're using some form of model to organize the data." – Steven Forth "We should have done this 20 years ago. What were we thinking? Well, we weren't thinking. And we didn't have ways to do this for us anyway." – Steven Forth (on developing pricing benchmarks) "Benchmarking every day, I think, is going to be critical to the success of agents that do important business things." – Steven Forth "You can always improve your measurement, but at some point the return of improving the measurement is lower than the cost of increasing the validity of the measurement." 
– Steven Forth Resources and People Mentioned: Douglas Hubbard's How to Measure Anything (book): https://www.amazon.com/How-Measure-Anything-Intangibles-Business/dp/1118539273 ImageNet: https://image-net.org/ Michael Mansard's 12-Factor Pricing Model: https://www.insead.edu/bio/michael-mansard-0 Intercom's FinAI: https://www.intercom.com/help/en/articles/8205718-fin-ai-agent-resolutions Lovable, Replit, Bolt: https://linkblink.medium.com/bolt-vs-cursor-vs-replit-vs-lovable-ai-coders-comparison-guide-3b9d41e75810 ValueIQ: https://www.ibbaka.com/ibbaka-market-blog/get-ready-for-valueiq-sign-up-now-for-beta-access Connect with Steven Forth: LinkedIn: https://www.linkedin.com/in/stevenforth/ Email: steven@ibbaka.com Connect with Mark Stiving: LinkedIn: https://www.linkedin.com/in/stiving/ Email: mark@impactpricing.com

ai discover explore drawing pricing bolt benchmark benchmarking lovable intercom ai models replit imagenet measure anything

Teaching AI to Understand the Physical World, with Dr. Fei-Fei Li of World Labs

No Priors: Artificial Intelligence | Machine Learning | Technology | Startups

Play Episode Listen Later Jun 5, 2025 35:53

In this episode of No Priors, Sarah and Elad are joined by Dr. Fei-Fei Li, AI pioneer, co-director of Stanford's Human-Centered AI Institute, and founder of World Labs. Fei-Fei shares why she's building at the intersection of embodiment and intelligence, and what today's AI systems are still missing. From the early days of ImageNet to her vision for the next generation of robotics, she unpacks the human and technical motivations behind World Labs. They also discuss the challenges of 3D world modeling, her approach to building exceptional teams, and the special qualities that have led her students like Andrej Karpathy to make major breakthroughs. Show Notes: 0:00 Why and what Dr. Fei-Fei Li is building 3:00 World models at World Labs 6:44 Missing gaps in the AI future 9:16 Robotics and physical intelligence 16:15 Greatest challenges of 3D 19:08 Fei-Fei's work in PhD in ImageNet 23:05 Special moments in Dr. Li's career 29:33 Building teams 32:05 Human-centered AI

world ai building phd teaching missing 3d human stanford li robotics labs greatest elad physical world andrej karpathy fei fei li imagenet fei fei no priors

408. Synthetic Text Extruder Hype (ft. Emily Bender, Alex Hanna)

This Machine Kills

Play Episode Listen Later Jun 4, 2025 79:58

We chat with Emily Bender and Alex Hanna — authors of AI Con: How to Fight Big Tech's Hype and Create the Future We Want — and pierce the veil of hype by getting into how these systems actually work and, importantly, the work they cannot do despite claims by boosters and doomers alike. Think of datasets like ImageNet or LAION-5B as big vats of pink slime and LLMs like ChatGPT as “synthetic text extruding machines” that turn pink slime into nuggets of text. It's easy to forget that these magical mystery machines are direct descendants of very unexciting things like “T9 word.” We end the episode by chatting about why we shouldn't trust the hype about how AI is going to destroy (or revolutionize) the education sector. ••• The AI Con | Emily Bender and Alex Hanna https://thecon.ai/ ••• On the genealogy of machine learning datasets: A critical history of ImageNet https://journals.sagepub.com/doi/full/10.1177/20539517211035955 ••• Mystery AI Hype Theater 3000 https://www.dair-institute.org/maiht3k/ Standing Plugs: ••• Order Jathan's new book: https://www.ucpress.edu/book/9780520398078/the-mechanic-and-the-luddite ••• Subscribe to Ed's substack: https://substack.com/@thetechbubble ••• Subscribe to TMK on patreon for premium episodes: https://www.patreon.com/thismachinekills Hosted by Jathan Sadowski (bsky.app/profile/jathansadowski.com) and Edward Ongweso Jr. (www.x.com/bigblackjacobin). Production / Music by Jereme Brown (bsky.app/profile/jebr.bsky.social)

ai chatgpt hype bender synthetic t9 imagenet alex hanna production music edward ongweso jr

Understanding the Elegant Math Behind Modern Machine Learning

TechSurge: The Deep Tech Podcast

Play Episode Listen Later Feb 27, 2025 74:43

Artificial intelligence is evolving at an unprecedented pace—what does that mean for the future of technology, venture capital, business, and even our understanding of ourselves? Award-winning journalist and writer Anil Ananthaswamy joins us for our latest episode to discuss his latest book Why Machines Learn: The Elegant Math Behind Modern AI.Anil helps us explore the journey and many breakthroughs that have propelled machine learning from simple perceptrons to the sophisticated algorithms shaping today's AI revolution, powering GPT and other models. The discussion aims to demystify some of the underlying mathematical concepts that power modern machine learning, to help everyone grasp this technology impacting our lives–even if your last math class was in high school. Anil walks us through the power of scaling laws, the shift from training to inference optimization, and the debate among AI's pioneers about the road to AGI—should we be concerned, or are we still missing key pieces of the puzzle? The conversation also delves into AI's philosophical implications—could understanding how machines learn help us better understand ourselves? And what challenges remain before AI systems can truly operate with agency?If you enjoy this episode, please subscribe and leave us a review on your favorite podcast platform. 
Sign up for our newsletter at techsurgepodcast.com for exclusive insights and updates on upcoming TechSurge Live Summits. Links: Read Why Machines Learn, Anil's latest book on the math behind AI: https://www.amazon.com/Why-Machines-Learn-Elegant-Behind/dp/0593185749 Learn more about Anil Ananthaswamy's work and writing: https://anilananthaswamy.com/ Watch Anil Ananthaswamy's TED Talk on AI and intelligence: https://www.ted.com/speakers/anil_ananthaswamy Discover the MIT Knight Science Journalism Fellowship that shaped Anil's AI research: https://ksj.mit.edu/ Understand the Perceptron, the foundation of neural networks: https://en.wikipedia.org/wiki/Perceptron Read about the Perceptron Convergence Theorem and its significance: https://www.nature.com/articles/323533a0

ai modern artificial intelligence math ted talks artificial transformers machine learning venture capital gpt agi elegant gpus deep tech anil neural networks ai models imagenet perceptron backpropagation convolutional neural networks

[Editorial] Comparing AIs with one another makes no sense

Monde Numérique - Jérôme Colombain

Play Episode Listen Later Feb 21, 2025 4:57

Elon Musk presented Grok 3 as "the smartest AI on Earth," but does that claim hold up? With the proliferation of artificial intelligences, from ChatGPT to Mistral by way of Grok and Perplexity, one question keeps coming back: which one is the best? Yet trying to compare them as a whole does not really make sense, because each AI has its own particularities and excels in some areas while showing limits in others. Performance, accuracy of answers, speed, cost, environmental impact... On which criteria should they be compared? In addition, each user has their own expectations and biases, which shape the perception of the "best" AI. There are ranking tools, such as Chatbot Arena or the French compareia.beta.gouv.fr, that allow blind comparison of AIs by focusing on the quality of the answers. Technical benchmarks such as GLUE, SQuAD, or ImageNet also provide more precise evaluations of specific skills. However, it is hard to say that one AI is better overall than another. Some excel at translation, others at code generation, news search, or content creation. Rather than looking for a universally superior AI, it is better to identify the one that best fits each specific need. Links: https://lmarena.ai/ https://www.comparia.beta.gouv.fr/ Keywords: artificial intelligence, AI, Grok 3, Elon Musk, ChatGPT, Mistral, Perplexity, AI comparison, AI benchmark, chatbot arena, DINUM, compareia, GPT-4, generative AI, machine learning, language model ----------- ♥️ Support Monde Numérique: https://donorbox.org/monde-numerique

Prof. Jakob Foerster - ImageNet Moment for Reinforcement Learning?

Machine Learning Street Talk

Play Episode Listen Later Feb 18, 2025 53:31

Prof. Jakob Foerster, a leading AI researcher at Oxford University and Meta, and Chris Lu, a researcher at OpenAI, explain how AI is moving beyond just mimicking human behaviour to creating truly intelligent agents that can learn and solve problems on their own. Foerster champions open-source AI for responsible, decentralised development. He addresses AI scaling, goal misalignment (Goodhart's Law), and the need for holistic alignment, offering a quick look at the future of AI and how to guide it. SPONSOR MESSAGES: *** CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments. Check out their super fast DeepSeek R1 hosting! https://centml.ai/pricing/ Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier focussed on o-series style reasoning and AGI. They are hiring a Chief Engineer and ML engineers. Events in Zurich. Goto https://tufalabs.ai/ *** TRANSCRIPT/REFS: https://www.dropbox.com/scl/fi/yqjszhntfr00bhjh6t565/JAKOB.pdf?rlkey=scvny4bnwj8th42fjv8zsfu2y&dl=0 Prof. Jakob Foerster: https://x.com/j_foerst https://www.jakobfoerster.com/ University of Oxford Profile: https://eng.ox.ac.uk/people/jakob-foerster/ Chris Lu: https://chrislu.page/ TOC 1. GPU Acceleration and Training Infrastructure [00:00:00] 1.1 ARC Challenge Criticism and FLAIR Lab Overview [00:01:25] 1.2 GPU Acceleration and Hardware Lottery in RL [00:05:50] 1.3 Data Wall Challenges and Simulation-Based Solutions [00:08:40] 1.4 JAX Implementation and Technical Acceleration 2. Learning Frameworks and Policy Optimization [00:14:18] 2.1 Evolution of RL Algorithms and Mirror Learning Framework [00:15:25] 2.2 Meta-Learning and Policy Optimization Algorithms [00:21:47] 2.3 Language Models and Benchmark Challenges [00:28:15] 2.4 Creativity and Meta-Learning in AI Systems 3.
Multi-Agent Systems and Decentralization [00:31:24] 3.1 Multi-Agent Systems and Emergent Intelligence [00:38:35] 3.2 Swarm Intelligence vs Monolithic AGI Systems [00:42:44] 3.3 Democratic Control and Decentralization of AI Development [00:46:14] 3.4 Open Source AI and Alignment Challenges [00:49:31] 3.5 Collaborative Models for AI DevelopmentREFS[[00:00:05] ARC Benchmark, Chollethttps://github.com/fchollet/ARC-AGI[00:03:05] DRL Doesn't Work, Irpanhttps://www.alexirpan.com/2018/02/14/rl-hard.html[00:05:55] AI Training Data, Data Provenance Initiativehttps://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html[00:06:10] JaxMARL, Foerster et al.https://arxiv.org/html/2311.10090v5[00:08:50] M-FOS, Lu et al.https://arxiv.org/abs/2205.01447[00:09:45] JAX Library, Google Researchhttps://github.com/jax-ml/jax[00:12:10] Kinetix, Mike and Michaelhttps://arxiv.org/abs/2410.23208[00:12:45] Genie 2, DeepMindhttps://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/[00:14:42] Mirror Learning, Grudzien, Kuba et al.https://arxiv.org/abs/2208.01682[00:16:30] Discovered Policy Optimisation, Lu et al.https://arxiv.org/abs/2210.05639[00:24:10] Goodhart's Law, Goodharthttps://en.wikipedia.org/wiki/Goodhart%27s_law[00:25:15] LLM ARChitect, Franzen et al.https://github.com/da-fr/arc-prize-2024/blob/main/the_architects.pdf[00:28:55] AlphaGo, Silver et al.https://arxiv.org/pdf/1712.01815.pdf[00:30:10] Meta-learning, Lu, Towers, Foersterhttps://direct.mit.edu/isal/proceedings-pdf/isal2023/35/67/2354943/isal_a_00674.pdf[00:31:30] Emergence of Pragmatics, Yuan et al.https://arxiv.org/abs/2001.07752[00:34:30] AI Safety, Amodei et al.https://arxiv.org/abs/1606.06565[00:35:45] Intentional Stance, Dennetthttps://plato.stanford.edu/entries/ethics-ai/[00:39:25] Multi-Agent RL, Zhou et al.https://arxiv.org/pdf/2305.10091[00:41:00] Open Source Generative AI, Foerster et al.https://arxiv.org/abs/2405.08597

university ai work law evolution creativity events prof silver openai genie oxford university emergence zurich ml towers genai agi kuba yuan pragmatic decentralization zhou rl chief engineer alphago franzen ai development reinforcement learning foerster goodhart open source ai imagenet swarm intelligence kinetix chris lu

Fei-Fei Li: Staying curious at the forefront of AI

Tools and Weapons with Brad Smith

Play Episode Listen Later Jan 21, 2025 34:43

Fei-Fei Li is a pioneering AI scientist breaking new ground in computer vision, a Stanford professor, and currently leading the innovative start-up World Labs. While her career is deeply rooted in technical expertise, Dr. Li's journey is driven by an insatiable curiosity. In this episode, Brad and Dr. Li reflect on poignant moments from her memoir, "The Worlds I See: Curiosity, Exploration, and Discovery at the Dawn of AI," highlighting the crucial role of keeping humanity at the center of AI development. They also explore how government-funded academic research, driven by curiosity rather than profits, can lead to unexpected and profound discoveries that propel innovation and economic opportunities.

Why Machines Learn: The Elegant Math Behind AI with Anil Ananthaswamy | SparX by Mukesh Bansal

SparX by Mukesh Bansal

Play Episode Listen Later Jan 21, 2025 67:50

Anil Ananthaswamy is a renowned science writer and journalist who has written extensively on various scientific topics. In his latest book "Why Machines Learn", Anil explores the fascinating world of artificial intelligence and machine learning. He reveals the intricate mechanisms and complex algorithms that underlie these cutting-edge technologies. Join us for a fascinating conversation with science writer Anil Ananthaswamy as he shares insights from his book and sheds light on the rapidly evolving field of AI. Tune in to gain a deeper understanding of how these machines work at a basic mathematics level. Resource List - Why Machines Learn, book by Anil Ananthaswamy - https://amzn.in/d/bmirU45 Dartmouth Summer Research Project on Artificial Intelligence - https://home.dartmouth.edu/about/artificial-intelligence-ai-coined-dartmouth What is the Perceptron artificial neural network? - https://www.geeksforgeeks.org/what-is-perceptron-the-simplest-artificial-neural-network/ Read about the McCulloch-Pitts Artificial Neuron - https://towardsdatascience.com/mcculloch-pitts-model-5fdf65ac5dd1 Nobel Prize in Physics 2024 - https://www.nobelprize.org/prizes/physics/2024/press-release/ What is the Hopfield Neural Network? - https://www.geeksforgeeks.org/hopfield-neural-network/ Read about Backpropagation - https://en.wikipedia.org/wiki/Backpropagation “Learning representations by back-propagating errors”, paper by Geoffrey Hinton, David Rumelhart and Ronald Williams - https://www.nature.com/articles/323533a0 AlexNet by Geoffrey Hinton and team - https://en.wikipedia.org/wiki/AlexNet What is ImageNet? - https://www.image-net.org/about.php ‘Attention Is All You Need’, transformer architecture paper - https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf What are Neural Scaling Laws?
- https://en.wikipedia.org/wiki/Neural_scaling_law NeuroAI - https://neuro-ai.com/ About SparX by Mukesh Bansal SparX is a podcast where we delve into cutting-edge scientific research, stories from impact-makers and tools for unlocking the secrets to human potential and growth. We believe that entrepreneurship, fitness and the science of productivity is at the forefront of the India Story; the country is at the cusp of greatness and at SparX, we wish to make these tools accessible for every generation of Indians to be able to make the most of the opportunities around us. In a new episode every Sunday, our host Mukesh Bansal (Founder Myntra and Cult.fit) will talk to guests from all walks of life and also break down everything he's learnt about the science of impact over the course of his 20-year long career. This is the India Century, and we're enthusiastic to start this journey with you. Follow us on Instagram: / sparxbymukeshbansal Website: https://www.sparxbymukeshbansal.com You can also listen to SparX on all audio platforms Fasion | Outbreak | Courtesy EpidemicSound.com

ai artificial intelligence cult math physics machines indians nobel prize neural elegant anil geoffrey hinton bansal mukesh imagenet sparx perceptron backpropagation ronald williams

Fei-Fei Li on spatial intelligence and human-centered AI

Possible

Play Episode Listen Later Jan 15, 2025 44:13

How can we use AI to amplify human potential and build a better future? And what exactly does “AGI” even mean? To kick off Possible's fourth season, Reid and Aria sit down with world-renowned computer scientist Fei-Fei Li, whose work in artificial intelligence over the past several decades has earned her the nickname “the godmother of AI.” An entrepreneur and professor, Fei-Fei shares her journey from creating ImageNet, a massive dataset of labeled images that revolutionized computer vision, to her current role as co-founder and CEO of the spatial intelligence startup World Labs. She explains why spatial intelligence—the ability to perceive and interact with the 3D world—is so crucial for AI's development and how it could lead to breakthroughs in fields like medicine, climate, and education. They get into regulatory guardrails, governance, and what it will take to build a positive, human-centered AI future for all. For more info on the podcast and transcripts of all the episodes, visit https://www.possible.fm/podcast/ Topics: 1:55 - Hellos and intros 3:43 - ImageNet and the interplay between data and models 6:06 - World Labs and spatial intelligence 10:03 - Boundaries between 3D physical and digital worlds 11:50 - The difference between LLMs and LWMs 13:02 - What humans are capable of creating with technology 14:04 - Key principles of AI: human agency and respect 17:16 - Stanford Institute for Human-Centered AI 19:13 - What this moment in AI means for humanity 21:06 - Cross-sector collaboration 25:10 - AI4ALL program and the importance of diversity in AI development 27:00 - Midroll ad break 27:09 - Using AI to improve healthcare delivery and treatment 30:20 - Founding history of AI and the meaning of the term “AGI” 33:00 - Future of agentic AI and voice 34:42 - Fei-Fei's mentor and his advice 37:18 - Rapid-fire questions Possible is an award-winning podcast that sketches out the brightest version of the future—and what it will take to get there. 
Most of all, it asks: what if, in the future, everything breaks humanity's way? Tune in for grounded and speculative takes on how technology—and, in particular, AI—is inspiring change and transforming the future. Hosted by Reid Hoffman and Aria Finger, each episode features an interview with an ambitious builder or deep thinker on a topic, from art to geopolitics and from healthcare to education. These conversations also showcase another kind of guest: AI. Whether it's Inflection's Pi, OpenAI's ChatGPT or other AI tools, each episode will use AI to enhance and advance our discussion about what humanity could possibly get right if we leverage technology—and our collective effort—effectively.

NASA's Moon Micro-Mission, AgiBot's Humanoid Robot Training Dataset, and Microsoft's $100B AGI Definition

Discover Daily by Perplexity

Play Episode Listen Later Jan 7, 2025 7:45 Transcription Available

We're experimenting and would love to hear from you! In this episode of 'Discover Daily' by Perplexity, hosts Isaac and Sienna explore NASA's upcoming Lunar Trailblazer mission, scheduled for January 2025. This compact satellite mission aims to map water resources on the Moon's surface using advanced instruments like the High-resolution Volatiles and Minerals Moon Mapper and the Lunar Thermal Mapper. The mission represents a crucial step in NASA's Artemis program, designed to establish sustainable human presence on the Moon. The show delves into a groundbreaking development in robotics, highlighting Chinese startup AgiBot's release of the AgiBot World Alpha dataset. This comprehensive open-source collection features over one million trajectories from 100 robots in industrial-grade environments, potentially marking an 'ImageNet moment' for embodied intelligence in robotics. The main story focuses on Microsoft and OpenAI's unconventional redefinition of Artificial General Intelligence (AGI), which ties achievement to a $100 billion profit milestone. The episode examines the implications of this profit-centric definition, Microsoft's diversification strategy in AI investments, and the complex dynamics of their partnership agreement. This innovative approach to defining AGI raises important questions about the future direction of AI development and its impact on the tech industry. From Perplexity's Discover Feed: https://www.perplexity.ai/page/nasa-s-moon-micro-mission-Bua4as.9SCi.G_fZKZUCPA https://www.perplexity.ai/page/agibot-s-humanoid-robot-traini-ovKJpg2RSey1INdEnXwuNw https://www.perplexity.ai/page/microsoft-s-100b-agi-definitio-e6FaEhReQs.9exHMGZpuog Perplexity is the fastest and most powerful way to search the web. Perplexity crawls the web and curates the most relevant and up-to-date sources (from academic papers to Reddit threads) to create the perfect response to any question or topic you're interested in. Take the world's knowledge with you anywhere.
Available on iOS and Android. Join our growing Discord community for the latest updates and exclusive content. Follow us on: Instagram, Threads, X (Twitter), YouTube, LinkedIn

Classifying Images: Massive Parallelism And Surface Features

Fluidity

Play Episode Listen Later Jan 5, 2025 15:05

Analysis of image classifiers demonstrates that it is possible to understand backprop networks at the task-relevant run-time algorithmic level. In these systems, at least, networks gain their power from deploying massive parallelism to check for the presence of a vast number of simple, shallow patterns. https://betterwithout.ai/images-surface-features

This episode has a lot of links:

David Chapman's earliest public mention, in February 2016, of image classifiers probably using color and texture in ways that "cheat": twitter.com/Meaningness/status/698688687341572096
Jordana Cepelewicz's “Where we see shapes, AI sees textures,” Quanta Magazine, July 1, 2019: https://www.quantamagazine.org/where-we-see-shapes-ai-sees-textures-20190701/
“Suddenly, a leopard print sofa appears,” May 2015: https://web.archive.org/web/20150622084852/http://rocknrollnerd.github.io/ml/2015/05/27/leopard-sofa.html
“Understanding How Image Quality Affects Deep Neural Networks,” April 2016: https://arxiv.org/abs/1604.04004
Goodfellow et al., “Explaining and Harnessing Adversarial Examples,” December 2014: https://arxiv.org/abs/1412.6572
“Universal adversarial perturbations,” October 2016: https://arxiv.org/pdf/1610.08401v1.pdf
“Exploring the Landscape of Spatial Robustness,” December 2017: https://arxiv.org/abs/1712.02779
“Overinterpretation reveals image classification model pathologies,” NeurIPS 2021: https://proceedings.neurips.cc/paper/2021/file/8217bb4e7fa0541e0f5e04fea764ab91-Paper.pdf
“Approximating CNNs with Bag-of-Local-Features Models Works Surprisingly Well on ImageNet,” ICLR 2019: https://openreview.net/forum?id=SkfMWhAqYQ
Baker et al.'s “Deep convolutional networks do not classify based on global object shape,” PLOS Computational Biology, 2018: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006613
François Chollet's Twitter threads about AI producing images of horses with extra legs: twitter.com/fchollet/status/1573836241875120128 and twitter.com/fchollet/status/1573843774803161090
“Zoom In: An Introduction to Circuits,” 2020: https://distill.pub/2020/circuits/zoom-in/
Geirhos et al., “ImageNet-Trained CNNs Are Biased Towards Texture; Increasing Shape Bias Improves Accuracy and Robustness,” ICLR 2019: https://openreview.net/forum?id=Bygh9j09KX
Dehghani et al., “Scaling Vision Transformers to 22 Billion Parameters,” 2023: https://arxiv.org/abs/2302.05442
Hasson et al., “Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks,” February 2020: https://www.gwern.net/docs/ai/scaling/2020-hasson.pdf


2024 in Vision [LS Live @ NeurIPS]

Latent Space: The AI Engineer Podcast - CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Dec 22, 2024 57:25

Happy holidays! We'll be sharing snippets from Latent Space LIVE! through the break, bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all, all our LS supporters who helped fund the gorgeous venue and A/V production!

For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in-person miniconference, at NeurIPS 2024 in Vancouver.

The single most requested domain was computer vision, and we could think of no one better to help us recap 2024 than our friends at Roboflow, who were one of our earliest guests in 2023 and had one of this year's top episodes in 2024 again. Roboflow has since raised a $40m Series B!

Links

Their slides are here:

All the trends and papers they picked:

* Isaac Robinson
  * Sora (see our Video Diffusion pod) - extending diffusion from images to video
  * SAM 2: Segment Anything in Images and Videos (see our SAM2 pod) - extending prompted masks to full video object segmentation
  * DETR Dominancy: DETRs show Pareto improvement over YOLOs
    * RT-DETR: DETRs Beat YOLOs on Real-time Object Detection
    * LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection
    * D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement
* Peter Robicheaux
  * MMVP (Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs)
  * Florence 2 (Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks)
  * PaliGemma / PaliGemma 2
    * PaliGemma: A versatile 3B VLM for transfer
    * PaliGemma 2: A Family of Versatile VLMs for Transfer
  * AIMv2 (Multimodal Autoregressive Pre-training of Large Vision Encoders)
* Vik Korrapati - Moondream

Full Talk on YouTube

Want more content like this? Like and subscribe to stay updated on our latest talks, interviews, and podcasts.

Transcript/Timestamps

[00:00:00] Intro
[00:00:05] AI Charlie: Welcome to Latent Space Live, our first mini conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co-host. When we were thinking of ways to add value to our academic conference coverage, we realized that there was a lack of good talks just recapping the best of 2024, going domain by domain.
[00:00:36] AI Charlie: We sent out a survey to the over 900 of you who told us what you wanted, and then invited the best speakers in the Latent Space network to cover each field. 200 of you joined us in person throughout the day, with over 2,200 watching live online. Our second featured keynote is The Best of Vision 2024, with Peter Robicheaux and Isaac [00:01:00] Robinson of Roboflow, with a special appearance from Vik Korrapati of Moondream.
[00:01:05] AI Charlie: When we did a poll of our attendees, the highest interest domain of the year was vision. And so our first port of call was our friends at Roboflow. Joseph Nelson helped us kickstart our vision coverage in episode 7 last year, and this year came back as a guest host with Nikhila Ravi of Meta to cover Segment Anything 2.
[00:01:25] AI Charlie: Roboflow have consistently been the leaders in open source vision models and tooling, with their Supervision library recently eclipsing PyTorch's Vision library, and Roboflow Universe hosting hundreds of thousands of open source vision datasets and models. 
They have since announced a 40 million Series B led by Google Ventures.
[00:01:46] AI Charlie: Woohoo.
[00:01:48] Isaac's picks
[00:01:48] Isaac Robinson: Hi, we're Isaac and Peter from Roboflow, and we're going to talk about the best papers of 2024 in computer vision. So, for us, we defined best as what made [00:02:00] the biggest shifts in the space. And to determine that, we looked at what are some major trends that happened and what papers most contributed to those trends.
[00:02:09] Isaac Robinson: So I'm going to talk about a couple trends, Peter's going to talk about a trend, and then we're going to hand it off to Moondream. So, the trends that I'm interested in talking about are a major transition from models that run on a per-image basis to models that run using the same basic ideas on video.
[00:02:28] Isaac Robinson: And then also how DETRs are starting to take over the real-time object detection scene from the YOLOs, which have been dominant for years.
[00:02:37] Sora, OpenSora and Video Vision vs Generation
[00:02:37] Isaac Robinson: So as a highlight we're going to talk about Sora, which from my perspective is the biggest paper of 2024, even though it came out in February. Is the what?
[00:02:48] Isaac Robinson: Yeah. Yeah. So Sora is just a post. So I'm going to fill it in with details from replication efforts, including OpenSora, and related work, such as Stable [00:03:00] Video Diffusion. And then we're also going to talk about SAM 2, which applies the SAM strategy to video. And then the improvements in 2024 to DETRs that are making them a Pareto improvement over YOLO-based models.
[00:03:15] Isaac Robinson: So to start this off, we're going to talk about the state of the art of video generation at the end of 2023, MAGVIT. MAGVIT is a discrete-token video tokenizer akin to VQGAN, but applied to video sequences. 
And it actually outperforms state-of-the-art handcrafted video compression frameworks.
[00:03:38] Isaac Robinson: In terms of the bit rate versus human preference for quality. And videos generated by autoregressing on these discrete tokens look pretty nice, but only up to like five seconds in length and, you know, not super detailed. And then suddenly a few months later we have this, which, when I saw it, was totally mind-blowing to me.
[00:03:59] Isaac Robinson: 1080p, [00:04:00] a whole minute long. We've got light reflecting in puddles. That's reflective. Reminds me of those RTX demonstrations for next-generation video games, such as Cyberpunk, but with better graphics. You can see some issues in the background if you look closely, but, as with a lot of these models, the issues tend to be things that people aren't going to pay attention to unless they're looking for them.
[00:04:24] Isaac Robinson: In the same way that, like, six fingers on a hand is a giveaway you're not going to notice unless you're looking for it. So yeah, as we said, Sora does not have a paper. So we're going to be filling it in with context from the rest of the computer vision scene attempting to replicate these efforts. So the first step: you have an LLM caption a huge amount of videos.
[00:04:48] Isaac Robinson: This is a trick that they introduced in DALL-E 3, where they train an image captioning model to just generate very high quality captions for a huge corpus and then train a diffusion model [00:05:00] on that. Sora and the replication efforts also show a bunch of other steps that are necessary for good video generation.
[00:05:09] Isaac Robinson: Including filtering by aesthetic score and filtering by making sure the videos have enough motion, so the generator is not just learning to generate static frames. So then we encode our video into a series of spacetime latents. 
Once again, Sora is very sparse in details.
[00:05:29] Isaac Robinson: So in the replication-related works, OpenSora actually uses a MAGVIT-v2 itself to do this, but swapping out the discretization step with a classic VAE autoencoder framework. They show that there's a lot of benefit from getting the temporal compression, which makes a lot of sense, as sequential frames in videos have mostly redundant information.
[00:05:53] Isaac Robinson: So by compressing in the temporal space, you allow the latent to hold [00:06:00] a lot more semantic information while avoiding that duplication. So, we've got our spacetime latents, possibly via some 3D VAE, presumably a MAGVIT-v2, and then you throw it into a diffusion transformer.
[00:06:19] Isaac Robinson: So I think it's personally interesting to note that OpenSora is using a MAGVIT-v2, which originally used an autoregressive transformer decoder to model the latent space, but is now using a diffusion transformer. So it's still a transformer happening. Just the question is, like, is it
[00:06:37] Isaac Robinson: parameterizing the stochastic differential equation, or parameterizing a conditional distribution via autoregression? It's also worth noting that most diffusion models today, the very high performance ones, are switching away from the classic, like, DDPM (denoising diffusion probabilistic models) framework to rectified flows.
[00:06:57] Isaac Robinson: Rectified flows have a very interesting property that as [00:07:00] they converge, they actually get closer to being able to be sampled with a single step. Which means that in practice, you can actually generate high quality samples much faster. A major problem of DDPM and related models for the past four years is just that they require many, many steps to generate high quality samples.
[00:07:22] Isaac Robinson: So, and naturally, the third step is throwing lots of compute at the problem. 
So I never figured out how to get this video to loop, but we see very little compute, medium compute, lots of compute. This is so interesting because the original diffusion transformer paper from Facebook actually showed that, in fact, the specific hyperparameters of the transformer didn't really matter that much.
[00:07:48] Isaac Robinson: What mattered was that you were just increasing the amount of compute that the model had. So, I love how in the, once again, little blog post, they don't even talk about [00:08:00] like the specific hyperparameters. They say, we're using a diffusion transformer, and we're just throwing more compute at it, and this is what happens.
[00:08:08] Isaac Robinson: OpenSora shows similar results. The primary issue I think here is that no one else has a 32x compute budget. So we end up in the middle of the domain in most of the related work, which is still super, super cool. It's just a little disappointing considering the context. So I think this is a beautiful extension of the framework that was introduced in '22 and '23 for these very high quality per-image generation models, and then extending that to videos.
[00:08:39] Isaac Robinson: It's awesome. And it's GA as of Monday, except no one can seem to get access to it because they keep shutting down the login.
[00:08:46] SAM and SAM2
[00:08:46] Isaac Robinson: So the next paper I wanted to talk about is SAM. So we at Roboflow allow users to label data and train models on that data. SAM, for us, has saved our users 75 years of [00:09:00] labeling time.
[00:09:00] Isaac Robinson: We are, to the best of my knowledge, the largest SAM API that exists. 
SAM also allows us to have our users train just pure bounding box regression models and use those to generate high quality masks, which has the great side effect of requiring less training data to have a meaningful convergence.
[00:09:20] Isaac Robinson: So most people are data limited in the real world. So anything that requires less data to get to a useful thing is super useful. Most of our users actually run their per-frame object detectors on every frame in a video, or maybe not most, but many, many. And so SAM 2 falls into this category of taking something that really, really works and applying it to video, which has the wonderful benefit of being plug and play with many of our users' use cases.
[00:09:53] Isaac Robinson: We're still building out a sufficiently mature pipeline to take advantage of that, but it's in the works. [00:10:00] So here we've got a great example. We can click on cells and then follow them. You even notice the cell goes away and comes back and we can still keep track of it, which is very challenging for existing object trackers.
[00:10:14] Isaac Robinson: High level overview of how SAM 2 works. There's a simple pipeline here where we can provide some type of prompt and it fills out the rest of the likely masks for that object throughout the rest of the video. So here we're giving a bounding box in the first frame, a set of positive and negative points, or even just a simple mask.
[00:10:36] Isaac Robinson: I'm going to assume people are somewhat familiar with SAM. So I'm going to just give a high level overview of how SAM works. You have an image encoder that runs on every frame. 
SAM 2 can be used on a single image, in which case the only difference between SAM 2 and SAM is that image encoder. SAM used a standard ViT; [00:11:00] SAM 2 replaced that with a Hiera hierarchical encoder, which gets approximately the same results, but leads to a six times faster inference, which is
[00:11:11] Isaac Robinson: excellent, especially considering how a trend of '23 was replacing the ViT with more efficient backbones. In the case where you're doing video segmentation, the difference is that you actually create a memory bank and you cross-attend the features from the image encoder based on the memory bank.
[00:11:31] Isaac Robinson: So the feature set that is created is essentially, well, I'll go more into it in a couple of slides, but we take the features from the past couple frames, plus a set of object pointers and the set of prompts, and use that to generate our new masks. We then fuse the new masks for this frame with the
[00:11:57] Isaac Robinson: image features and add that to the memory bank. [00:12:00] Well, I'll say more in a minute. Just like SAM, SAM 2 actually uses a data engine to create its dataset, in that they assembled a huge amount of reference data, used people to label some of it and train the model, used the model to label more of it, and asked people to refine the predictions of the model.
[00:12:20] Isaac Robinson: And then ultimately the dataset is just created from the final output of the model on the reference data. This paradigm is so interesting to me because it unifies a model and a dataset in a way that is very unique. It seems unlikely that another model could come in and have such a tight.
[00:12:37] Isaac Robinson: So, a brief overview of how the memory bank works. The paper did not have a great visual, so I'm going to fill in a bit more. So we take the last couple of frames from our video. 
And we attend to that, along with the set of prompts that we provided ([00:13:00] they could come from the future, they could come from anywhere in the video), as well as reference object pointers, saying, by the way, here's what we've found so far. Attending to the last few frames has the interesting benefit of allowing it to model complex object motion without actually
[00:13:18] Isaac Robinson: By limiting the number of frames that you attend to, you manage to keep the model running in real time. This is such an interesting topic for me because one would assume that attending to all of the frames is super essential, or having some type of summarization of all the frames is super essential for high performance.
[00:13:35] Isaac Robinson: But we see in their later ablation that that actually is not the case. So here, just to make sure that there is some benchmarking happening, we just compared to some of the stuff that came out prior, and indeed the SAM 2 strategy does improve on the state of the art. This ablation deep in their appendices was super interesting to me.
[00:13:59] Isaac Robinson: [00:14:00] We see in section C, the number of memories. One would assume that increasing the count of memories would meaningfully increase performance. And we see that it has some impact, but not the type that you'd expect. And that it meaningfully decreases speed, which justifies, in my mind, just having this FIFO queue of memories.
[00:14:20] Isaac Robinson: Although in the future, I'm super interested to see a more dedicated summarization of all of the past video, not just a stacking of the last frames. 
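The FIFO queue of memories described above can be sketched as a bounded queue. This is only an illustration of the idea, not SAM 2's actual implementation: the `MemoryBank` class, the `max_recent` parameter, and the strings standing in for feature tensors are all hypothetical, and keeping prompted frames outside the queue is an assumption based on the talk's remark that prompts can come from anywhere in the video.

```python
from collections import deque


class MemoryBank:
    """Toy FIFO memory: a bounded queue of recent frames plus prompted frames."""

    def __init__(self, max_recent=6):
        # A deque with maxlen silently drops its oldest entry once it is full.
        self.recent = deque(maxlen=max_recent)
        # Assumption: memories from prompted frames are kept regardless of age.
        self.prompted = []

    def add_frame(self, frame_idx, features, prompted=False):
        entry = (frame_idx, features)
        if prompted:
            self.prompted.append(entry)
        else:
            self.recent.append(entry)

    def context(self):
        """Everything the current frame would cross-attend to."""
        return self.prompted + list(self.recent)


bank = MemoryBank(max_recent=3)
bank.add_frame(0, "feat0", prompted=True)  # the frame the user clicked on
for i in range(1, 6):
    bank.add_frame(i, f"feat{i}")

frame_ids = [idx for idx, _ in bank.context()]
print(frame_ids)  # [0, 3, 4, 5]
```

Because the queue length is fixed, the cost of attending to memory stays constant no matter how long the video runs, which is the real-time argument made in the talk.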
So that's another extension of beautiful per-frame work into the video domain.
[00:14:42] Realtime detection: DETRs > YOLO
[00:14:42] Isaac Robinson: The next trend I'm interested in talking about is this: at Roboflow, we're super interested in training real-time object detectors.
[00:14:50] Isaac Robinson: Those are our bread and butter. And so we're doing a lot to keep track of what is actually happening in that space. We are finally starting to see something change. So, [00:15:00] for years, YOLOs have been the dominant way of doing real-time object detection, and we can see here that they've essentially stagnated.
[00:15:08] Isaac Robinson: The performance between 10 and 11 is not meaningfully different, at least, you know, in this type of high level chart. And even from the last couple series, there's not a major change. So YOLOs have hit a plateau, DETRs have not. So we can look here and see the YOLO series has this plateau. And then these RT-DETR, LW-DETR, and D-FINE have meaningfully changed that plateau, so that in fact the best D-FINE models are plus 4.6 AP on COCO at the same latency.
[00:15:43] Isaac Robinson: So, three major steps to accomplish this. The first, RT-DETR, which is technically a 2023 preprint, but published officially in '24, so I'm going to include that. I hope that's okay. [00:16:00] RT-DETR showed that we could actually match or outspeed YOLOs.
[00:16:04] Isaac Robinson: And then LW-DETR showed that pre-training is hugely effective on DETRs and much less so on YOLOs. And then D-FINE added the types of bells and whistles that we expect from this arena. So the major improvement that RT-DETR shows was taking the multi-scale features that DETRs typically pass into their encoder and decoupling them into a much more efficient transformer encoder.
[00:16:30] Isaac Robinson: The transformer is, of course, quadratic complexity. 
So decreasing the amount of stuff that you pass in at once is super helpful for decreasing your runtime or increasing your throughput. So that change basically brought us up to YOLO speed, and then they do a hardcore analysis on benchmarking YOLOs, including the NMS step.
[00:16:54] Isaac Robinson: Once you include the NMS in the latency calculation, you see that in fact these DETRs [00:17:00] are outperforming, at least this time, the YOLOs that existed. Then LW-DETR goes in and suggests that in fact the huge boost here is from pre-training. So, this is the D-FINE line, and this is the D-FINE line without pre-training.
[00:17:19] Isaac Robinson: It's within range, it's still an improvement over the YOLOs, but the really huge boost comes from the benefit of pre-training. When YOLOX came out in 2021, they showed that they got much better results by having a much, much longer training time, but they found that when they did that, they actually did not benefit from pre-training.
[00:17:40] Isaac Robinson: So, you see in this graph from LW-DETR, in fact, YOLOs do have a real benefit from pre-training, but it goes away as we increase the training time. Then, the DETRs converge much faster. LW-DETR trains for only 50 epochs, RT-DETR is 60 epochs. So, one could assume that, in fact, [00:18:00] the entire extra gain from pre-training is that you're not destroying your original weights
[00:18:06] Isaac Robinson: by relying on this long training cycle. And then LW-DETR also shows superior performance on our favorite dataset, Roboflow 100, which means that they do better on the real world, not just on COCO. Then D-FINE throws all the bells and whistles at it. YOLO models tend to have a lot of very specific, complicated loss functions.
[00:18:26] Isaac Robinson: D-FINE brings that into the DETR world and shows consistent improvement on a variety of DETR-based frameworks. 
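For context on why counting the NMS step changes the latency comparison: YOLO-style detectors typically emit many overlapping candidate boxes that a post-processing pass has to prune, while DETR-style models predict a set of boxes directly and skip that pass. A minimal sketch of greedy non-maximum suppression in plain Python follows; the `iou` and `nms` helper names, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are assumptions for illustration, not taken from any of the papers discussed.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The pairwise overlap checks run after the network has finished, so a benchmark that times only the forward pass understates the end-to-end cost of detectors that depend on this step.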
So bring these all together and we see that suddenly we have almost 60 AP on COCO while running in like 10 milliseconds. Huge, huge stuff. So we're spending a lot of time trying to build models that work better with less data, and DETRs are clearly becoming a promising step in that direction.
[00:18:56] Isaac Robinson: What we're interested in seeing [00:19:00] from the DETRs in this trend next is: Co-DETR and the models that are currently sitting on the top of the leaderboard for large-scale inference scale really well as you switch out the backbone. We're very interested in seeing and having people publish a paper, potentially us, on what happens if you take these real-time ones and then throw a Swin-G at it.
[00:19:23] Isaac Robinson: Like, do we have a Pareto curve that extends from the real-time domain all the way up to the super, super slow but high performance domain? We also want to see people benchmarking on RF100 more, because that type of data is what's relevant for most users. And we want to see more pre-training, because pre-training works now.
[00:19:43] Isaac Robinson: It's super cool.
[00:19:48] Peter's Picks
[00:19:48] Peter Robicheaux: Alright, so, yeah, so in that theme, one of the big things that we're focusing on is how do we get more out of our pre-trained models. And one of the lenses to look at this is through sort of [00:20:00] this new requirement for, like, fine-grained visual details in your representations that are extracted from your foundation model.
[00:20:08] Peter Robicheaux: So it's sort of a hook for this. Oh, yeah, this is just a list of all the papers that I'm going to mention. I just want to make sure I cite an actual paper so you can find it later.
[00:20:18] MMVP (Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs)
[00:20:18] Peter Robicheaux: Yeah, so sort of the big hook here is that I make the claim that LLMs can't see. If you go to Claude or ChatGPT and you ask it to see this watch and tell me what time it is, it fails, right?
[00:20:34] Peter Robicheaux: And so you could say, like, this is a very classic test of an LLM, but you could say, okay, maybe this image is too zoomed out, and it'll do better if we increase the resolution, and it has an easier time finding these fine-grained features, like where the watch hands are pointing.
[00:20:53] Peter Robicheaux: No dice. And you can say, okay, well, maybe the model just doesn't know how to tell time from knowing the position of the hands. But if you actually prompt [00:21:00] it textually, it's very easy for it to tell the time. So this to me is proof that these LLMs literally cannot see the position of the watch hands, and they can't see those details.
[00:21:08] Peter Robicheaux: So the question is sort of why? And for you Anthropic heads out there, Claude fails too. So my first pick for best paper of 2024 in vision is this MMVP paper, which tries to investigate why LLMs do not have the ability to see fine-grained details. And so, for instance, it comes up with a lot of images like this, where you ask it a question that seems very visually apparent to us, like, which way is the school bus facing?
[00:21:32] Peter Robicheaux: And it gets it wrong, and then, of course, it makes up details to support its wrong claim. And so, the process by which it finds these images is sort of contained in its hypothesis for why it can't see these details. 
So it hypothesizes that models that have been initialized with CLIP as their vision encoder don't have fine-grained details in the features extracted using CLIP, because CLIP sort of doesn't need to find these fine-grained [00:22:00] details to do its job correctly, which is just to match captions and images, right?
[00:22:04] Peter Robicheaux: And sort of at a high level, even if ChatGPT wasn't initialized with CLIP and the vision encoder wasn't trained contrastively at all, still, in order to do its job of captioning the image, it could do a pretty good job without actually finding the exact position of all the objects and visual features in the image, right?
[00:22:21] Peter Robicheaux: So this paper finds a set of difficult images for these types of models. And the way it does it is it looks for embeddings that are similar in CLIP space, but far in DINOv2 space. So DINOv2 is a foundation model that was trained self-supervised purely on image data. And it kind of uses some complex student-teacher framework, but essentially it patches out certain areas of the image, or crops certain areas of the image, and tries to make sure that those have consistent representations, which is a way for it to learn very fine-grained visual features.
[00:22:54] Peter Robicheaux: And so if you take things that are very close in CLIP space and very far in DINOv2 space, you get a set of images, [00:23:00] basically pairs of images, that are hard for ChatGPT and other big language models to distinguish. So, if you then ask it questions about this image, well, as you can see from this chart, it's going to answer the same way for both images, right?
[00:23:14] Peter Robicheaux: Because from the perspective of the vision encoder, they're the same image. And so if you ask a question like, how many eyes does this animal have? It answers the same for both. 
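The pair-mining idea described above (close in CLIP space, far in DINOv2 space) can be sketched with cosine similarity. This is an illustrative sketch rather than the paper's code: the function names, the toy two-dimensional embeddings, and both thresholds are assumptions chosen for the example, and real embeddings would come from the two encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_clip_blind_pairs(clip_emb, dino_emb, clip_min=0.95, dino_max=0.6):
    """Return index pairs that look alike to CLIP but different to DINOv2.

    clip_emb[i] and dino_emb[i] are the two embeddings of image i.
    The thresholds are assumed values for this sketch.
    """
    pairs = []
    n = len(clip_emb)
    for i in range(n):
        for j in range(i + 1, n):
            close_in_clip = cosine(clip_emb[i], clip_emb[j]) > clip_min
            far_in_dino = cosine(dino_emb[i], dino_emb[j]) < dino_max
            if close_in_clip and far_in_dino:
                pairs.append((i, j))
    return pairs


# Toy embeddings: images 0 and 1 nearly coincide for "CLIP" but are
# orthogonal for "DINOv2"; image 2 is different under both encoders.
clip_emb = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
dino_emb = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

pairs = find_clip_blind_pairs(clip_emb, dino_emb)
print(pairs)  # [(0, 1)]
```

Each mined pair would then get a question whose answer differs between the two images, so a model whose vision encoder cannot tell them apart ends up giving the same answer to both.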
And all these other models, including LLaVA, do the same thing, right? And so this is the benchmark that they create, which is finding CLIP-blind pairs, which are pairs of images that are similar in CLIP space, and creating a dataset of multiple choice questions based off of those.
[00:23:39] Peter Robicheaux: And so how do these models do? Well, really bad. So, ChatGPT and Gemini do a little bit better than random guessing, but, like, half of the performance of humans, who find these problems to be very easy. LLaVA is, interestingly, extremely negatively correlated with this dataset. It does much, much, much, much worse [00:24:00] than random guessing, which means that this process has done a very good job of identifying hard images for LLaVA specifically.
[00:24:07] Peter Robicheaux: And that's because LLaVA is basically not trained for very long and is initialized from CLIP, and so you would expect it to do poorly on this dataset. So, one of the proposed solutions that this paper attempts is basically saying, okay, well, if CLIP features aren't enough, what if we train the visual encoder of the language model also on DINO features?
[00:24:27] Peter Robicheaux: And so it proposes two different ways of doing this. One, additively, which is basically interpolating between the two features, and then one is interleaving, which is just kind of like training on the combination of both features. So there's this really interesting trend when you do the additive mixture of features.
[00:24:45] Peter Robicheaux: So zero is all CLIP features and one is all DINOv2 features. So I think it's helpful to look at the rightmost chart first, which is: as you increase the number of DINOv2 features, your model does worse and worse and [00:25:00] worse on the actual language modeling task. 
And that's because DINOv2 features were trained in a completely self-supervised manner and completely in image space.
[00:25:08] Peter Robicheaux: It knows nothing about text. These features aren't really compatible with these text models. And so you can train an adapter all you want, but it seems that it's in such an alien language that it's a very hard optimization for these models to solve. And so that kind of supports what's happening on the left, which is that, yeah, it gets better at answering these questions as you include more DINOv2 features up to a point, but then, when you oversaturate, it completely loses its ability to, like,
[00:25:36] Peter Robicheaux: answer language and do language tasks. So you can also see with the interleaving, they essentially double the number of tokens that are going into these models and just train on both, and it still doesn't really solve the MMVP task. It gets LLaVA 1.5 above random guessing by a little bit, but it's still not close to ChatGPT or, you know, any human performance, obviously.
[00:25:59] Peter Robicheaux: [00:26:00] So clearly this proposed solution of just using DINOv2 features directly isn't going to work. 
And basically what that means is that, as a vision foundation model, DINOv2 is going to be insufficient for language tasks, right?
[00:26:14] Florence 2 (Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks)
[00:26:14] Peter Robicheaux: So my next pick for best paper of 2024 would be Florence-2, which tries to solve this problem by incorporating not only this dimension of spatial hierarchy, which is to say pixel-level understanding, but also making sure to include what they call semantic granularity. The goal is basically to have features that are sufficient for finding objects in the image, so they have enough pixel information, but that can also be talked about and reasoned about.
[00:26:44] Peter Robicheaux: And that's on the semantic granularity axis. So here's an example of basically three different paradigms of labeling that they do. So they create a big dataset. One is text, which is just captioning. And you would expect a model that's trained [00:27:00] only on captioning to have similar performance to ChatGPT and not have spatial hierarchy, not have features that are meaningful at the pixel level.
[00:27:08] Peter Robicheaux: And so they add another type, which is region-text pairs, which is essentially either classifying a region, or doing object detection or instance segmentation on that region, or captioning that region. And then they have text-phrase-region annotations, which is essentially a triple. And basically, not only do you have a region that you've described, you also find its place in a descriptive paragraph about the image, which is basically trying to introduce even more semantic understanding of these regions.
[00:27:39] Peter Robicheaux: And so, for instance, if you're saying a woman riding on the road, right, you have to know what a woman is and what the road is and that she's on top of it.
And that's basically composing a bunch of objects in this visual space, but also thinking about it semantically, right? And so the way that they do this is they basically just dump features from a vision encoder [00:28:00] straight into an encoder-decoder transformer.
[00:28:03] Peter Robicheaux: And then they train a bunch of different tasks, like object detection and so on, as a language task. And I think that's one of the big things that we saw in 2024: these vision language models operating on pixel space linguistically. So they introduced a bunch of new tokens to point to locations and
[00:28:22] Peter Robicheaux: So how does it work? How does it actually do? We can see, if you look at the graph on the right, which is using the DINO framework, your pre-trained Florence-2 models transfer very, very well. They get 60% mAP on COCO, which is approaching state of the art, and they train
[00:28:42] Vik Korrapati: with, and they
[00:28:43] Peter Robicheaux: train much more efficiently.
[00:28:47] Peter Robicheaux: So they converge a lot faster, and both of these things are pointing to the fact that they're actually leveraging their pre-trained weights effectively. So where is it falling short? So these models, I forgot to mention, Florence-2 comes in a 0.2 [00:29:00] billion and a 0.7 billion parameter count. So they're very, very small in terms of being a language model.
[00:29:05] Peter Robicheaux: And I think that with this framework, you can see saturation.
So, what this graph is showing is that if you train a Florence-2 model purely on the image-level and region-level annotations, and not including the pixel-level annotations like segmentation, it actually performs better as an object detector.
[00:29:25] Peter Robicheaux: And what that means is that it's not able to actually learn all the visual tasks that it's trying to learn because it doesn't have enough capacity.
[00:29:32] PaliGemma / PaliGemma 2
[00:29:32] Peter Robicheaux: So I'd like to see this paper explore larger model sizes, which brings us to our next big paper of 2024, or two papers. So PaliGemma came out earlier this year.
[00:29:42] Peter Robicheaux: PaliGemma 2 was released, I think, a week or two ago. Oh, I forgot to mention, you can label text datasets on Roboflow and you can train a Florence-2 model, and you can actually train a PaliGemma 2 model on Roboflow, which we got into the platform within, like, 14 hours of release, which I was really excited about.
[00:29:59] Peter Robicheaux: So, anyway, [00:30:00] PaliGemma is essentially doing the same thing, but instead of doing an encoder-decoder, it just dumps everything into a decoder-only transformer model. But it also introduced the concept of location tokens to point to objects in pixel space. PaliGemma uses Gemma as the language encoder, and it uses Gemma 2B.
[00:30:17] Peter Robicheaux: PaliGemma 2 introduces using multiple different sizes of language encoders. So, the way that they sort of get around having to do encoder-decoder is they use the concept of prefix loss, which basically means that when it's generating tokens autoregressively, all those tokens in the prefix, which is the image that it's looking at and a description of the task that it's trying to do,
[00:30:41] Peter Robicheaux: they're attending to each other fully, full attention. Which means that, you know, it's easier for the prefix to color the output of the suffix and also to just find features easily. So this is sort of [00:31:00] an example of one of the tasks that it was trained on, which is, you describe the task in English, you're asking for it to segment these two classes of objects, and then it finds their locations using these tokens, and it finds their masks using some encoding of the masks into tokens.
[00:31:24] Peter Robicheaux: And, yeah, so, one of my critiques, I guess, of PaliGemma 1, at least, is that you find that performance saturates as a pre-trained model after only 300 million examples seen. So, what this graph is representing is that each blue dot is the performance on some downstream task. And you can see that after seeing 300 million examples, it does about as well on all of the downstream tasks that they tried it on, which was a lot, as it does after 1 billion examples, which to me also kind of suggests a lack of capacity for this model.
[00:31:58] Peter Robicheaux: PaliGemma 2, [00:32:00] you can see the results on object detection. So these were transferred to COCO. And you can see that this sort of also points to an increase in capacity being helpful to the model. You can see that as both the resolution increases and the parameter count of the language model increases, performance increases.
[00:32:16] Peter Robicheaux: So resolution makes sense, obviously, it helps to find small objects in the image. But it also makes sense for another reason, which is that it kind of gives the model a thinking register, and it gives it more tokens to process when making its predictions. But yeah, you could say, oh, 43.6,
[00:32:30] Peter Robicheaux: that's not that great, like Florence-2 got 60. But this is not training a DINO or a DETR on top of this image encoder. It's doing the raw language modeling task on COCO.
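To make "object detection as a raw language modeling task" concrete, here is a hedged sketch of how a bounding box might be turned into location tokens. The talk only says that location tokens point to objects in pixel space; the `<locNNNN>` spelling, the 1024 bins, and the y-before-x ordering are assumptions based on the commonly described PaliGemma convention, and `box_to_loc_tokens` is a hypothetical helper, not a library function.

```python
from typing import Tuple


def box_to_loc_tokens(
    box: Tuple[float, float, float, float],
    image_width: int,
    image_height: int,
    bins: int = 1024,  # assumed number of location bins per axis
) -> str:
    """Quantize a pixel box (x_min, y_min, x_max, y_max) into four location tokens.

    Assumed output order: y_min, x_min, y_max, x_max.
    """
    x_min, y_min, x_max, y_max = box

    def quantize(value: float, size: int) -> int:
        index = int(value / size * bins)
        return min(max(index, 0), bins - 1)  # clamp to a valid bin

    ordered = [
        quantize(y_min, image_height),
        quantize(x_min, image_width),
        quantize(y_max, image_height),
        quantize(x_max, image_width),
    ]
    return "".join(f"<loc{index:04d}>" for index in ordered)


# Toy usage: a box in a 640x480 image becomes plain text the decoder can emit.
target_text = box_to_loc_tokens((160, 120, 320, 240), 640, 480) + " cat"
```

Because the box is just a short string of tokens followed by a class name, the model can be trained on detection with the same next-token loss it uses for captions, with no detection head and no matching step.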
So it doesn't have any of the bells and whistles. It doesn't have any of the fancy losses. It doesn't even have bipartite graph matching or anything like that.
[00:32:52] Peter Robicheaux: Okay, the big result, and one of the reasons that I was really excited about this paper, is that they blow everything else away [00:33:00] on MMVP. I mean, 47.3, sure, that's nowhere near human accuracy, which, again, is 94%, but for a, you know, 2 billion parameter language model to beat ChatGPT, that's quite the achievement.
[00:33:12] Peter Robicheaux: And that sort of brings us to our final pick for paper of the year, which is AIMv2. So, AIMv2 sort of says, okay, maybe coming up with all these specific annotations to find features with high fidelity in pixel space isn't actually necessary. And we can come up with an even simpler, more beautiful idea for combining, you know, image tokens and pixel tokens in a way that's interfaceable for language tasks.
[00:33:44] Peter Robicheaux: And this is nice because it can scale; you can come up with lots more data if you don't have to come up with all these annotations, right? So the way that it works is it does something very, very similar to PaliGemma, where you have a vision encoder that dumps image tokens into a decoder-only transformer.
[00:33:59] Peter Robicheaux: But [00:34:00] the interesting thing is that it also autoregressively tries to learn the mean squared error of the image tokens. So instead of having to come up with fancy object detection or segmentation labels, you can just try to reconstruct the image and have it learn fine-grained features that way.
[00:34:16] Peter Robicheaux: And it does this in kind of, I think, a beautiful way that's compatible with the PaliGemma line of thinking, which is randomly sampling a prefix length and using only this number of image tokens as the prefix.
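The prefix idea shared by PaliGemma and AIMv2 can be sketched as an attention mask: prefix positions see each other fully, and everything after the prefix is causal. This is a toy illustration in plain Python; `prefix_causal_mask` and `sample_prefix_len` are hypothetical names, and the range from which the prefix length is sampled is an assumption, not a detail given in the talk.

```python
import random
from typing import List


def prefix_causal_mask(seq_len: int, prefix_len: int) -> List[List[bool]]:
    """Return mask[i][j], which is True when position i may attend to position j.

    Every position sees the whole prefix; positions after the prefix also see
    earlier positions and themselves, as in ordinary causal attention.
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def sample_prefix_len(num_image_tokens: int, rng: random.Random) -> int:
    """Sample how many image tokens form the prefix (range is an assumption)."""
    return rng.randint(1, num_image_tokens - 1)


# Toy usage: 4 image tokens followed by 2 caption tokens.
rng = random.Random(0)
prefix_len = sample_prefix_len(4, rng)
mask = prefix_causal_mask(seq_len=6, prefix_len=prefix_len)
```

With `prefix_len` equal to zero this reduces to a standard causal mask, and with `prefix_len` equal to the sequence length it becomes full bidirectional attention, so the sampled length interpolates between the two regimes during training.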
And so it's doing a similar thing with the causal mask. So the causal-with-prefix is the attention mask on the right.
[00:34:35] Peter Robicheaux: So it's doing full block attention with some randomly sampled number of image tokens, to then reconstruct the rest of the image and the downstream caption for that image. And so, this is the dataset that they train on. It's internet-scale data, very high-quality data created by the Data Filtering Networks paper, essentially, which is maybe the best CLIP data that exists.
[00:34:59] Peter Robicheaux: [00:35:00] And we can see that this is finally a model that doesn't saturate. Even at the highest parameter count, it appears to be improving in performance with more and more samples seen. And so you can sort of think that, you know, if we just keep bumping the parameter count and increasing the examples seen, which is the line of thinking for language models, then it'll keep getting better.
[00:35:27] Peter Robicheaux: So how does it actually do at finding... oh, it also improves with resolution, which you would expect for a model that... This is the ImageNet classification accuracy, but yeah, it does better if you increase the resolution, which means that it's actually leveraging and finding fine-grained visual features.
[00:35:44] Peter Robicheaux: And so how does that actually do compared to CLIP on COCO? Well, you can see that if you slap a transformer detection head on it and train on COCO, it gets to 60.2, which is also within spitting distance of SOTA, which means that it does a very good job of [00:36:00] finding visual features. But you could say, okay, well, wait a second.
[00:36:03] Peter Robicheaux: CLIP got to 59.1, so how does this prove your claim at all?
Because doesn't that mean CLIP, which is known to be CLIP-blind and do badly on MMVP, is able to achieve a very high performance on this fine-grained visual features task of object detection? Well, they train on, like, tons of data.
[00:36:24] Peter Robicheaux: They train on Objects365, COCO, Flickr and everything else. And so I think that this benchmark doesn't do a great job of selling how good of a pre-trained model AIMv2 is. And we would like to see the performance with fewer data examples and not trained to convergence on object detection. So seeing it in the real world on a dataset like Roboflow 100, I think, would be quite interesting.
[00:36:48] Peter Robicheaux: And I guess our final, final pick for paper of 2024 would be Moondream. So introducing Vik to talk about that.
[00:36:54] swyx: But overall, that was exactly what I was looking for. Like, best of 2024, an amazing job. Yeah, [00:37:00] if there's any other questions while Vik gets set up, like vision stuff,
[00:37:07] swyx: yeah,
[00:37:11] swyx: Vik, go ahead. Hi,
[00:37:13] Vik Korrapati / Moondream
[00:37:13] question: well, while we're getting set up, hi, over here, thanks for the really awesome talk. One of the things that's been weird and surprising is that the foundation model companies, even these MLMs, they're just worse than RT-DETR at detection still. Like, if you wanted to pay a bunch of money to auto-label your detection dataset, if you gave it to OpenAI or Claude, that would be a big waste.
[00:37:37] question: So I'm curious, even PaliGemma 2 is worse. So I'm curious to hear your thoughts on how come nobody's cracked the code on a generalist that really, you know, beats a specialist model in computer vision like they have in LLM land.
[00:38:00]
[00:38:01] Isaac Robinson: Okay. It's a very, very interesting question. I think it depends on the specific domain.
For image classification, it's basically there. AIMv2 showed that a simple attentive probe on the pre-trained features gets like 90%, which is as well as anyone does. The bigger question is, why isn't it transferring to object detection, especially real-time object detection?
[00:38:25] Isaac Robinson: I think, in my mind, there are two answers. One is that in object detection, the architectures are super domain-specific. You know, we see all these super, super complicated things, and it's not super easy to build something that just transfers naturally like that, whereas for image classification, you know, CLIP pre-training transfers super, super quickly.
[00:38:48] Isaac Robinson: And the other thing is, until recently, the real-time object detectors didn't even really benefit from pre-training. Like, you see the YOLOs that are essentially saturated, showing very little [00:39:00] difference with pre-training improvements, or with using a pre-trained model at all. So it's not surprising, necessarily, that people aren't looking at the effects of better and better pre-training on real-time detection.
[00:39:12] Isaac Robinson: Maybe that'll change in the next year. Does that answer your question?
[00:39:17] Peter Robicheaux: Can you guys hear me? Yeah, one thing I want to add, or just to summarize, basically, is that until 2024, you know, we haven't really seen a combination of transformer-based object detectors and fancy losses, and PaliGemma suffers from the same problem, which is basically to say that these ResNets, or the convolutional models, they have all these extreme optimizations for doing object detection, but essentially, I think it's kind of been shown now that convolutional models just don't benefit from pre-training and just don't have the level of intelligence of transformer models.
[00:39:56] swyx: Awesome.
Hi,
[00:39:59] Vik Korrapati: can [00:40:00] you hear me?
[00:40:01] swyx: Cool. I hear you. See you. Are you sharing your screen?
[00:40:04] Vik Korrapati: Hi. Might have forgotten to do that. Let me do
[00:40:07] swyx: that. Sorry, should have done
[00:40:08] Vik Korrapati: that.
[00:40:17] swyx: Here's your screen. Oh, classic. You might have to quit Zoom and restart. What? It's fine. We have a capture of your screen.
[00:40:34] swyx: So let's get to it.
[00:40:35] Vik Korrapati: Okay, easy enough.
[00:40:49] Vik Korrapati: All right. Hi, everyone. My name is Vik. I've been working on Moondream for almost a year now. Like Shawn mentioned, I just went and looked, and it turns out I released the first version on December [00:41:00] 29, 2023. It's been a fascinating journey. So Moondream started off as a tiny vision language model. Since then, we've expanded scope a little bit to also try and build some tooling, client libraries, et cetera, to help people really deploy it.
[00:41:13] Vik Korrapati: Unlike traditional large models that are focused on assistant-type use cases, we're laser-focused on building capabilities that developers can use to build vision applications that can run anywhere. So, in a lot of cases for vision, more so than for text, you really care about being able to run on the edge, run in real time, etc.
[00:41:40] Vik Korrapati: So that's really important. We have different output modalities that we support. There's query, where you can ask general English questions about an image and get back human-like answers. There's captioning, which a lot of our users use for generating synthetic datasets to then train diffusion models and whatnot.
[00:41:57] Vik Korrapati: We've done a lot of work to minimize hallucinations there. [00:42:00] So that's used a lot.
We have open-vocabulary object detection built in, similar to a couple of more recent models like PaliGemma, et cetera, where rather than having to train a dedicated model, you can just say, show me soccer balls in this image, or show me if there are any deer in this image, and it'll detect it.
[00:42:14] Vik Korrapati: More recently, earlier this month, we released a pointing capability where, if all you're interested in is the center of an object, you can just ask it to point out where that is. This is very useful when you're doing, you know, UI automation type stuff. Let's see, we have two models out right now.
[00:42:33] Vik Korrapati: There's a general-purpose 2B param model, which runs fine if you're running on a server. It's good for our local Llama desktop friends, and it can run on flagship mobile phones, but it never [inaudible] [00:43:00] less memory, even with our not yet fully optimized inference client.
[00:43:06] Vik Korrapati: So the way we built our 0.5B model was to start with the 2 billion parameter model and prune it while doing continual training to retain performance. Our objective during the pruning was to preserve accuracy across a broad set of benchmarks. So the way we went about it was to estimate the importance of different components of the model, like attention heads, channels, MLP rows and whatnot, using basically a technique based on the gradient.
[00:43:37] Vik Korrapati: I'm not sure how much people want to know details. We'll be writing a paper about this, but feel free to grab me if you have more questions. Then we iteratively prune a small chunk that will minimize loss in performance, retrain the model to recover performance, and bring it back. The 0.5B we released is more of a proof of concept that this is possible.
[00:43:54] Vik Korrapati: I think the thing that's really exciting about this is it makes it possible for developers to build using the 2B param [00:44:00] model and just explore, build their application, and then, once they're ready to deploy, figure out what exactly they need out of the model and prune those capabilities into a smaller form factor that makes sense for their deployment target.
[00:44:12] Vik Korrapati: So yeah, very excited about that. Let me talk to you folks a little bit about another problem I've been working on recently, which is similar to the clocks example we've been talking about. We had a customer reach out who had a bunch of gauges out in the field. This is very common in manufacturing and oil and gas, where you have a bunch of analog devices that you need to monitor.
[00:44:34] Vik Korrapati: It's expensive to have humans look at that and monitor stuff and make sure that the system gets shut down when the temperature goes over 80 or something. So I was like, yeah, this seems easy enough. Happy to help you distill that. Let's get it going. Turns out our model couldn't do it at all.
[00:44:51] Vik Korrapati: I went and looked at other open source models to see if I could just generate a bunch of data and learn from that. Did not work either. So I was like, let's look at what the folks with [00:45:00] hundreds of billions of dollars in market cap have to offer. And yeah, that doesn't work either. My hypothesis is that the way these models are trained is using a large amount of image-text data scraped from the internet.
[00:45:15] Vik Korrapati: And that can be biased. In the case of gauges, most gauge images aren't gauges in the wild, they're product images. Detail images like these, where it's always set to zero.
It's paired with an alt text that says something like GIVTO, pressure sensor, PSI, zero to 30 or something. And so the models are fairly good at picking up those details.
[00:45:35] Vik Korrapati: It'll tell you that it's a pressure gauge. It'll tell you what the brand is, but it doesn't really learn to pay attention to the needle over there. And so, yeah, that's a gap we need to address. So naturally my mind goes to, let's use synthetic data to solve this problem. That works, but it's problematic because it turned out we needed millions of synthetic gauge images to get to reasonable performance.
[00:45:57] Vik Korrapati: And thinking about it, reading a gauge is [00:46:00] not a zero-shot process in our minds, right? Like, if you had to tell me the reading in Celsius for this real-world gauge, there's two dials on there. So first you have to figure out which one you have to be paying attention to, the inner one or the outer one.
[00:46:14] Vik Korrapati: You look at the tip of the needle, you look at what labels it's between, and you count how many ticks and do some math to figure out what that probably is. So what happens if we just add that as a chain of thought, to allow the model to better learn the subtasks it needs to perform to accomplish this goal?
[00:46:37] Vik Korrapati: So you can see in this example, this was actually generated by the latest version of our model. It's like, okay, Celsius is the inner scale. It's between 50 and 60. There's 10 ticks. So the second tick. It's a little debatable here, like there's a weird shadow situation going on, the dial is off, so I don't know what the ground truth is, but it works okay.
[00:46:57] Vik Korrapati: The points [00:47:00] over there are actually grounded. I don't know if this is easy to see, but when I click on those, there's a little red dot that moves around on the image.
The model actually has to predict where these points are. I was already trying to do this with bounding boxes, but then Molmo came out with pointing capabilities.
[00:47:15] Vik Korrapati: And pointing is a much better paradigm to represent this. We see pretty good results. This one's actually for clock reading. I couldn't find our chart for gauge reading at the last minute. So the light blue chart is with our grounded chain of thought. We built a clock reading benchmark of about 500 images.
[00:47:37] Vik Korrapati: This measures accuracy on that. You can see it's a lot more sample-efficient when you're using the chain of thought to model. Another big benefit from this approach is, you can kind of understand how the model is doing it and how it's failing. So in this example, the actual correct reading is 54 Celsius, the model output [00:48:00] 56, not too bad, but you can actually go and see where it messed up. Like, it got a lot of these right, except instead of saying it was on the 7th tick, it actually predicted that it was the 8th tick, and that's why it went with 56.
[00:48:14] Vik Korrapati: So now that you know that it's failing in this way, you can adjust how you're doing the chain of thought to maybe say, actually count out each tick from 40, instead of just trying to say it's the eighth tick. Or you might say, okay, I see that there's that middle thing, I'll count from there instead of all the way from 40.
[00:48:31] Vik Korrapati: So it helps a ton. The other thing I'm excited about is few-shot prompting or test-time training with this. Like, if a customer has a specific gauge that we're seeing minor errors on, they can give us a couple of examples where, if it's misdetecting the needle, they can go in and correct that in the chain of thought.
[00:48:49] Vik Korrapati: And hopefully that works the next time. Now, it's an exciting approach, but we've only applied it to clocks and gauges.
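The arithmetic behind the 54-versus-56 example can be written out as a small sketch. The 2-degrees-per-tick spacing is an inference from the numbers in the talk (counting from 40, the 7th tick gives 54 and the 8th gives 56), and `reading_from_tick` and `tick_error` are hypothetical helper names used only for illustration.

```python
def reading_from_tick(start_label: float, units_per_tick: float, tick_index: int) -> float:
    """Final chain-of-thought step: convert a counted tick into a gauge reading."""
    return start_label + tick_index * units_per_tick


def tick_error(predicted: float, truth: float, units_per_tick: float) -> float:
    """Express a reading error as a number of ticks, to locate the failing step."""
    return (predicted - truth) / units_per_tick


# Counting from the 40 label with an inferred 2 degrees per tick:
correct = reading_from_tick(40, 2.0, 7)    # the needle is on the 7th tick
predicted = reading_from_tick(40, 2.0, 8)  # the model counted the 8th tick
off_by = tick_error(predicted, correct, 2.0)
```

Since the scale choice and the starting label were right and only the tick count was off by one, the intermediate steps show that the counting step is the part of the chain of thought worth adjusting.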
The real question is, is it going to generalize? Probably, like, there's some signs [00:49:00] from text models that when you train on a broad number of tasks, it does generalize. And I'm seeing some signs with our model as well.
[00:49:05] Vik Korrapati: So, in addition to the image-based chain of thought stuff, I also added some spelling-based chain of thought to help it better understand OCR, I guess. I don't understand why everyone doesn't do this, by the way. Like, it's a trivial benchmark question. It's very, very easy to nail. But I also wanted to support it for stuff like license plate partial matching, like, hey, does any license plate in this image start with WHA or whatever?
[00:49:29] Vik Korrapati: So yeah, that sort of worked. All right, that ends my story about the gauges. If you think about what's going on over here, it's interesting that LLMs are showing enormous progress in reasoning, especially with the latest set of models that we've seen, but I have a feeling that VLMs are lagging behind, as we can see with these tasks that should be very simple for a human to do [00:50:00] that are very easy to find VLMs failing at.
[00:50:04] Vik Korrapati: My hypothesis on why this is the case is because on the internet, there's a ton of data that talks about how to reason. There's books about how to solve problems. There's books critiquing the books about how to solve problems. But humans are just so good at perception that we never really talk about it.
[00:50:20] Vik Korrapati: Like, maybe in art books where it's like, hey, to show that that mountain is further away, you need to desaturate it a bit or whatever. But the actual data on how to look at images isn't really present. Also, the data we have is kind of sketchy.
The best source of data we have is image alt-text pairs on the internet, and that's pretty low quality.
[00:50:40] Vik Korrapati: So yeah, I think our solution here is really just that we need to teach them how to operate on individual tasks and figure out how to scale that out. All right. Yep. So, conclusion. At Moondream we're trying to build amazing VLMs that run everywhere. Very hard problem. Much work ahead, but we're making a ton of progress and I'm really excited [00:51:00] about it. If anyone wants to chat about more technical details about how we're doing this or is interested in collaborating, please, please hit me up.
[00:51:08] Isaac Robinson: Yeah,
[00:51:09] swyx: like, when people say multi-modality, you know, I always think about vision as the first among equals in all the modalities. So, I really appreciate having the experts in the room. Get full access to Latent Space at www.latent.space/subscribe

Dr. Olga Russakovsky: Shaping the Next Generation of AI Leaders

Generative Now | AI Builders on Creating the Future

Play Episode Listen Later Dec 19, 2024 44:32

Dr. Olga Russakovsky, Computer Science at Princeton University, joins Lightspeed Partner Michael Mignano to discuss what the next generation of AI talent is learning and where she expects to find the next big innovation in artificial intelligence. From her research in computer vision and human-computer interaction to her work in fairness, accountability, and transparency in AI, Dr. Russakovsky has earned many awards, including the MIT Technology Review's 35-under-35 Innovator award and the Foreign Policy Magazine's 100 Leading Global Thinkers award. Dr. Russakovsky is also the co-founder and Board Chair of AI4ALL, a nonprofit that aims to increase the diversity of thought in Artificial Intelligence. Episode Chapters 00:00 Introduction and Guest Overview 01:17 Olga's Career Journey 02:30 Understanding Computer Vision 04:43 Generative AI and Computer Vision 06:36 Interdisciplinary AI Research 15:00 AI4All: Diversity of Thought 17:44 Challenges and Bias in AI 30:01 Future of AI and Data 40:08 ImageNET 43:38 Closing Thoughts Stay in touch: www.lsvp.com X: https://twitter.com/lightspeedvp LinkedIn: https://www.linkedin.com/company/lightspeed-venture-partners/ Instagram: https://www.instagram.com/lightspeedventurepartners/ Subscribe on your favorite podcast app: generativenow.co Email: generativenow@lsvp.com The content here does not constitute tax, legal, business or investment advice or an offer to provide such advice, should not be construed as advocating the purchase or sale of any security or investment or a recommendation of any company, and is not an offer, or solicitation of an offer, for the purchase or sale of any security or investment product. For more details please see lsvp.com/legal.

ai future challenges data leaders artificial intelligence next generation bias shaping computer science innovators princeton university board chair career journey computer vision mit technology review foreign policy magazine imagenet ai4all

Chuyện đêm (Night Talk) - Prof. Fei-Fei Li of Stanford University, USA: her AI research journey and her recommendations for Vietnam's AI development

VOV - Sự kiện và Bàn luận

Play Episode Listen Later Dec 13, 2024 13:21

- Prof. Fei-Fei Li (Stanford University, USA) is one of five scientists honored with the 2024 VinFuture Grand Prize, worth 3 million USD, for breakthrough contributions to advancing deep learning. She created the ImageNet dataset, which helped drive progress in image recognition systems and made it possible to train deep learning models at large scale. In tonight's Chuyện đêm, we invite you to join a conversation with the American scientist known as the "godmother" of AI, renowned for her breakthrough contributions to computer vision. Topics: Prof. Fei-Fei Li, Stanford University, USA, AI research, Vietnam's AI development --- Support this podcast: https://podcasters.spotify.com/pod/show/vov1sukien/support

ai stanford usd gs fei fei li imagenet

#220 Terry Sejnowski: The Future of AI, ChatGPT & Deep Learning

Eye On A.I.

Play Episode Listen Later Nov 22, 2024 60:28

This episode of Eye on AI is sponsored by Citrusx. Unlock reliable AI with Citrusx! Our platform simplifies validation and risk management, empowering you to make smarter decisions and stay compliant. Detect and mitigate AI vulnerabilities, biases, and errors with ease. Visit http://email.citrusx.ai/eyeonai to download our free fairness use case and see the solution in action. In this episode of the Eye on AI podcast, Terry Sejnowski, a pioneer in neural networks and computational neuroscience, joins Craig Smith to discuss the future of AI, the evolution of ChatGPT, and the challenges of understanding intelligence. Terry, a key figure in the deep learning revolution, shares insights into how neural networks laid the foundation for modern AI, including ChatGPT's groundbreaking generative capabilities. From its ability to mimic human-like creativity to its limitations in true understanding, we explore what makes ChatGPT remarkable and what it still lacks compared to human cognition. We also dive into fascinating topics like the debate over AI sentience, the concept of "hallucinations" in AI models, and how language models like ChatGPT act as mirrors reflecting user input rather than possessing intrinsic intelligence. Terry explains how understanding language and meaning in AI remains one of the field's greatest challenges. Additionally, Terry shares his perspective on nature-inspired AI and what it will take to develop systems that go beyond prediction to exhibit true autonomy and decision-making. Learn why AI models like ChatGPT are revolutionary yet incomplete, how generative AI might redefine creativity, and what the future holds for AI as we continue to push its boundaries. Don't miss this deep dive into the fascinating world of AI with Terry Sejnowski. Like, subscribe, and hit the notification bell for more cutting-edge AI insights! Stay Updated: Craig Smith Twitter: https://twitter.com/craigss Eye on A.I. 
Twitter: https://twitter.com/EyeOn_AI (00:00) Introduction to Terry Sejnowski and His Work (03:02) The Origins of Modern AI and Neural Networks (05:29) The Deep Learning Revolution and ImageNet (07:11) Understanding ChatGPT and Generative AI (12:34) Exploring AI Creativity (16:03) Lessons from Gaming AI: AlphaGo and Backgammon (18:37) Early Insights into AI's Affinity for Language (24:48) Syntax vs. Semantics: The Purpose of Language (30:00) How Written Language Transformed AI Training (35:10) Can AI Become Sentient? (41:37) AI Agents and the Next Frontier in Automation (45:43) Nature-Inspired AI: Lessons from Biology (50:02) Digital vs. Biological Computation: Key Differences (54:29) Will AI Replace Jobs? (57:07) The Future of AI

Some Changes at The Gradient

The Gradient Podcast

Play Episode Listen Later Nov 21, 2024 34:25

Hi everyone! If you're a new subscriber or listener, welcome. If you're not new, you've probably noticed that things have slowed down from us a bit recently. Hugh Zhang, Andrey Kurenkov and I sat down to recap some of The Gradient's history, where we are now, and how things will look going forward.

To summarize and give some context: The Gradient has been around for about 6 years now. We began as an online magazine, and began producing our own newsletter and podcast about 4 years ago. With a team of volunteers (we take in a bit of money through Substack that we use for subscriptions to tools we need and try to pay ourselves a bit), we've been able to keep this going for quite some time. Our team has less bandwidth than we'd like right now (and I'll admit that at least some of us are running on fumes…), so we'll be making a few changes:

* Magazine: We're going to be scaling down our editing work on the magazine. While we won't be accepting pitches for unwritten drafts for now, if you have a full piece that you'd like to pitch to us, we'll consider posting it. If you've reached out about writing and haven't heard from us, we're really sorry. We've tried a few different arrangements to manage the pipeline of articles we have, but it's been difficult to make it work. We still want this to be a place to promote good work and writing from the ML community, so we intend to continue using this Substack for that purpose. If we have more editing bandwidth on our team in the future, we want to continue doing that work.
* Newsletter: We'll aim to continue the newsletter as before, but with a “Best from the Community” section highlighting posts. We'll have a way for you to send articles you want to be featured, but for now you can reach us at editor@thegradient.pub.
* Podcast: I'll be continuing this (at a slower pace), but will eventually transition it away from The Gradient given the expanded range. If you're interested in following, it might be worth subscribing on another player like Apple Podcasts, Spotify, or using the RSS feed.
* Sigmoid Social: We'll keep this alive as long as there's financial support for it.

If you like what we do and/or want to help us out in any way, do reach out to editor@thegradient.pub. We love hearing from you.

Timestamps
* (0:00) Intro
* (01:55) How The Gradient began
* (03:23) Changes and announcements
* (10:10) More Gradient history! On our involvement, favorite articles, and some plugs

Some of our favorite articles!
There are so many, so this is very much a non-exhaustive list:
* NLP's ImageNet moment has arrived
* The State of Machine Learning Frameworks in 2019
* Why transformative artificial intelligence is really, really hard to achieve
* An Introduction to AI Story Generation
* The Artificiality of Alignment (I didn't mention this one in the episode, but it should be here)

Places you can find us!
Hugh:
* Twitter
* Personal site
* Papers/things mentioned!
* A Careful Examination of LLM Performance on Grade School Arithmetic (GSM1k)
* Planning in Natural Language Improves LLM Search for Code Generation
* Humanity's Last Exam
Andrey:
* Twitter
* Personal site
* Last Week in AI Podcast
Daniel:
* Twitter
* Substack blog
* Personal site (under construction)

Get full access to The Gradient at thegradientpub.substack.com/subscribe


Vol.048 AI Godfather and Nobel Laureate Hinton: Nearly Joined Baidu, Steadfast Believer in Neural Networks

MacTalk·夜航西飞

Play Episode Listen Later Oct 12, 2024 20:33


The Frontier of Spatial Intelligence with Fei-Fei Li

a16z

Play Episode Listen Later Sep 19, 2024 44:40

Fei-Fei Li and Justin Johnson are pioneers in AI. While the world has only recently witnessed a surge in consumer AI, our guests have long been laying the groundwork for innovations that are transforming industries today.

In this episode, a16z General Partner Martin Casado joins Fei-Fei and Justin to explore the journey from early AI winters to the rise of deep learning and the rapid expansion of multimodal AI. From foundational advancements like ImageNet to the cutting-edge realm of spatial intelligence, Fei-Fei and Justin share the breakthroughs that have shaped the AI landscape and reveal what's next for innovation at World Labs.

If you're curious about how AI is evolving beyond language models and into a new realm of 3D, generative worlds, this episode is a must-listen.

Resources:
Learn more about World Labs: https://www.worldlabs.ai
Find Fei-Fei on Twitter: https://x.com/drfeifei
Find Justin on Twitter: https://x.com/jcjohnss

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.


Jim Fan on Nvidia's Embodied AI Lab and Jensen Huang's Prediction that All Robots will be Autonomous

Training Data

Play Episode Listen Later Sep 17, 2024 49:13

AI researcher Jim Fan has had a charmed career. He was OpenAI's first intern before he did his PhD at Stanford with “godmother of AI,” Fei-Fei Li. He graduated into a research scientist position at Nvidia and now leads its Embodied AI “GEAR” group. The lab's current work spans from foundation models for humanoid robots to agents for virtual worlds. Jim describes a three-pronged data strategy for robotics, combining internet-scale data, simulation data and real world robot data. He believes that in the next few years it will be possible to create a “foundation agent” that can generalize across skills, embodiments and realities, both physical and virtual. He also supports Jensen Huang's idea that “Everything that moves will eventually be autonomous.”

Hosted by: Stephanie Zhan and Sonya Huang, Sequoia Capital

Mentioned in this episode:
World of Bits: Early OpenAI project Jim worked on as an intern with Andrej Karpathy. Part of a bigger initiative called Universe
Fei-Fei Li: Jim's PhD advisor at Stanford who created ImageNet (first published in 2009, with its annual challenge starting in 2010), the project that revolutionized the field of visual recognition; she led the Stanford Vision Lab and just launched her own AI startup, World Labs
Project GR00T: Nvidia's “moonshot effort” at a robotic foundation model, premiered at this year's GTC
Thinking Fast and Slow: Influential book by Daniel Kahneman that popularized some of his teaching from behavioral economics
Jetson Orin chip: The dedicated series of edge computing chips Nvidia is developing to power Project GR00T
Eureka: Project by Jim's team that trained a five finger robot hand to do pen spinning
MineDojo: A project Jim did when he first got to Nvidia that developed a platform for general purpose agents in the game of Minecraft. Won NeurIPS 2022 Outstanding Paper Award
ADI: artificial dog intelligence
Mamba: Selective State Space Models, an alternative architecture to Transformers that Jim is interested in

00:00 Introduction
01:35 Jim's journey to embodied intelligence
04:53 The GEAR Group
07:32 Three kinds of data for robotics
10:32 A GPT-3 moment for robotics
16:05 Choosing the humanoid robot form factor
19:37 Specialized generalists
21:59 GR00T gets its own chip
23:35 Eureka and Isaac Sim
25:23 Why now for robotics?
28:53 Exploring virtual worlds
36:28 Implications for games
39:13 Is the virtual world in service of the physical world?
42:10 Alternative architectures to Transformers
44:15 Lightning round

Adversarial Examples and Data Modelling - Andrew Ilyas (MIT)

Machine Learning Street Talk

Play Episode Listen Later Aug 22, 2024 88:00

Andrew Ilyas is a PhD student at MIT who is about to start as a professor at CMU. We discuss data modeling and understanding how datasets influence model predictions, adversarial examples in machine learning and why they occur, robustness in machine learning models, black box attacks on machine learning systems, biases in data collection and dataset creation (particularly in ImageNet), and self-selection bias in data and methods to address it.

MLST is sponsored by Brave: The Brave Search API covers over 20 billion webpages, built from scratch without Big Tech biases or the recent extortionate price hikes on search API access. Perfect for AI model training and retrieval augmented generation. Try it now - get 2,000 free queries monthly at http://brave.com/api

Andrew's site: https://andrewilyas.com/
https://x.com/andrew_ilyas

TOC:
00:00:00 - Introduction and Andrew's background
00:03:52 - Overview of the machine learning pipeline
00:06:31 - Data modeling paper discussion
00:26:28 - TRAK: Evolution of data modeling work
00:43:58 - Discussion on abstraction, reasoning, and neural networks
00:53:16 - "Adversarial Examples Are Not Bugs, They Are Features" paper
01:03:24 - Types of features learned by neural networks
01:10:51 - Black box attacks paper
01:15:39 - Work on data collection and bias
01:25:48 - Future research plans and closing thoughts

References:
Adversarial Examples Are Not Bugs, They Are Features https://arxiv.org/pdf/1905.02175
TRAK: Attributing Model Behavior at Scale https://arxiv.org/pdf/2303.14186
Datamodels: Predicting Predictions from Training Data https://arxiv.org/pdf/2202.00622
IMAGENET-TRAINED CNNS https://arxiv.org/pdf/1811.12231
ZOO: Zeroth Order Optimization Based Black-box https://arxiv.org/pdf/1708.03999
A Spline Theory of Deep Networks https://proceedings.mlr.press/v80/balestriero18b/balestriero18b.pdf
Scaling Monosemanticity https://transformer-circuits.pub/2024/scaling-monosemanticity/
Adversarial Examples Are Not Bugs, They Are Features https://gradientscience.org/adv/
Adversarial Robustness Limits via Scaling-Law and Human-Alignment Studies https://proceedings.mlr.press/v235/bartoldson24a.html
Prior Convictions: Black-Box Adversarial Attacks with Bandits and Priors https://arxiv.org/abs/1807.07978
Estimation of Standard Auction Models https://arxiv.org/abs/2205.02060
From ImageNet to Image Classification: Contextualizing Progress on Benchmarks https://arxiv.org/abs/2005.11295
What Makes A Good Fisherman? Linear Regression under Self-Selection Bias https://arxiv.org/abs/2205.03246
Towards Tracing Factual Knowledge in Language Models Back to the Training Data [Akyürek] https://arxiv.org/pdf/2205.11482


AF - The 'strong' feature hypothesis could be wrong by lewis smith

The Nonlinear Library

Play Episode Listen Later Aug 2, 2024 31:14

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The 'strong' feature hypothesis could be wrong, published by lewis smith on August 2, 2024 on The AI Alignment Forum.

NB. I am on the Google DeepMind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position.

"It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an "ideal" ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout" (Elhage et al., Toy Models of Superposition)

Recently, much attention in the field of mechanistic interpretability, which tries to explain the behavior of neural networks in terms of interactions between lower level components, has been focussed on extracting features from the representation space of a model. The predominant methodology for this has used variations on the sparse autoencoder, in a series of papers inspired by Elhage et al.'s model of superposition. It's been conventionally understood that there are two key theories underlying this agenda. The first is the 'linear representation hypothesis' (LRH), the hypothesis that neural networks represent many intermediates or variables of the computation (such as the 'features of the input' in the opening quote) as linear directions in its representation space, or atoms[1]. And second, the theory that the network is capable of representing more of these 'atoms' than it has dimensions in its representation space, via superposition (the superposition hypothesis). While superposition is a relatively uncomplicated hypothesis, I think the LRH is worth examining in more detail.

It is frequently stated quite vaguely, and I think there are several possible formulations of this hypothesis, with varying degrees of plausibility, that it is worth carefully distinguishing between. For example, the linear representation hypothesis is often stated as 'networks represent features of the input as directions in representation space'. Here are two importantly different ways to parse this:

1. (Weak LRH) some or many features used by neural networks are represented as atoms in representation space
2. (Strong LRH) all (or the vast majority of) features used by neural networks are represented by atoms.

The weak LRH I would say is now well supported by considerable empirical evidence. The strong form is much more speculative: confirming the existence of many linear representations does not necessarily provide strong evidence for the strong hypothesis. Both the weak and the strong forms of the hypothesis can still have considerable variation, depending on what we understand by a feature and the proportion of the model we expect to yield to analysis, but I think that the distinction between just a weak and strong form is clear enough to work with.

I think that in addition to the acknowledged assumption of the LRH and superposition hypotheses, much work on SAEs in practice makes the assumption that each atom in the network will represent a "simple feature" or a "feature of the input". These features that the atoms are representations of are assumed to be 'monosemantic': they will all stand for features which are human interpretable in isolation. I will call this the monosemanticity assumption. This is difficult to state precisely, but we might formulate it as the theory that every represented variable will have a single meaning in a good description of a model. This is not a straightforward assumption due to how imprecise the notion of a single meaning is. While various more or less reasonable definitions for features are discussed in the pioneering work of Elhage, these assumptions have different implications. For instance, if one thinks of 'feat...
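
The linear representation and superposition hypotheses described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the post: three hypothetical features are stored as unit directions ("atoms") spaced 120 degrees apart in a 2-dimensional space, and each feature is read back by projecting onto its atom and applying a ReLU.

```python
import math

# Hypothetical toy setup: 3 features squeezed into a 2-dimensional space,
# so there are more atoms than dimensions (superposition).
N_FEATURES = 3
ATOMS = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]


def encode(feature_values):
    """Linear representation: sum of each feature value times its atom."""
    x = sum(v * a[0] for v, a in zip(feature_values, ATOMS))
    y = sum(v * a[1] for v, a in zip(feature_values, ATOMS))
    return (x, y)


def readout(representation):
    """Project onto each atom; the ReLU clips negative interference to zero."""
    return [
        max(0.0, representation[0] * a[0] + representation[1] * a[1])
        for a in ATOMS
    ]


# One active feature: it is recovered cleanly despite the shared space.
sparse = readout(encode([1.0, 0.0, 0.0]))

# Two active features: the atoms are not orthogonal, so the readouts interfere.
dense = readout(encode([1.0, 1.0, 0.0]))

print(sparse)
print(dense)
```

In this toy, a single active feature is read back exactly, while two co-occurring features each read back at roughly half strength. That is the usual intuition for why superposition is thought to rely on features being sparse.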


LW - The 'strong' feature hypothesis could be wrong by lsgos

The Nonlinear Library

Play Episode Listen Later Aug 2, 2024 30:53

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The 'strong' feature hypothesis could be wrong, published by lsgos on August 2, 2024 on LessWrong.

NB. I am on the Google DeepMind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position.

"It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an "ideal" ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout" (Elhage et al., Toy Models of Superposition)

Recently, much attention in the field of mechanistic interpretability, which tries to explain the behavior of neural networks in terms of interactions between lower level components, has been focussed on extracting features from the representation space of a model. The predominant methodology for this has used variations on the sparse autoencoder, in a series of papers inspired by Elhage et al.'s model of superposition. Conventionally, there are understood to be two key theories underlying this agenda. The first is the 'linear representation hypothesis' (LRH), the hypothesis that neural networks represent many intermediates or variables of the computation (such as the 'features of the input' in the opening quote) as linear directions in its representation space, or atoms[1]. And second, the theory that the network is capable of representing more of these 'atoms' than it has dimensions in its representation space, via superposition (the superposition hypothesis). While superposition is a relatively uncomplicated hypothesis, I think the LRH is worth examining in more detail.

It is frequently stated quite vaguely, and I think there are several possible formulations of this hypothesis, with varying degrees of plausibility, that it is worth carefully distinguishing between. For example, the linear representation hypothesis is often stated as 'networks represent features of the input as directions in representation space'. There are a few possible formulations of this:

1. (Weak LRH) some features used by neural networks are represented as atoms in representation space
2. (Strong LRH) all features used by neural networks are represented by atoms.

The weak LRH I would say is now well supported by considerable empirical evidence. The strong form is much more speculative: confirming the existence of many linear representations does not necessarily provide strong evidence for the strong hypothesis. Both the weak and the strong forms of the hypothesis can still have considerable variation, depending on what we understand by a feature.

I think that in addition to the acknowledged assumption of the LRH and superposition hypotheses, much work on SAEs in practice makes the assumption that each atom in the network will represent a "simple feature" or a "feature of the input". These features that the atoms are representations of are assumed to be 'monosemantic': they will all stand for features which are human interpretable in isolation. I will call this the monosemanticity assumption. This is difficult to state precisely, but we might formulate it as the theory that every represented variable will have a single meaning in a good description of a model. This is not a straightforward assumption due to how imprecise the notion of a single meaning is. While various more or less reasonable definitions for features are discussed in the pioneering work of Elhage, these assumptions have different implications. For instance, if one thinks of 'features' as computational intermediates in a broad sense, then superposition and the LRH imply a certain picture of the format of a model's internal representation: that what the network is doing is manipulating atoms in superposition (if y...


History of AI - EP06 Part 1: The Effortless Podcast

The Effortless Podcast

Play Episode Listen Later Jul 13, 2024 68:16

Key Topics & Chapter Markers:
AI's Evolutionary Journey & Key Challenges [00:00:00]
Neural Networks: Inspiration from Biology [00:01:00]
Weighted Sum, Inputs & Mathematical Functions [00:05:00]
Gradient Descent & Optimization in Neural Nets [00:10:15]
Computing Architecture: CPUs vs. GPUs [00:39:56]
RNNs and Early Problems in Memory & Context [01:03:00]
The Emergence of Convolutional Neural Networks (CNNs) [01:10:00]
ImageNet, GPUs & Scaling Neural Networks [01:24:00]

Share Your Thoughts: Have questions or comments? Drop us a mail at EffortlessPodcastHQ@gmail.com


Oral History: Human Centered AI

SHIFT

Play Episode Listen Later Jul 10, 2024 17:41

We meet Dr. Fei-Fei Li in the latest installment of our oral history project. She's a Chinese-American computer scientist and the creator of ImageNet, the dataset that made rapid advances possible in the field of AI that helps computers extract meaningful information from things like photos and videos.

We Meet: Stanford University's Fei-Fei Li, author of "The Worlds I See: Curiosity, Exploration, and Discovery at the Dawn of AI"

Credits: This episode of SHIFT was produced by Jennifer Strong with help from Emma Cillekens. It was mixed by Garret Lang, with original music from him and Jacob Gorski. Art by Anthony Green.


LW - Rational Animations' intro to mechanistic interpretability by Writer

The Nonlinear Library

Play Episode Listen Later Jun 15, 2024 16:06

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Rational Animations' intro to mechanistic interpretability, published by Writer on June 15, 2024 on LessWrong.

In our new video, we talk about research on interpreting InceptionV1, a convolutional neural network. Researchers have been able to understand the function of neurons and channels inside the network and uncover visual processing algorithms by looking at the weights. The work on InceptionV1 is early but landmark mechanistic interpretability research, and it functions well as an introduction to the field. We also go into the rationale and goals of the field and mention some more recent research near the end. Our main source material is the circuits thread in the Distill journal and an article on feature visualization. The author of the script is Arthur Frost. I have included the script below, although I recommend watching the video since the script has been written with accompanying moving visuals in mind.

Intro

In 2018, researchers trained an AI to find out if people were at risk of heart conditions based on pictures of their eyes, and somehow the AI also learned to tell people's biological sex with incredibly high accuracy. How? We're not entirely sure. The crazy thing about Deep Learning is that you can give an AI a set of inputs and outputs, and it will slowly work out for itself what the relationship between them is. We didn't teach AIs how to play chess, Go, and Atari games by showing them human experts - we taught them how to work it out for themselves. And the issue is, now they have worked it out for themselves, and we don't know what it is they worked out. Current state-of-the-art AIs are huge. Meta's largest LLaMA2 model uses 70 billion parameters spread across 80 layers, all doing different things. It's deep learning models like these which are being used for everything from hiring decisions to healthcare and criminal justice to what YouTube videos get recommended. Many experts believe that these models might even one day pose existential risks. So as these automated processes become more widespread and significant, it will really matter that we understand how these models make choices.

The good news is, we've got a bit of experience uncovering the mysteries of the universe. We know that humans are made up of trillions of cells, and by investigating those individual cells we've made huge advances in medicine and genetics. And learning the properties of the atoms which make up objects has allowed us to develop modern material science and high-precision technology like computers. If you want to understand a complex system with billions of moving parts, sometimes you have to zoom in. That's exactly what Chris Olah and his team did starting in 2015. They focused on small groups of neurons inside image models, and they were able to find distinct parts responsible for detecting everything from curves and circles to dog heads and cars.

In this video we'll
Briefly explain how (convolutional) neural networks work
Visualise what individual neurons are doing
Look at how neurons - the most basic building blocks of the neural network - combine into 'circuits' to perform tasks
Explore why interpreting networks is so hard

There will also be lots of pictures of dogs, like this one. Let's get going.

We'll start with a brief explanation of how convolutional neural networks are built. Here's a network that's trained to label images. An input image comes in on the left, and it flows along through the layers until we get an output on the right - the model's attempt to classify the image into one of the categories. This particular model is called InceptionV1, and the images it's learned to classify are from a massive collection called ImageNet.
ImageNet has 1000 different categories of image, like "sandal" and "saxophone" and "sarong" (which, if you don't know, is a k...


ICLR 2024 — Best Papers & Talks (ImageGen, Vision, Transformers, State Space Models) ft. Christian Szegedy, Ilya Sutskever, Durk Kingma

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later May 27, 2024 218:03

Speakers for AI Engineer World's Fair have been announced! See our Microsoft episode for more info and buy now with code LATENTSPACE. We've been studying the best ML research conferences so we can make the best AI industry conf! Note that this year there are 4 main tracks per day and dozens of workshops/expo sessions; the free livestream will air much less than half of the content this time. Apply for free/discounted Diversity Program and Scholarship tickets here. We hope to make this the definitive technical conference for ALL AI engineers.

ICLR 2024 took place from May 6-11 in Vienna, Austria. Just like we did for our extremely popular NeurIPS 2023 coverage, we decided to pay the $900 ticket (thanks to all of you paying supporters!) and brave the 18 hour flight and 5 day grind to go on behalf of all of you. We now present the results of that work!

This ICLR was the biggest one by far, with a marked change in the excitement trajectory for the conference. Of the 2260 accepted papers (31% acceptance rate), among the subset relevant to our shortlist of AI Engineering Topics, we found many, many LLM reasoning and agent related papers, which we will cover in the next episode. We will spend this episode with 14 papers covering other relevant ICLR topics, as below.

As we did last year, we'll start with the Best Paper Awards. Unlike last year, we now group our paper selections by subjective topic area, and mix in both Outstanding Paper talks as well as editorially selected poster sessions. Where we were able to do a poster session interview, please scroll to the relevant show notes for images of their poster for discussion. To cap things off, Chris Ré's spot from last year now goes to Sasha Rush for the obligatory last word on the development and applications of State Space Models.

We had a blast at ICLR 2024 and you can bet that we'll be back in 2025.

205 – ImageNet and Flogistix Team Up for Immersive Experience with Ali Sylvester and Kyle Kempf

The Daktronics Experience

Play Episode Listen Later May 22, 2024 37:49

When a massive direct-view LED video wall is installed to make a completely immersive experience, teams collaborate to make sure it's a successful project. To hear all the details of this project for Flogistix, Justin and Matt are joined by Ali Sylvester, Director of Business Solutions at Flogistix, and Kyle Kempf, CTS-I Director of Commercial Audio Video at ImageNet Consulting. They dig into the vision for the space, the architecture that went into it and everything else that brings the project to life. Links: Daktronics News Release: https://www.daktronics.com/news/imagenet-and-daktronics-deliver-led-video-wall-experience-for-flogistix Flogistix Website: https://flogistix.com/ ImageNet Consulting Website: https://www.imagenetconsulting.com/ Rand Elliott Architects Website: https://randelliottarchitects.com/ Daktronics and ImageNet Podcast: https://podcast.daktronics.com/e/143-imagenet-consulting-with-kyle-kempf/


LW - Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers by hugofry

The Nonlinear Library

Play Episode Listen Later Apr 30, 2024 19:44

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers, published by hugofry on April 30, 2024 on LessWrong.

Two Minute Summary

In this post I present my results from training a Sparse Autoencoder (SAE) on a CLIP Vision Transformer (ViT) using the ImageNet-1k dataset. I have created an interactive web app, 'SAE Explorer', to allow the public to explore the visual features the SAE has learnt, found here: https://sae-explorer.streamlit.app/ (best viewed on a laptop). My results illustrate that SAEs can identify sparse and highly interpretable directions in the residual stream of vision models, enabling inference time inspections on the model's activations. To demonstrate this, I have included a 'guess the input image' game on the web app that allows users to guess the input image purely from the SAE activations of a single layer and token of the residual stream. I have also uploaded a (slightly outdated) accompanying talk of my results, primarily listing SAE features I found interesting: https://youtu.be/bY4Hw5zSXzQ.

The primary purpose of this post is to demonstrate and emphasise that SAEs are effective at identifying interpretable directions in the activation space of vision models. In this post I highlight a small number of my favourite SAE features to demonstrate some of the abstract concepts the SAE has identified within the model's representations. I then analyse a small number of SAE features using feature visualisation to check the validity of the SAE interpretations. Later in the post, I provide some technical analysis of the SAE. I identify a large cluster of features analogous to the 'ultra-low frequency' cluster that Anthropic identified. In line with existing research, I find that this ultra-low frequency cluster represents a single feature. I then analyse the 'neuron-alignment' of SAE features by comparing the SAE encoder matrix with the MLP out matrix.

This research was conducted as part of the ML Alignment and Theory Scholars program 2023/2024 winter cohort. Special thanks to Joseph Bloom for providing generous amounts of his time and support (in addition to the SAE Lens code base) as well as LEAP labs for helping to produce the feature visualisations and weekly meetings with Jessica Rumbelow.

Example, animals eating other animals feature: (top 16 highest activating images)

Example, Italian feature: Note that the photo of the dog has a watermark with a website ending in .it (Italy's domain name). Note also that the bottom left photo is of Italian writing. The number of ambulances present is a byproduct of using ImageNet-1k.

Motivation

Frontier AI systems are becoming increasingly multimodal, and capabilities may advance significantly as multimodality increases due to transfer learning between different data modalities and tasks. As a heuristic, consider how much intuition humans gain for the world through visual reasoning; even in abstract settings such as in maths and physics, concepts are often understood most intuitively through visual reasoning. Many cutting edge systems today such as DALL-E and Sora use ViTs trained on multimodal data. Almost by definition, AGI is likely to be multimodal. Despite this, very little effort has been made to apply and adapt our current mechanistic interpretability techniques to vision tasks or multimodal models. I believe it is important to check that mechanistic interpretability generalises to these systems in order to ensure they are future-proof and can be applied to safeguard against AGI. In this post, I restrict the scope of my research to specifically investigating SAEs trained on multimodal models. The particular multimodal system I investigate is CLIP, a model trained on image-text pairs.
CLIP consists of two encoders: a language model and a vision model that are trained to e...

learning vision italy italian speech leap transformers ea clip sora anthropic agi multimodal sae mlp sparse rationalist imagenet saes interpretability lesswrong

Fei-Fei Li: The “Godmother of AI”, Keeping Humanity at the Heart of the AI Revolution | E285

YAP - Young and Profiting

Play Episode Listen Later Apr 22, 2024 55:28

At 15, Fei-Fei Li transitioned from a middle-class life in China to poverty in America. Despite the pressures of her family's financial situation and her mother's ailing health, her knack for physics never wavered. She went from learning English as a second language to attending and working at prestigious institutions like Princeton and Stanford. Today, she is among a handful of scientists behind the impressive advances of artificial intelligence in recent times. In this episode, she breaks down her human-centered approach to AI and explores the future of the technology. Dr. Fei-Fei Li is a professor of Computer Science at Stanford University and the co-director of the Stanford Institute for Human-Centered AI. She is the creator of ImageNet, a key driver of modern artificial intelligence. With over 20 years at the forefront of the field, Dr. Li is focused on AI research, education, and policy to improve the human condition. In this episode, Hala and Fei-Fei will discuss: - The current capabilities of AI - The difference between machine learning and AI - The training process for AI models - The gaps in our knowledge about how AI learns - Why ChatGPT fails at higher-level reasoning like math - The biological inspiration for vision in computers - Fears and hopes associated with AI - The human element of jobs AI can't replace - Augmentation of human capabilities through AI - The three pillars of her human-centered AI framework - Responsible development and use of AI - The roadblocks to be aware of when using AI - Her advice to young entrepreneurs navigating the AI world - And other topics… Dr. Fei-Fei Li is a professor of Computer Science at Stanford University and the co-director of the Stanford Institute for Human-Centered AI. She is also the creator of ImageNet and the ImageNet Challenge, a key catalyst to the latest developments in deep learning and AI. Sometimes called the ‘Godmother of AI,' she is a pioneer in early computer vision research. Dr. 
Li is the author of The Worlds I See, one of Barack Obama's recommended books on AI. Her work has been featured in various publications, including the New York Times, Wall Street Journal, Fortune Magazine, Science, and Wired Magazine. Connect with Fei-Fei: Fei-Fei's Bio: https://profiles.stanford.edu/fei-fei-li Fei-Fei's LinkedIn: https://www.linkedin.com/in/fei-fei-li-4541247/ Fei-Fei's Twitter: https://twitter.com/drfeifei Resources Mentioned: Fei-Fei's Book, The Worlds I See: Curiosity, Exploration, and Discovery at the Dawn of AI: https://www.amazon.com/Worlds-See-Curiosity-Exploration-Discovery-ebook/dp/B0BPQSLVL6 Stanford Human-Centered AI Institute Website: https://hai.stanford.edu/ LinkedIn Secrets Masterclass, Have Job Security For Life: Use code 'podcast' for 30% off at yapmedia.io/course. Sponsored By: Shopify - Sign up for a one-dollar-per-month trial period at youngandprofiting.co/shopify Indeed - Get a $75 job credit at indeed.com/profiting Yahoo Finance - For comprehensive financial news and analysis, visit YahooFinance.com More About Young and Profiting Download Transcripts - youngandprofiting.com Get Sponsorship Deals - youngandprofiting.com/sponsorships Leave a Review - ratethispodcast.com/yap Watch Videos - youtube.com/c/YoungandProfiting Follow Hala Taha LinkedIn - linkedin.com/in/htaha/ Instagram - instagram.com/yapwithhala/ TikTok - tiktok.com/@yapwithhala Twitter - twitter.com/yapwithhala Learn more about YAP Media's Services - yapmedia.io/

Teaching Computers to See

What's Your Problem?

Play Episode Listen Later Jan 25, 2024 28:33 Transcription Available

Fei-Fei Li is a Stanford computer scientist and the former chief scientist of artificial intelligence/machine learning at Google Cloud. When Li entered the field of AI in the 2000s, researchers were making slow progress, optimizing algorithms to incrementally improve outcomes. Li saw that the problem wasn't the algorithm, but the size of the datasets being used. So she built a massive database of images called ImageNet. It was a huge breakthrough, and helped lead to the emergence of modern AI. See omnystudio.com/listener for privacy information.

ai teaching stanford computers li google cloud fei fei li imagenet

NeurIPS 2023 Recap — Best Papers

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Dec 23, 2023 200:26

We are running an end of year listener survey! Please let us know any feedback you have, what episodes resonated with you, and guest requests for 2024! Survey link here. NeurIPS 2023 took place from Dec 10–16 in New Orleans. The Latent Space crew was onsite for as many of the talks and workshops as we could attend (and more importantly, hosted cocktails and parties after hours)! Picking from the 3586 papers accepted to the conference (available online, full schedule here) is an impossible task, but we did our best to present an audio guide with brief commentary on each. We also recommend MLContests.com NeurIPS recap and Seb Ruder's NeurIPS primer. We also found the VizHub guide useful for a t-SNE clustering of papers. We'll start with the NeurIPS Best Paper Awards, and then go to a selection of non-awarded but highly influential papers, and then arbitrary personal picks to round out the selection. Where we were able to do a poster session interview, please scroll to the relevant show notes for images of their poster for discussion. We give Chris Ré the last word due to the Mamba and StripedHyena state space models drawing particular excitement but still being too early to assess impact. 
Timestamps
* [0:01:19] Word2Vec (Jeff Dean, Greg Corrado)
* [0:15:28] Emergence Mirage (Rylan Schaeffer)
* [0:28:48] DPO (Rafael Rafailov)
* [0:41:36] DPO Poster Session (Archit Sharma)
* [0:52:03] Datablations (Niklas Muennighoff)
* [1:00:50] QLoRA (Tim Dettmers)
* [1:12:23] DataComp (Samir Gadre)
* [1:25:38] DataComp Poster Session (Samir Gadre, Alex Dimakis)
* [1:35:25] LLaVA (Haotian Liu)
* [1:47:21] LLaVA Poster Session (Haotian Liu)
* [1:59:19] Tree of Thought (Shunyu Yao)
* [2:11:27] Tree of Thought Poster Session (Shunyu Yao)
* [2:20:09] Toolformer (Jane Dwivedi-Yu)
* [2:32:26] Voyager (Guanzhi Wang)
* [2:45:14] CogEval (Ida Momennejad)
* [2:59:41] State Space Models (Chris Ré)

Papers covered
* Distributed Representations of Words and Phrases and their Compositionality (Word2Vec) Tomas Mikolov · Ilya Sutskever · Kai Chen · Greg Corrado · Jeff Dean. The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several improvements that make the Skip-gram model more expressive and enable it to learn higher quality vectors more rapidly. We show that by subsampling frequent words we obtain significant speedup, and also learn higher quality representations as measured by our tasks. We also introduce Negative Sampling, a simplified variant of Noise Contrastive Estimation (NCE) that learns more accurate vectors for frequent words compared to the hierarchical softmax. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". 
Motivated by this example, we present a simple and efficient method for finding phrases, and show that their vector representations can be accurately learned by the Skip-gram model.* Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al.). Emergent abilities are abilities that are present in large-scale models but not in smaller models and are hard to predict. Rather than being a product of models' scaling behavior, this paper argues that emergent abilities are mainly an artifact of the choice of metric used to evaluate them. Specifically, nonlinear and discontinuous metrics can lead to sharp and unpredictable changes in model performance. Indeed, the authors find that when accuracy is changed to a continuous metric for arithmetic tasks where emergent behavior was previously observed, performance improves smoothly instead. So while emergent abilities may still exist, they should be properly controlled and researchers should consider how the chosen metric interacts with the model.* Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al.)* While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. 
* In this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning. * Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds RLHF's ability to control sentiment of generations and improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.* Scaling Data-Constrained Language Models (Muennighoff et al.)* The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. 
Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations. * QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al.). * This paper proposes QLoRA, a more memory-efficient (but slower) version of LoRA that uses several optimization tricks to save memory. They train a new model, Guanaco, that is fine-tuned only on a single GPU for 24h and outperforms previous models on the Vicuna benchmark. Overall, QLoRA enables using much less GPU memory for fine-tuning LLMs. Concurrently, other methods such as 4-bit LoRA quantization have been developed that achieve similar results. * DataComp: In search of the next generation of multimodal datasets (Gadre et al.) * Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. * Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. 
Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai. * Visual Instruction Tuning (Liu et al) * Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. * By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. * Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available. * Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al) * Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. 
* To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. * ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices.* Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. * Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.* Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al)* LMs exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller specialized models excel. * In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. * We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. * This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar. 
* Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities. * Voyager: An Open-Ended Embodied Agent with Large Language Models (Wang et al) * We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: * 1) an automatic curriculum that maximizes exploration, * 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and * 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. * Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting. Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. Voyager discovers new Minecraft items and skills continually by self-driven exploration, significantly outperforming the baselines. * Evaluating Cognitive Maps and Planning in Large Language Models with CogEval (Momennejad et al) * Recently an influx of studies claims emergent cognitive abilities in large language models (LLMs). 
Yet, most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions. * First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in LLMs. The CogEval protocol can be followed for the evaluation of various abilities. * Second, here we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which offer both established construct validity for evaluating planning, and are absent from LLM training sets. * We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes in planning tasks, including hallucinations of invalid trajectories and falling in loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This could be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on the underlying structure. Implications for application and future directions are discussed. * Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu, Tri Dao) * Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. 
Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. * First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. * Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). * Mamba enjoys fast inference (5x higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-1.4B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.* Get full access to Latent Space at www.latent.space/subscribe
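The DPO entry in the paper list above describes replacing the RLHF pipeline with a single classification-style objective on preference pairs. As a rough illustration only, the per-pair loss can be sketched in plain Python. The function names, the scalar log-probability inputs, and the default `beta=0.1` are assumptions made for this sketch and are not taken from the show notes; a real training loop would compute these log-probabilities from a policy model and a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    beta is an assumed value controlling how far the policy may drift.
    """
    # How much more (in log space) the policy likes each response
    # than the reference model does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2; it falls below that as the policy raises the preferred response relative to the rejected one, which is the binary-classification view the abstract refers to.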

game canada planning language new orleans code tree air picking skip implications survey models minecraft openai transformers motivated chain evaluation api gpt voyager existing papers notably creative writing phrases tot apis mirage transformer secretly clip llm gpu mamba 4b large language models emergent air canada lms lm schaeffer sota concurrently stable diffusion multimodal google bard mlp dpo cohere sne ssm chris r imagenet neurips extrapolating empirically rlhf guanaco vicuna latent space ssms

Fei-Fei Li: Exploring the AI Revolution

Commonwealth Club of California Podcast

Play Episode Listen Later Dec 7, 2023 66:44

Where did AI come from? Who created it, why, and where can it lead? Artificial intelligence (AI) is rapidly developing into a world-changer, affecting every industry and being used by hundreds of millions of people—even when they're unaware they're interacting with an artificial intelligence. And we're only at the early stages of AI's growth. Join us for an in-depth talk with Dr. Fei-Fei Li, whom Wired called "one of a tiny group of scientists―a group perhaps small enough to fit around a kitchen table―who are responsible for AI's recent remarkable advances." Dr. Li came to America as an immigrant, enduring a shift from Chinese middle class to American poverty. But a tough upbringing did not stop her from becoming a leading mind in the next big technological development. Fei-Fei's adolescent knack for physics endured and positioned her to make a crucial contribution to the breakthrough we now call AI, placing her at the center of a global transformation. Known as the creator of ImageNet, a key catalyst of modern artificial intelligence, Dr. Li has spent more than two decades at the forefront of the field. Her work has brought her face-to-face with the extraordinary possibilities―and the extraordinary dangers―of the technology she loves. Don't miss this opportunity to learn more about a breakthrough science and one of the breakthrough scientists who is making it happen. This program is part of our Good Lit series, underwritten by the Bernard Osher Foundation. Learn more about your ad choices. Visit megaphone.fm/adchoices

america american ai chinese revolution artificial wired li ai revolution fei fei li imagenet fei fei

Dr. Fei-Fei Li, Artificial Intelligence Pioneer, on Creating Human-Centered AI

KQED’s Forum

Play Episode Listen Later Dec 4, 2023 55:49

Dr. Fei-Fei Li is a literal visionary. Her groundbreaking work on ImageNet, a vast visual recognition database, helped propel artificial intelligence at a critical moment. As one of the key innovators and thinkers in AI, Li has argued for a human-centered artificial intelligence that augments people's capabilities instead of displacing them. We talk to Li about her work, her vision for AI and her new memoir, The Worlds I See, in which she recounts her journey as a scientist and immigrant, and how those two roles inform each other. Guests: Fei-Fei Li, professor of Computer Science Department, Stanford University; author, "The Worlds I See: Curiosity, Exploration, and Discovery at the Dawn of AI"

ai artificial intelligence discovery stanford university exploration pioneer li human centered fei fei li computer science department imagenet

Babbage: Fei-Fei Li on how to really think about the future of AI

Economist Podcasts

Play Episode Listen Later Nov 22, 2023 38:58

A year ago, the public launch of ChatGPT took the world by storm and it was followed by many more generative artificial intelligence tools, all with remarkable, human-like abilities. Fears over the existential risks posed by AI have dominated the global conversation around the technology ever since. Fei-Fei Li, a pioneer who helped lay the groundwork that underpins modern generative AI models, takes a more nuanced approach. She's pushing for a human-centred way of dealing with AI—treating it as a tool to help enhance—and not replace—humanity, while focussing on the pressing challenges of disinformation, bias and job disruption. Fei-Fei Li is the founding co-director of Stanford University's Institute for Human-Centred Artificial Intelligence. Fei-Fei and her research group created ImageNet, a huge database of images that enabled computer scientists to build algorithms that were able to see and recognise objects in the real world. That endeavour also introduced the world to deep learning, a type of machine learning that is a fundamental part of how large-language and image-creation models work. Host: Alok Jha, The Economist's science and technology editor. Sign up for a free trial of Economist Podcasts+. If you're already a subscriber to The Economist, you'll have full access to all our shows as part of your subscription. For more information about how to access Economist Podcasts+, please visit our FAQs page or watch our video explaining how to link your account. Hosted on Acast. See acast.com/privacy for more information.

fear ai institute chatgpt acast stanford university economists faqs future of ai babbage fei fei li imagenet fei fei

Babbage: Fei-Fei Li on how to really think about the future of AI

Babbage from Economist Radio

Play Episode Listen Later Nov 22, 2023 38:58

A year ago, the public launch of ChatGPT took the world by storm and it was followed by many more generative artificial intelligence tools, all with remarkable, human-like abilities. Fears over the existential risks posed by AI have dominated the global conversation around the technology ever since. Fei-Fei Li, a pioneer who helped lay the groundwork that underpins modern generative AI models, takes a more nuanced approach. She's pushing for a human-centred way of dealing with AI—treating it as a tool to help enhance—and not replace—humanity, while focussing on the pressing challenges of disinformation, bias and job disruption. Fei-Fei Li is the founding co-director of Stanford University's Institute for Human-Centred Artificial Intelligence. Fei-Fei and her research group created ImageNet, a huge database of images that enabled computer scientists to build algorithms that were able to see and recognise objects in the real world. That endeavour also introduced the world to deep learning, a type of machine learning that is a fundamental part of how large-language and image-creation models work. Host: Alok Jha, The Economist's science and technology editor. Sign up for a free trial of Economist Podcasts+. If you're already a subscriber to The Economist, you'll have full access to all our shows as part of your subscription. For more information about how to access Economist Podcasts+, please visit our FAQs page or watch our video explaining how to link your account. Hosted on Acast. See acast.com/privacy for more information.

fear ai institute chatgpt acast stanford university economists faqs future of ai babbage fei fei li imagenet fei fei

The Worlds She Sees with Godmother of AI, Fei-Fei Li

a16z

Play Episode Listen Later Nov 13, 2023 28:12

Fei-Fei Li, PhD, Professor in the Computer Science Department at Stanford University, and Co-Director of Stanford's Human-Centered AI Institute, joins Bio + Health founding partner Vijay Pande. In this candid conversation, Li unfolds her transformation from a young immigrant to an influential figure in AI. The conversation explores the birth of ImageNet, a pivotal step that bridged the gap between visual intelligence and accessible AI technology. They delve into the notion of a 'Dignity Economy,' hinting at a future where technology serves to elevate human experience rather than undermine it. Li also touches on the delicate balance between relentless innovation and life's humble pursuits. This episode peels back the layers on the human side of AI, offering a rare glimpse into the personal and professional realms of a pioneer shaping the AI landscape. Check out her new book, The Worlds I See, here: https://us.macmillan.com/books/9781250897930/theworldsisee Check out other episodes from our sister podcast, Bio Eats World: https://a16z.com/podcasts/bio-eats-world/ Stay Updated: Find a16z on Twitter: https://twitter.com/a16z Find a16z on LinkedIn: https://www.linkedin.com/company/a16z Subscribe on your favorite podcast app: https://a16z.simplecast.com/ Follow our host: https://twitter.com/stephsmithio Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

ai phd professor artificial intelligence silicon valley stanford worlds stanford university sees li universities higher education co director godmothers fei fei li computer science department imagenet vijay pande

AI pioneer Fei-Fei Li on the future of humanity

GeekWire

Play Episode Listen Later Nov 11, 2023 35:09

Fei-Fei Li's new book is the story of her journey from China to the U.S., from small business to Big Tech, and from academic research to corporate life, and back again. But more than that, it's the story of the dawn of artificial intelligence, as told through her experience as one of the people summoning this new day and standing there awestruck, excited and concerned about what it will mean for humanity. Dr. Li joins us on this episode to discuss the book, The Worlds I See: Curiosity, Exploration, and Discovery at the Dawn of AI, published by Moment of Lift Books, an imprint from Melinda French Gates and Flatiron Books. Known for her foundational contributions to AI and computer vision, Dr. Li is the inventor of ImageNet, a large-scale dataset of images that enabled rapid advances in deep learning for visual recognition. She is a professor of computer science at Stanford University and a co-director of the Stanford Institute for Human-Centered Artificial Intelligence, who worked as Google Cloud's chief scientist for AI/ML during a 2017-2018 sabbatical. Note: GeekWire's Todd Bishop will be speaking further with Dr. Li on Monday evening Nov. 13 at Town Hall in Seattle. See this site for details and tickets. Edited by Curt Milton. See omnystudio.com/listener for privacy information.


The Worlds She Sees with Fei-Fei Li

Bio Eats World

Nov 7, 2023 (30:15)


Beating GPT-4 with Open Source LLMs — with Michael Royzen of Phind

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Nov 3, 2023 (67:21)

At the AI Pioneers Summit we announced Latent Space Launchpad, an AI-focused accelerator in partnership with Decibel. If you're an AI founder or enterprise early adopter, fill out this form and we'll be in touch with more details. We also have a lot of events coming up as we wrap up the year, so make sure to check out our community events page and come say hi!

We previously interviewed the founders of many developer productivity startups embedded in the IDE, like Codium AI, Cursor, and Codeium. We also covered Replit's (former) SOTA model, replit-code-v1-3b, and most recently had Amjad and Michele announce replit-code-v1_5-3b at the AI Engineer Summit.

Much has been speculated about the StackOverflow traffic drop since the ChatGPT release, but the experience is still not perfect. There's now a new player in the “search for developers” arena: Phind.

Phind's goal is to help you find answers to your technical questions, and then help you implement them. For example “What should I use to create a frontend for a Python script?” returns a list of frameworks as well as links to the sources. You can then ask follow-up questions on specific implementation details, have it write some code for you, etc. They have both a web version and a VS Code integration.

They recently were top of Hacker News with the announcement of their latest model, which is now the #1 rated model on the BigCode Leaderboard, beating their previous version.

TLDR Cheat Sheet:

* Based on CodeLlama-34B, which is trained on 500B tokens
* Further fine-tuned on 70B+ high quality code and reasoning tokens
* Expanded context window to 16k tokens
* 5x faster than GPT-4 (100 tok/s vs 20 tok/s on single stream)
* 74.7% HumanEval vs 45% for the base model

We've talked before about HumanEval being limited in a lot of cases and how it needs to be complemented with “vibe based” evals. Phind thinks of evals along two axes:

* Context quality: when asking the model to generate code, was the context high quality? Did we put outdated examples in it? Did we retrieve the wrong files?
* Result quality: was the code generated correct? Did it follow the instructions I gave it or did it misunderstand some of it?

If you have bad results with bad context, you might get to a good result by working on better RAG. If you have good context and a bad result, you might either need to work on your prompting or you have hit the limits of the model, which leads you to fine tuning (like they did).

Michael was really early to this space and started working on CommonCrawl filtering and indexing back in 2020, which led to a lot of the insights that now power Phind. We talked about that evolution, his experience at YC, how he got Paul Graham to invest in Phind and invite him to dinner at his house, and how Ron Conway connected him with Jensen Huang to get access to more GPUs!

Show Notes

* Phind
* BigScience T0
* InstructGPT Paper
* Inception-V3
* LMQL
* Marginalia Nu
* Mistral AI
* People:
  * Paul Graham (pg)
  * Ron Conway
  * Yacine Jernite from HuggingFace
  * Jeff Delaney

Timestamps

* [00:00:00] Intros & Michael's early interest in computer vision
* [00:03:14] Pivoting to NLP and natural language question answering models
* [00:07:20] Building a search engine index of Common Crawl and web pages
* [00:11:26] Releasing the first version of Hello based on the search index and BigScience T0 model
* [00:14:02] Deciding to focus the search engine specifically for programmers
* [00:17:39] Overview of Phind's current product and focus on code reasoning
* [00:21:51] The future vision for Phind to go from idea to complete code
* [00:24:03] Transitioning to using the GPT-4 model and the impact it had
* [00:29:43] Developing the Phind model based on CodeLlama and additional training
* [00:32:28] Plans to continue improving the Phind model with open source technologies
* [00:43:59] The story of meeting Paul Graham and Ron Conway and how that impacted the company
* [00:53:02] How Ron Conway helped them get GPUs from Nvidia
* [00:57:12] Tips on how Michael learns complex AI topics
* [01:01:12] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:19]

Swyx: Hey, and today we have in the studio Michael Royzen from Phind. Welcome. [00:00:23]

Michael: Thank you so much. [00:00:24]

Alessio: It's great to be here. [00:00:25]

Swyx: Yeah, we are recording this in a surprisingly hot October in San Francisco. And sometimes the studio works, but the Blue Angels are flying by right now, so sorry about the noise. So welcome. I've seen Phind blow up this year, mostly, I think since your launch in Feb and V2 and then your Hacker News posts. We tend to like to introduce our guests, but then obviously you can fill in the blanks with the origin story. You actually were a high school entrepreneur. You started SmartLens, which is a computer vision startup in 2017. [00:00:59]

Michael: That's right. I remember when like TensorFlow came out and people started talking about, obviously at the time after AlexNet, the deep learning revolution was already in flow. Good computer vision models were a thing. And what really made me interested in deep learning was I got invited to go to Apple's WWDC conference as a student scholar because I was really into making iOS apps at the time. So I go there and I go to this talk where they added an API that let people run computer vision models on the device using far more efficient GPU primitives. After seeing that, I was like, oh, this is cool. This is going to have a big explosion of different computer vision models running locally on the iPhone. And so I had this crazy idea where it was like, what if I could just make this model that could recognize just about anything and have it run on the device? And that was the genesis for what eventually became SmartLens. I took this data set called ImageNet 22K.
So most people, when they think of ImageNet, think of ImageNet 1K. But the full ImageNet actually has, I think, 22,000 different categories. So I took that, filtered it, pre-processed it, and then did a massive fine tune on Inception V3, which was, I think, the state of the art deep convolutional computer vision model at the time. And to my surprise, it actually worked insanely well. I had no idea what would happen if I gave a single model that many categories. I think it ended up being approximately 17,000 categories that I collapsed them into. It worked so well that it actually worked better than Google Lens, which released its V1 around the same time. And on top of this, the model ran on the device. So it didn't need an internet connection. A big part of the issue with Google Lens at the time was that connections were slower. 4G was around, but it wasn't nearly as fast. So there was a noticeable lag having to upload an image to a server and get it back. But just processing it locally, even on the iPhones of the day in 2017, was much faster. It was a cool little project. It got some traction. TechCrunch wrote about it. There was kind of like one big spike in usage, and then over time it tapered off. But people still pay for it, which is wild. [00:03:14]

Swyx: That's awesome. Oh, it's like a monthly or annual subscription? [00:03:16]

Michael: Yeah, it's like a monthly subscription. [00:03:18]

Swyx: Even though you don't actually have any servers? [00:03:19]

Michael: Even though we don't have any servers. That's right. I was in high school. I had a little bit of money. I was like, yeah. [00:03:25]

Swyx: That's awesome. I always wonder what the modern equivalents are, kind of like Be My Eyes. And it was actually disclosed in the GPT-4 Vision system card recently that the usage was surprisingly not that frequent. All three of us have our sense of sight, but I would think that if I lost my sense of sight, I would use Be My Eyes all the time.
The average usage of Be My Eyes per day is 1.5 times. [00:03:49]

Michael: Exactly. I was thinking about this as well, where I was also looking into image captioning, where you give a model an image and then it tells you what's in the image. But it turns out that what people want is the exact opposite. People want to give a description of an image and then have the AI generate the image. [00:04:04]

Alessio: Oh, the other way. [00:04:06]

Michael: Exactly. And so at the time, I think there were some GANs, NVIDIA was working on this back in 2019, 2020. They had some impressive, I think, face GANs where they had this model that would produce these really high quality portraits, but it wasn't able to take a natural language description the way Midjourney or DALL-E 3 can and just generate you an image with exactly what you described in it. [00:04:32]

Swyx: And how did that get you into NLP? [00:04:35]

Michael: Yeah, I released the SmartLens app and that was around the time I was a senior in high school. I was applying to college. College rolls around. I'm still sort of working on updating the app in college. But I start thinking like, hey, what if I make an enterprise version of this as well? At the time, there was Clarifai that provided some computer vision APIs, but I thought this massive classification model works so well and it's so small and so fast, might as well build an enterprise product. And I didn't even talk to users or do any of those things that you're supposed to do. I was just mainly interested in building a type of backend I'd never built before. So I was mainly just doing it for myself just to learn. I built this enterprise classification product and as part of it, I'm also building an invoice processing product where, using some of the aspects that I built previously, although obviously it's very different from classification, I wanted to be able to just extract a bunch of structured data from an unstructured invoice through our API.
And that's what led me to Hugging Face for the first time because that involves some natural language components. And so I go to Hugging Face and with various encoder models that were around at the time, I used the standard BERT and also Longformer, which came out around the same time. And Longformer was interesting because it had a much bigger context window than those models at the time, like BERT, all of the first gen encoder only models, they only had a context window of 512 tokens and it's fixed. There's none of this ALiBi or RoPE that we have now where we can basically massage it to be longer. They're fixed, 512 absolute encodings. Longformer at the time was the only way that you can fit, say, like a sequence length or ask a question about like 4,000 tokens worth of text. Implemented Longformer, it worked super well, but like nobody really kind of used the enterprise product and that's kind of what I expected because at the end of the day, it was COVID. I was building this kind of mostly for me, mostly just kind of to learn. And so nobody really used it and my heart wasn't in it and I kind of just shelved it. But a little later, I went back to Hugging Face and I saw this demo that they had, and this is in the summer of 2020. They had this demo made by this researcher, Yacine Jernite, and he called it long form question answering. And basically, it was this self-contained notebook demo where you can ask a question the way that we do now with ChatGPT. It would do a lookup into some database and it would give you an answer. And it absolutely blew my mind. The demo itself, it used, I think, BART as the model and in the notebook, it had support for both an Elasticsearch index of Wikipedia, as well as a dense index powered by Facebook's FAISS. I think that's how you pronounce it. It was very iffy, but when it worked, I think the question in the demo was, why are all boats white?
When it worked, it blew my mind that instead of doing this few shot thing, like people were doing with GPT-3 at the time, which was all the rage, you could just ask a model a question, provide no extra context, and it would know what to do and just give you the answer. It blew my mind to such an extent that I couldn't stop thinking about that. When I started thinking about ways to make it better, I tried training, doing the fine tune with a larger BART model. And this BART model, yeah, it was fine tuned on this Reddit data set called ELI5. So basically... [00:08:02]

Alessio: Subreddit. [00:08:03]

Swyx: Yeah, subreddit. [00:08:04]

Alessio: Yeah. [00:08:05]

Michael: And put it into like a well-formatted, relatively clean data set of like human questions and human answers. And that was a really great bootstrap for that model to be able to answer these types of questions. And so ELI5 actually turned out to be a good data set for training these types of question answering models, because the question is written by a human, the answer is written by a human, and at least helps the model get the format right, even if the model is still very small and it can't really think super well, at least it gets the format right. And so it ends up acting as kind of a glorified summarization model, where if it's fed in high quality context from the retrieval system, it's able to have a reasonably high quality output. And so once I made the model as big as I can, just fine tuning on BART large, I started looking for ways to improve the index. So in the demo, in the notebook, there were instructions for how to make an Elasticsearch index just for Wikipedia. And I was like, why not do all of Common Crawl? So I downloaded Common Crawl, and thankfully, I had like $10,000 or $15,000 worth of AWS credits left over from the SmartLens project. And that's what really allowed me to do this, because there was no other funding.
I was still in college, not a lot of money, and so I was able to spin up a bunch of instances and just process all of Common Crawl, which is massive. So it's roughly like, it's terabytes of text. I went to Alexa to get the top 1,000 websites or 10,000 websites in the world, then filtered only by those websites, and then indexed those websites, because the web pages were already included in the dump. [00:09:38]

Swyx: You mean to supplement Common Crawl or to filter Common Crawl? [00:09:41]

Michael: Filter Common Crawl. [00:09:42]

Alessio: Oh, okay. [00:09:43]

Michael: Yeah, sorry. So we filtered Common Crawl just by the top, I think, 10,000, just to limit this, because obviously there's this massive long tail of small sites that are really cool, actually. There's other projects like, shout out to Marginalia Nu, which is a search engine specialized on the long tail. I think they actually exclude the top 10,000. [00:10:03]

Swyx: That's what they do. [00:10:04]

Alessio: Yeah. [00:10:05]

Swyx: I've seen them around, I just don't really know what their pitch is. Okay, that makes sense. [00:10:08]

Michael: So they exclude all the top stuff. So the long tail is cool, but for this, that was kind of out of the question, and that was most of the data anyway. So we've removed that. And then I indexed the remaining approximately 350 million webpages through Elasticsearch. So I built this index running on AWS with these webpages, and it actually worked quite well. You can ask it general common knowledge, history, politics, current events, questions, and it would be able to do a fast lookup in the index, feed it into the model, and it would give a surprisingly good result. And so when I saw that, I thought that this is definitely doable. And it kind of shocked me that no one else was doing this. And so this was now the fall of 2020. And yeah, I was kind of shocked no one was doing this, but it costs a lot of money to keep it up. I was still in college. There are things going on.
I got bogged down by classes. And so I ended up shelving this for almost a full year, actually. I returned to it in fall of 2021, when BigScience released the T0 models. That was a massive jump in the reasoning ability of the model. And it was better at reasoning, it was better at summarization, it was still a glorified summarizer basically. [00:11:26]

Swyx: Was this a precursor to Bloom? Because Bloom's the one that I know. [00:11:29]

Alessio: Yeah. [00:11:30]

Michael: Bloom actually came out in 2022. But Bloom had other problems where for whatever reason, the Bloom models just were never really that good, which is so sad because I really wanted to use them. But I think they didn't train on that much data. I think they used like the original, they were trying to replicate GPT-3. So they just used those numbers, which we now know are like far below Chinchilla-optimal and even Chinchilla-optimal, which we can like talk about later, like what we're currently doing with MIMO goes, yeah, it goes way beyond that. But they weren't training on enough data. I'm not sure how clean that data was, but it probably wasn't super clean. And then they didn't really do any fine tuning until much later. So T0 worked well because they took the T5 models, which were closer to Chinchilla-optimal because I think they were trained on also like 300 something billion tokens, similar to GPT-3, but the models were much smaller. I think T0 is the first model that did large scale instruction tuning from diverse data sources in the fall of 2021. This is before InstructGPT. This is before Flan-T5, which came out in 2022. This is the very, very first, at least well-known example of that. And so it came out and then I did, on top of T0, I also did the Reddit ELI5 fine tune. And that was the first model and system that actually worked well enough to where I didn't get discouraged like I did previously, because the failure cases of the BART based system were so egregious.
Sometimes it would just miss a question so horribly that it was just extremely discouraging. But for the first time, it was working reasonably well. Also using a much bigger model. I think the BART model is like 800 million parameters, but T0, we were using 3B. So it was T0 3B, a bigger model. And that was the very first iteration of Hello. So I ended up doing a Show HN on Hacker News in January 2022 of that system. Our fine-tuned T0 model connected to our Elasticsearch index of those 350 million pages from the top 10,000 Common Crawl websites. And to the best of my knowledge, I think that's the first example that I'm aware of an LLM search engine model that's effectively connected to like a large enough index that I consider like an internet scale. So I think we were the first to release like an internet scale LLM-powered RAG search system. In January 2022, around that time, me and my future co-founder, Justin, we were like, this seems like the future. [00:14:02]

Alessio: This is really cool. [00:14:03]

Michael: I couldn't really sleep even like I was going to bed and I was like, I was thinking about it. Like I would stay up until like 2:30 AM, like reading papers on my phone in bed, go to sleep, wake up the next morning at like eight and just be super excited to keep working. And I was also doing my thesis at the same time, my senior honors thesis at UT Austin about something very similar. We were researching factuality in abstractive question answering systems. So a lot of overlap with this project and the conclusions of my research actually kind of helped guide the development path of Hello. In the research, we found that LLMs, they don't know what they don't know. So the conclusion was that you always have to do a search to ensure that the model actually knows what it's talking about. And my favorite example of this even today is kind of with ChatGPT browsing, where you can ask ChatGPT browsing, how do I run llama.cpp?
And ChatGPT browsing will think that llama.cpp is some file on your computer that you can just compile with GCC and you're all good. It won't even bother doing a lookup, even though I'm sure somewhere in their internal prompts they have something like, if you're not sure, do a lookup. [00:15:13]

Alessio: That's not good enough. So models don't know what they don't know. [00:15:15]

Michael: You always have to do a search. And so we approached LLM powered question answering from the search angle. We pivoted to make this for programmers in June of 2022, around the time that we were getting into YC. We realized that what we're really interested in is the case where the models actually have to think. Because up until then, the models were kind of more glorified summarization models. We really thought of them like the Google featured snippets, but on steroids. And so we saw a future where the simpler questions would get commoditized. And I still think that's going to happen with like Google SGE and like it's nowadays, it's really not that hard to answer the more basic kind of like summarization, like current events questions with lightweight models that'll only continue to get cheaper over time. And so we kind of started thinking about this trade off where LLMs are going to get both better and cheaper over time. And that's going to force people who run them to make a choice. Either you can run a model of the same intelligence that you could previously for cheaper, or you can run a better model for the same price. So someone like Google, once the price kind of falls low enough, they're going to deploy and they're already doing this with SGE, they're going to deploy a relatively basic glorified summarizer model that can answer very basic questions about like current events, who won the Super Bowl, like, you know, what's going on on Capitol Hill, like those types of things.
The flip side of that is like more complex questions where like you have to reason and you have to solve problems and like debug code. And we realized like we're much more interested in kind of going along the bleeding edge of that frontier case. And so we've optimized everything that we do for that. And that's a big reason of why we've built Phind specifically for programmers, as opposed to saying like, you know, we're kind of a search engine for everyone because as these models get more capable, we're very interested in seeing kind of what the emergent properties are in terms of reasoning, in terms of being able to solve complex multi-step problems. And I think that some of those emerging capabilities like we're starting to see, but we don't even fully understand. So I think there's always an opportunity for us to become more general if we wanted, but we've been along this path of like, what is the best, most advanced reasoning engine that's connected to your code base, that's connected to the internet that we can just provide. [00:17:39]

Alessio: What is Phind today, pragmatically, from a product perspective, how do people interact with it? Yeah. Or does it plug into your workflow? [00:17:46]

Michael: Yeah. [00:17:47]

Alessio: So Phind is really a system. [00:17:48]

Michael: Phind is a system for programmers when they have a question or when they're frustrated or when something's not working. [00:17:54]

Swyx: When they're frustrated. [00:17:55]

Alessio: Yeah. [00:17:56]

Michael: For them to get unblocked. I think like the single, the most abstract pitch for Phind is like, if you're experiencing really any kind of issue as a programmer, we'll solve that issue for you in 15 seconds as opposed to 15 minutes or longer. Phind has an interface on the web.
It has an interface in VS Code and more IDEs to come, but ultimately it's just a system where a developer can paste in a question or paste in code that's not working and Phind will do a search on the internet or it will find other code in your code base perhaps that's relevant. And then we'll find the context that it needs to answer your question and then feed it to a reasoning engine powerful enough to actually answer it. So that's really the philosophy behind Phind. It's a system for getting developers the answers that they're looking for. And so right now from a product perspective, this means that we're really all about getting the right context. So the VS Code extension that we launched recently is a big part of this because you can just ask a question and it knows where to find the right code context in your code. It can do an internet search as well. So it's up to date and it's not just reliant on what the model knows and it's able to figure out what it needs by itself and answer your question based on that. If it needs some help, you can also get yourself kind of just, there's opportunities for you yourself to put in all that context in. But the issue is also like not everyone wants to use VS Code. Some people like are real Neovim sticklers or they're using like PyCharm or other IDEs, JetBrains. And so for those people, they're actually like okay with switching tabs, at least for now, if it means them getting their answer. Because really like there's been an explosion of all these like startups doing code, doing search, etc. But really who everyone's competing with is ChatGPT, which only has like that one web interface. Like ChatGPT is really the bar. And so that's what we're up against. [00:19:50]

Alessio: And so your idea, you know, we had Aman from Cursor on the podcast and they've gone through the we need to own the IDE thing. Yours is more like in order to get the right answer, people are happy to like go somewhere else basically.
They're happy to get out of their IDE. [00:20:05]

Michael: That was a great podcast, by the way. But yeah, so part of it is that people sometimes perhaps aren't even in an IDE. So like the whole task of software engineering goes way beyond just running code, right? There's also like a design stage. There's a planning stage. A lot of this happens like on whiteboards. It happens in notebooks. And so the web part also exists for that where you're not even coding it and you're just trying to get like a more conceptual understanding of what you're trying to build first. The podcast with Aman was great, but somewhere where I disagree with him is that you need to own the IDE. I think like he made some good points about not having platform risk in the long term. But some of the features that were mentioned like suggesting diffs, for example, those are all doable with an extension. We haven't yet seen with VS Code in particular any functionality that we'd like to do yet in the IDE that we can't either do through directly supported VS Code functionality or something that we kind of hack into there, which we've also done a fair bit of. And so I think it remains to be seen where that goes. But I think what we're looking to be is like we're not trying to just be in an IDE or be an IDE. Like Phind is a system that goes beyond the IDE and like is really meant to cover the entire lifecycle of a developer's thought process in going about like, hey, like I have this idea and I want to get from that idea to a working product. And so then that's what the long term vision of Phind is really about is starting with that. In the future, I think programming is just going to be really just the problem solving. Like you come up with an idea, you come up with like the basic design for the algorithm in your head, and you just tell the AI, hey, just like just do it, just make it work. And that's what we're building towards.
[00:21:51]

Swyx: I think we might want to give people an impression about like the type of traffic that you have, because when you present it with a text box, you could type in anything. And I don't know if you have some mental categorization of like what are like the top three use cases that people tend to coalesce around. [00:22:08]

Alessio: Yeah, that's a great question. [00:22:09]

Michael: The two main types of searches that we see are how-to questions, like how to do X using Y tool. And this historically has been our bread and butter, because with our embeddings, like we're really, really good at just going over a bunch of developer documentation and figuring out exactly the part that's relevant and just telling you, OK, like you can use this method. But as LLMs have gotten better, and as we've really transitioned to using GPT-4 a lot in our product, people organically just started pasting in code that's not working and just said, fix it for me. [00:22:42]

Swyx: Fix this. [00:22:43]

Alessio: Yeah. [00:22:44]

Michael: And what really shocks us is that a lot of the people who do that, they're coming from ChatGPT. So they tried it in ChatGPT with GPT-4. It didn't work. Maybe it required like some multi-step reasoning. Maybe it required some internet context or something found in either a Stack Overflow post or some documentation to solve it. And so then they paste it into Phind and then Phind works. So those are really those two different cases. Like, how can I build this conceptually or like remind me of this one detail that I need to build this thing? Or just like, here's this code. Fix it. And so that's what a big part of our VS Code extension is, is like enabling a much smoother "here, just fix it for me" type of workflow. That's really its main benefit. Like it's in your code base. It's in the IDE. It knows how to find the relevant context to answer that question.
But at the end of the day, like I said previously, that's still a relatively, not to say it's a small part, but it's a limited part of the entire mental life cycle of a programmer. [00:23:47]

Swyx: Yep. So you launched in Feb and then you launched V2 in August. You had a couple other pretty impactful posts slash feature launches. The web search one was massive. So you were mostly a GPT-4 wrapper. We were for a long time. [00:24:03]

Michael: For a long time until recently. Yeah. [00:24:05]

Alessio: Until recently. [00:24:06]

Swyx: So like people coming over from ChatGPT were getting the same model with your version of web search. Would that be the primary value proposition? [00:24:13]

Michael: Basically yeah. And so what we've seen is that any model plus web search is just significantly better than [00:24:18]

Alessio: that model itself. Do you think that's what you got right in April? [00:24:21]

Swyx: Like so you got 1500 points on Hacker News in April, which is like, if you live on Hacker News a lot, that is unheard of for someone so early on in your journey. [00:24:31]

Alessio: Yeah. [00:24:32]

Michael: We're super, super grateful for that. Definitely was not expecting it. So what we've done with Hacker News is we've just kept launching. [00:24:38]

Alessio: Yeah. [00:24:39]

Michael: Like what they don't tell you is that you can just keep launching. That's what we've been doing. So we launched the very first version of Phind in its current incarnation after like the previous demo connected to our own index. Like once we got into YC, we scrapped our own index because it was too cumbersome at the time. So we moved over to using Bing as kind of just the raw source data. We launched as Hello Cognition. Over time, every time we like added some intelligence to the product, a better model, we just kept launching. And every additional time we launched, we got way more traffic. So we actually silently rebranded to Phind in late December of last year.
But like we didn't have that much traffic. Nobody really knew who we were. [00:25:18]

Swyx: How'd you pick the name out of it? [00:25:19]

Michael: Paul Graham actually picked it for us. [00:25:21]

Swyx: All right. [00:25:22]

Alessio: Tell the story. Yeah. So, oh boy. [00:25:25]

Michael: So this is the biggest side. Should we go for like the full Paul Graham story or just the name? [00:25:29]

Swyx: Do you want to do it now? Or do you want to do it later? I'll give you a choice. [00:25:32]

Alessio: Hmm. [00:25:33]

Michael: I think, okay, let's just start with the name for now and then we can do the full Paul Graham story later. But basically, Paul Graham, when we were lucky enough to meet him, he saw our name and our domain was at the time, sayhello.so and he's just like, guys, like, come on, like, what is this? You know? And we were like, yeah, but like when we bought it, you know, we were just kind of broke college students. Like we didn't have that much money. And like, we really liked hello as a name because it was the first like conversational search engine. And that's kind of, that's the angle that we were approaching it from. And so we had sayhello.so and he's like, there's so many problems with that. Like, like, like the say hello, like, what does that even mean? And like .so, like, it's gotta be like a .com. And so we spent some time just like with Paul Graham in the room. We just like looked at different domain names, like different things that like popped into our head. And one of the things that Paul Graham said was "find," with the Phind spelling in particular. [00:26:33]

Swyx: Yeah. Which is not typical naming advice, right? Yes. Because when people hear it, they don't spell it that way. [00:26:38]

Michael: Exactly. It's hard to spell. And also it's like very 90s. And so at first, like, we didn't like, I was like, like, ah, like, I don't know. But over time it kept growing on us. And eventually we're like, okay, we like the name.
It's owned by this elderly Canadian gentleman who we got to know, and he was willing to sell it to us. [00:26:57]Michael: And so we bought it and we changed the name. Yeah. [00:27:01]Swyx: Anyways, where were you? [00:27:02]Alessio: I had to ask. [00:27:03]Swyx: I mean, you know, everyone who looks at you is wondering. [00:27:06]Michael: And a lot of people actually pronounce it Phind, which, you know, by now it's part of the game. But eventually we want to buy Phind.com and then just have that redirect to Phind. So Phind is like definitely the right spelling. But like, we'll just, yeah, we'll have all the cases addressed. [00:27:23]Swyx: Cool. So Bing web search, and then August you launched V2. Is V2 the Phind as a system pitch? Or have you moved, evolved since then? [00:27:31]Michael: Yeah, so I don't, like the V2 moniker, like, I don't really think of it that way in my mind. There's like, there's the version we launched during, last summer during YC, which was the Bing version directed towards programmers. And that's kind of like, that's why I call it like the first incarnation of what we currently are. Because it was already directed towards programmers. We had like a code snippet search built in as well, because at the time, you know, the models we were using weren't good enough to generate code snippets. Even GPT, like the text DaVinci 2 was available at the time, wasn't that good at generating code and it would generate like very, very short, very incomplete code snippets. And so we launched that last summer, got some traction, but really like we were only doing like, I don't know, maybe like 10,000 searches a day. [00:28:15]Alessio: Some people knew about it. [00:28:16]Michael: Some people used it, which is impressive because looking back, the product like was not that good. 
And every time, we've made an improvement to the way that we retrieve context through better embeddings, more intelligent HTML parsers, and, importantly, better underlying models. Every major version after that was when we introduced a better underlying answering model. Like in February, we had to swallow a bit of our pride when we were like, okay, our own models aren't good enough. We have to go to OpenAI. And actually that did lead to kind of like our first decent bump of traffic in February. And people kept using it, like our retention was way better too. But we were still kind of running into problems of like more advanced reasoning. Some people tried it, but people were leaving because even like GPT-3.5, both turbo and non-turbo, was still not that great at doing like code related reasoning beyond the how do you do X, like documentation search type of use case. And so it was really only when GPT-4 came around in April that we were like, okay, like this is like our first real opportunity to really make this thing like the way that it should have been all along. And having GPT-4 as the brain is what led to that Hacker News post. And so what we did was we just let anyone use GPT-4 on Phind for free without a login, [00:29:43]Alessio: which I actually don't regret. [00:29:45]Michael: So it was very expensive, obviously. But like at that stage, all we needed to do was show like, we just needed to like show people here's what Phind can do. That was the main thing. And so that worked. That worked. [00:29:58]Alessio: Like we got a lot of users. [00:29:59]Michael: Do you know Fireship? [00:30:01]Swyx: Yeah. YouTube, Jeff Delaney. [00:30:03]Michael: Yeah. He made a short about Phind. [00:30:06]Alessio: Oh. [00:30:07]Michael: And that's on top of the Hacker News post. And that's what like really, really made it blow up. It got millions of views in days. And he's just funny.
Like what I love about Fireship is, like you guys, yeah, humor goes a long way towards like really grabbing people's attention. And so that blew up. [00:30:25]Swyx: Something I would be anxious about as a founder during that period, so obviously we all remember that pretty closely. So there were a couple of people who had access to the GPT-4 API doing this, which is unrestricted access to GPT-4. And I have to imagine OpenAI wasn't that happy about that because it was like kind of de facto access to GPT-4 before they released it. [00:30:46]Alessio: No, no. [00:30:47]Michael: GPT-4 was in ChatGPT from day one. I think. OpenAI actually came to our support because what happened was we had people building unofficial APIs around it to try to get free access to it. And I think OpenAI actually has the right perspective on this where they're like, OK, people can do whatever they want with the API if they're paying for it, like they can do whatever they want, but it's like not OK if, you know, paying customers are being exploited by these other actors. They actually got in touch with us and they helped us like set up better Cloudflare bot monitoring controls to effectively like crack down on those unofficial APIs, which we're very happy about. But yeah, so we launched GPT-4. A lot of people come to the product and yeah, for a long time, we're just figuring out like what do we make of this, right? How do we, A, make it better, but also deal with like our costs, which have just like massively, massively ballooned. Over time, it's become more clear with the release of Llama 2 and Llama 3 on the horizon that we will once again see a return to vertical applications running their own models. As was true last year and before, I think that GPT-4, my hypothesis is that the jump from 4 to 4.5 or 4 to 5 will be smaller than the jump from 3 to 4. And the reason why is because there were a lot of different things.
Like there was two plus, effectively two, two and a half years of research that went into going from 3 to 4. Like more data, bigger model, all of the instruction tuning techniques, RLHF, all of that is known. And like Meta, for example, and now there's all these other startups like Mistral too, like there's a bunch of very well-funded open source players that are now working on just like taking the recipe that's now known and scaling it up. So I think that even if a delta exists in 2024, the delta between proprietary and open source will be small enough that a startup like us with a lot of data that we've collected can take the data that we have, fine tune an open source model, and like be able to have it be better than whatever the proprietary model is at the time. That's my hypothesis. Michael: But we'll once again see a return to these verticalized models. And that's something that we're super excited about because, yeah, that brings us to kind of the Phind model because the plan from kind of the start was to be able to return to that if that makes sense. And I think now we're definitely at a point where it does make sense because we have requests from users who like, they want longer context in the model, basically, like they want to be able to ask questions about their entire code base without, you know, context and retrieval and taking a chance on that. Like, I think it's generally been shown that if you have the space to just put the raw files inside of a big context window, that is still better than chunking and retrieval. So there's various things that we could do with longer context, faster speed, lower cost. Super excited about that. And that's the direction that we're going with the Phind model. And our big hypothesis there is precisely that we can take a really good open source model and then just train it on absolutely all of the high quality data that we can find. And there's a lot of various, you know, interesting ideas for this.
We have our own techniques that we're kind of playing with internally. One of the very interesting ideas that I've seen, I think it's called OctoPack from BigCode. I don't think that it made that big waves when it came out, I think in August. But the idea is that they have this data set that maps GitHub commits to a change. So basically there's all this really high quality, like human made, human written diff data out there on every time someone makes a commit in some repo. And you can use that to train models. Take the file state before and like given a commit message, what should that code look like in the future? [00:34:52]Swyx: Got it. [00:34:53]Alessio: Do you think your HumanEval is any good? Michael: So we ran this experiment. We trained the Phind model. And if you go to the BigCode leaderboard, as of today, October 5th, all of our models are at the top of the BigCode leaderboard by far. It's not close, particularly in languages other than Python. We have a 10 point gap between us and the next best model on JavaScript. I think C#, multilingual. And what we kind of learned from that whole experience releasing those models is that HumanEval doesn't really matter. Not just that, but GPT-4 itself has been trained on HumanEval. And we know this because GPT-4 is able to predict the exact docstring in many of the problems. I've seen it predict like the specific example values in the docstring, which is extremely improbable. So I think there's a lot of dataset contamination and it only captures a very limited subset of what programmers are actually doing. What we do internally for evaluations is we have GPT-4 score answers. GPT-4 is a really good evaluator. I mean, obviously, by really good, I mean it's the best that we have. I'm sure that, you know, a couple of months from now, next year, we'll be like, oh, you know, like GPT-4.5, GPT-5, it's so much better. Like GPT-4 is terrible, but like right now it's the best that we have short of humans.
And what we found is that when doing like temperature zero evals, GPT-4 is actually mostly deterministic across runs in assigning scores to two different answers. So we found it to be a very useful tool in comparing our model to, say, GPT-4 on our internal, real-world, here's-what-people-will-be-asking-this-model dataset. And the other thing that we're running is just like releasing the model to our users and just seeing what they think. Because that's like the only thing that really matters is like releasing it for the application that it's intended for, and then seeing how people react. And for the most part, the incredible thing is that people don't notice a difference between our model and GPT-4 for the vast majority of searches. There's some reasoning problems that GPT-4 can still do better. We're working on addressing that. But in terms of like the types of questions that people are asking on Phind, there's not that much difference. And in fact, I've been running my own kind of side by side comparisons, shout out to GodMode, by the way. [00:37:16]Michael: And I've like myself, I've kind of confirmed this to be the case. And even sometimes it gives a better answer, perhaps like more concise or just like better implementation than GPT-4, which that's what surprises me. And by now we kind of have like this reasoning is all you need kind of hypothesis where we've seen emergent capabilities in the Phind model, where, by training it on high quality code, it can actually like reason better. It went from not being able to solve word problems, or riddles with like temporal placement of objects and moving and stuff like that, that GPT-4 can do pretty well. We went from not being able to do those at all to being able to do them just by training on more code, which is wild. So we're already like starting to see like these emergent capabilities.
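The model-as-judge evaluation Michael describes (a strong model comparing two answers to the same question at temperature zero) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Phind's evaluation harness: `Judge`, `PROMPT`, `pairwise_verdict`, and the stub judges are hypothetical names, no real model API is called, and asking the judge twice with the answer order swapped is an added precaution against position bias that the conversation does not mention.

```python
from typing import Callable

# Hypothetical judge interface: takes a prompt, returns "A" or "B".
# In practice this would wrap a call to a strong model at temperature 0.
Judge = Callable[[str], str]

PROMPT = (
    "Question:\n{question}\n\n"
    "Answer A:\n{a}\n\nAnswer B:\n{b}\n\n"
    "Which answer is better? Reply with exactly 'A' or 'B'."
)


def pairwise_verdict(question: str, ours: str, baseline: str, judge: Judge) -> str:
    """Return 'ours', 'baseline', or 'tie' for one question."""
    votes = {"ours": 0, "baseline": 0}

    # Round 1: our answer is shown as A.
    first = judge(PROMPT.format(question=question, a=ours, b=baseline)).strip().upper()
    # Round 2: order swapped, so our answer is shown as B (assumption: guards
    # against a judge that favours whichever answer comes first).
    second = judge(PROMPT.format(question=question, a=baseline, b=ours)).strip().upper()

    if first == "A":
        votes["ours"] += 1
    elif first == "B":
        votes["baseline"] += 1
    if second == "B":
        votes["ours"] += 1
    elif second == "A":
        votes["baseline"] += 1

    if votes["ours"] > votes["baseline"]:
        return "ours"
    if votes["baseline"] > votes["ours"]:
        return "baseline"
    return "tie"


def prefers_shorter(prompt: str) -> str:
    """Stub judge for demonstration only: picks the shorter answer."""
    a = prompt.split("Answer A:\n")[1].split("\n\nAnswer B:\n")[0]
    b = prompt.split("Answer B:\n")[1].split("\n\nWhich answer")[0]
    return "A" if len(a) <= len(b) else "B"


def always_first(prompt: str) -> str:
    """Stub judge with pure position bias: always answers 'A'."""
    return "A"
```

With the order swap, a judge that only ever picks the first answer produces a tie rather than a spurious win, which is one way to sanity-check a judge before trusting its scores.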
[00:37:59]Swyx: So I just wanted to make sure that we have the, I guess, like the model card in our heads. So you started from Code Llama? [00:38:07]Alessio: Yes. [00:38:08]Swyx: 65, 34? 34. [00:38:10]Michael: So unfortunately, there's no Code Llama 70B. If there was, that would be super cool. But there's not. [00:38:15]Swyx: 34. And then, which in itself was Llama 2, which is on 2 trillion tokens and the added 500 billion code tokens. Yes. [00:38:22]Michael: And you just added a bunch more. [00:38:23]Alessio: Yeah. [00:38:24]Michael: And they also did a couple of things. So they did, I think they did 500 billion, like general pre-training and then they did an extra 20 billion long context pre-training. So they actually increased the like max position tokens to 16k up from 8k. And then they changed the theta parameter for the RoPE embeddings as well to give it theoretically better long context support up to 100k tokens. But yeah, but otherwise it's like basically Llama 2. [00:38:50]Swyx: And so you just took that and just added data. [00:38:52]Michael: Exactly. [00:38:53]Swyx: You didn't do any other fundamental. [00:38:54]Michael: Yeah. So we didn't actually, we haven't yet done anything with the model architecture and we just trained it on like many, many more billions of tokens on our own infrastructure. And something else that we're taking a look at now is using reinforcement learning for correctness. One of the interesting pitfalls that we've noticed with the Phind model is that in cases where it gets stuff wrong, it sometimes is capable of getting the right answer. It's just, there's a big variance problem. It's wildly inconsistent. There are cases when it is able to get the right chain of thought and able to arrive [00:39:25]Alessio: at the right answer, but not always.
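To make the RoPE theta remark concrete, here is a small sketch of why raising the base helps with long context. In rotary position embeddings, each pair of dimensions rotates with frequency theta^(-2i/d), so its wavelength in tokens is 2π · theta^(2i/d); a larger theta stretches the slowest rotations over many more positions. The specific values below are assumptions for illustration: a head dimension of 128, a base of 10,000 as the common default, and 1,000,000 as the raised base reported publicly for Code Llama. `rope_wavelengths` is a hypothetical helper, not a library function.

```python
import math
from typing import List


def rope_wavelengths(head_dim: int, theta: float) -> List[float]:
    """Wavelength (in tokens) of each rotary dimension pair.

    Pair i rotates by angle pos * theta**(-2i/head_dim), so it completes one
    full turn every 2*pi * theta**(2i/head_dim) positions.
    """
    return [
        2 * math.pi * theta ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]


# Assumed values for illustration only.
HEAD_DIM = 128
default_base = rope_wavelengths(HEAD_DIM, 10_000.0)
raised_base = rope_wavelengths(HEAD_DIM, 1_000_000.0)

# The fastest pair is unaffected (one turn every 2*pi tokens); the slowest
# pair becomes much slower, so far-apart positions stay distinguishable.
longest_default = max(default_base)
longest_raised = max(raised_base)
```

Under these assumed values, the slowest rotation with the default base wraps around well before 100k tokens, while with the raised base it does not, which matches the "theoretically better long context support" framing in the conversation.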
[00:39:27]Michael: And so like one of our hypotheses is something that we're going to try is that like we can actually do reinforcement learning on, for a given problem, generate a bunch of completions and then like use the correct answer as like a loss basically to try to get it to be more correct. And I think there's a high chance I think of this working because it's very similar to the like RLHF method where you basically show pairs of completions for a given question except the criteria is like which one is like less harmful. But here we have a different criteria. But if the model is already capable of getting the right answer, which it is, we just need to cajole it into being more consistent. [00:40:06]Alessio: There were a couple of things that I noticed in the product that were not strange but unique. So first of all, the model can talk multiple times in a row, whereas most other applications are like human, model, human, model. And then outside of the thumbs up, thumbs down, you have things like have the LLM prioritize this message and its answers or then continue from this message to like go back. How does that change the flow of the user and like in terms of like prompting it, yeah, what are like some tricks or learnings you've had? [00:40:37]Michael: So yeah, that's specifically in our pair programmer mode, which is a more conversational mode that also like asks you clarifying questions back if it doesn't fully understand what you're doing and it kind of holds your hand a bit more. And so from user feedback, we had requests to make more of an AutoGPT where you can kind of give it this problem that might take multiple searches or multiple different steps like multiple reasoning steps to solve. And so that's the impetus behind building that product. Being able to do multiple steps and also be able to handle really long conversations.
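The correctness idea from the start of this answer (sample several completions for one problem, then use the known answer as the training signal) can be sketched as a data-preparation step. This is a rough illustration under stated assumptions, not Phind's method: `extract_answer`, `reward`, `consistency`, and `build_preference_pairs` are hypothetical helpers, and treating the last non-empty line of a completion as its final answer is an assumption made to keep the example small. It only builds RLHF-style (preferred, rejected) pairs; the actual policy update is out of scope.

```python
from itertools import product
from typing import List, Tuple


def extract_answer(completion: str) -> str:
    """Assumption: the final answer is the last non-empty line."""
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def reward(completion: str, reference: str) -> float:
    """1.0 if the completion ends in the reference answer, else 0.0."""
    return 1.0 if extract_answer(completion) == reference.strip() else 0.0


def consistency(completions: List[str], reference: str) -> float:
    """Fraction of sampled completions that reach the right answer."""
    if not completions:
        return 0.0
    return sum(reward(c, reference) for c in completions) / len(completions)


def build_preference_pairs(
    completions: List[str], reference: str
) -> List[Tuple[str, str]]:
    """Pair every correct completion with every incorrect one.

    Each (preferred, rejected) pair has the same shape as an RLHF comparison,
    but the criterion is correctness rather than harmlessness.
    """
    correct = [c for c in completions if reward(c, reference) == 1.0]
    wrong = [c for c in completions if reward(c, reference) == 0.0]
    return list(product(correct, wrong))


# Toy samples for one problem whose reference answer is "42".
samples = [
    "6 * 7 is computed first.\n42",
    "Adding instead of multiplying.\n13",
    "Six groups of seven.\n42",
]
pairs = build_preference_pairs(samples, "42")
```

A low `consistency` score with at least one correct sample is the "capable but inconsistent" case described in the conversation, and those are the problems where such pairs exist at all.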
Like people are really trying to use the pair programmer to go from like sometimes really from like basic idea to like complete working code. And so what we noticed is that we were having like these very, very long threads, sometimes with like 60 messages, like 100 messages. And with those it becomes really, really challenging to manage the appropriate context window of what should go inside of the context and how to preserve the context so that the model can continue or the product can continue giving good responses, even if you're like 60 messages deep in a conversation. So that's where the prioritized user messages like comes from. It's like people have asked us to just like let them pin messages that they want to be left in the conversation. And yeah, and then that seems to have like really gone a long way towards solving that problem, yeah. [00:41:54]Alessio: And then you have a run on Replit thing. Are you planning to build your own repl? Like learning some people trying to run the wrong code, unsafe code? [00:42:03]Michael: Yes. Yes. So I think like in the long term vision of like being a place where people can go from like idea to like fully working code, having a code sandbox, like a natively integrated code sandbox makes a lot of sense. And Replit is great and people use that feature. But yeah, I think there's more we can do in terms of like having something a bit closer to code interpreter where it's able to run the code and then like recursively iterate on it. Exactly. [00:42:31]Swyx: So you're working on APIs to enable you to do that? Yep. So Amjad has specifically told me in person that he wants to enable that for people at the same time. He's also working on his own models, and Ghostwriter and you know, all the other stuff. So it's going to get interesting. Like he wants to power you, but also compete with you. Yeah. [00:42:47]Michael: And like, and we love Replit.
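The pinned-message idea discussed above for very long threads can be sketched as a simple context-selection rule: always keep pinned messages, then fill the remaining budget with the most recent unpinned ones. This is a minimal sketch, not Phind's implementation: `Message`, `approx_tokens`, and `select_context` are hypothetical names, and counting whitespace-separated words stands in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str
    text: str
    pinned: bool = False


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def select_context(messages: List[Message], budget: int) -> List[Message]:
    """Choose which messages go into the model's context window.

    Pinned messages are always kept (assumption: they fit in the budget).
    Remaining budget is filled from newest to oldest, stopping at the first
    message that does not fit so the recent history stays contiguous.
    """
    keep = {i for i, m in enumerate(messages) if m.pinned}
    used = sum(approx_tokens(messages[i].text) for i in keep)

    for i in range(len(messages) - 1, -1, -1):
        if i in keep:
            continue
        cost = approx_tokens(messages[i].text)
        if used + cost > budget:
            break
        keep.add(i)
        used += cost

    # Preserve the original conversation order.
    return [messages[i] for i in sorted(keep)]


thread = [
    Message("user", "build a todo app in rust", pinned=True),
    Message("assistant", "here is a first draft", ),
    Message("user", "add a delete command"),
    Message("assistant", "done, see below"),
]
context = select_context(thread, budget=13)
```

In this toy thread the pinned first message survives even though an intermediate reply is dropped, which is the behaviour users were asking for when they wanted to pin messages 60 turns deep.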
I think that a lot of the companies in our space, like we're all going to converge to solving a very similar problem, but from a different angle. So like Replit approaches this problem from the IDE side. Like they started as like this IDE that you can run in the browser. And they started from that side, making coding just like more accessible. And we're approaching it from the side of like an LLM that's just like connected to everything that it needs to be connected to, which includes your code context. So that's why we're kind of making inroads into IDEs, but we're kind of, we're approaching this problem from different sides. And I think it'll be interesting to see where things end up. But I think that in the long, long term, we have an opportunity to also just have like this general technical reasoning engine product that's potentially also not just for, not just for programmers. It's also powered in this web interface, like where there's potential, I think other things that we will build that eventually might go beyond like our current scope. [00:43:49]Swyx: Exciting. We'll look forward to that. We're going to zoom out a little bit into sort of AI ecosystem stories, but first we got to get the Paul Graham, Ron Conway story. [00:43:59]Alessio: Yeah. [00:44:00]Michael: So flashback to last summer, we're in the YC batch. We're doing the summer batch, summer 22. So the summer batch runs from June to September, approximately. And so this was late July, early August, right around the time that many like YC startups start like going out, like gearing up, here's how we're going to pitch investors and everything. And at the same time, me and my co-founder, Justin, we were planning on moving to New York. So for a long time, actually, we were thinking about building this company in New York, mainly for personal reasons, actually, because like during the pandemic, pre-ChatGPT, pre last year, pre the AI boom, SF unfortunately really kind of, you know, like lost its luster.
Yeah. Like no one was here. It was far from clear, like if there would be an AI boom, if like SF would be like... [00:44:49]Alessio: Back. [00:44:50]Michael: Yeah, exactly. Back. As everyone is saying these days, it was far from clear. And so, and all of our friends, we were graduating college because like we happened to just graduate college and immediately start YC, like we didn't even have, I think we had a week in between. [00:45:06]Swyx: You didn't bother looking for jobs. You were just like, this is what we want to do. [00:45:08]Michael: Well, actually both me and my co-founder, we had jobs that we secured in 2021 from previous internships, but we both, funny enough, when I spoke to my boss's boss at the company at where I reneged my offer, I told him we got into YC, they actually said, yeah, you should do YC. [00:45:27]Swyx: Wow. [00:45:28]Alessio: That's very selfless. [00:45:29]Swyx: That was really great that they did that. But in San Francisco, they would have offered to invest as well. [00:45:33]Michael: Yes, they would have. But yeah, but we were both planning to be in New York and all of our friends were there from college at this point, like we have this whole plan where like on August 1st, we're going to move to New York and we had like this Airbnb for the month of New York. We're going to stay there and we're going to work and like all of that. The day before we go to New York, I called Justin and I just, I tell him like, why are we doing this? Because in our batch, by the time August 1st rolled around, all of our mentors at YC were saying like, hey, like you should really consider staying in SF. [00:46:03]Swyx: It's the hybrid batch, right? [00:46:04]Michael: Yeah, it was the hybrid batch, but like there were already signs that like something was kind of like afoot in SF, even if like we didn't fully want to admit it yet. And so we were like, I don't know, I don't know. 
Something kind of clicked when the rubber met the road and it was time to go to New York. We're like, why are we doing this? And like, we didn't have any good reasons for staying in New York at that point beyond like our friends are there. So we still go to New York because like we have the Airbnb, like we don't have any other kind of place to go for the next few weeks. We're in New York and New York is just unfortunately too much fun. Like all of my other friends from college who are just, you know, basically starting their jobs, starting their lives as adults. They just stepped into these jobs, they're making all this money and they're like partying and like all these things are happening. And like, yeah, it's just a very distracting place to be. And so we were just like sitting in this like small, you know, like cramped apartment, terrible posture, trying to get as much work done as we can, too many distractions. And then we get this email from YC saying that Paul Graham is in town in SF and he is doing office hours with a certain number of startups in the current batch. And whoever signs up first gets it. And I happen to be super lucky. I was about to go for a run, but I just, I saw the email notification come across the screen. I immediately clicked on the link and like immediately, like half the spots were gone, but somehow the very last spot was still available. And so I picked the very, very last time slot at 7 p.m. semi-strategically, you know, so we would have like time to go over. And also because I didn't really know how we're going to get to SF yet. And so we made a plan that we're going to fly from New York to SF and back to New York in one day and do like the full round trip. And we're going to meet with PG at the YC Mountain View office. And so we go there, we do that, we meet PG, we tell him about the startup. And one thing I love about PG is that he gets like, he gets so excited.
Like when he gets excited about something, like you can see his eyes like really light up. And he'll just start asking you questions. In fact, it's a little challenging sometimes to like finish kind of like the rest of like the description of your pitch because like, he'll just like asking all these questions about how it works. And I'm like, you know, what's going on? [00:48:19]Swyx: What was the most challenging question that he asked you? [00:48:21]Michael: I think that like really how it worked. Because like as soon as like we told him like, hey, like we think that the future of search is answers, not links. Like we could really see like the gears turning in his head. I think we were like the first demo of that. [00:48:35]Swyx: And you're like 10 minutes with him, right? [00:48:37]Michael: We had like 45, yeah, we had a decent chunk of time. And so we tell him how it works. Like he's very excited about it. And I just like, I just blurted out, I just like asked him to invest and he hasn't even seen the product yet. We just asked him to invest and he says, yeah. And like, we're super excited about that. [00:48:55]Swyx: You haven't started your batch. [00:48:56]Michael: No, no, no. This is about halfway through the batch or two, two, no, two thirds of the batch. [00:49:02]Swyx: And you're like not technically fundraising yet. We're about to start fundraising. Yeah. [00:49:06]Michael: So we have like this demo and like we showed him and like there was still a lot of issues with the product, but I think like it must have like still kind of like blown his mind in some way. So like we're having fun. He's having fun. We have this dinner planned with this other friend that we had in SF because we were only there for that one day. So we thought, okay, you know, after an hour we'll be done, you know, we'll grab dinner with our friend and we'll fly back to New York. But PG was like, like, I'm having so much fun. Do you want to have dinner? Yeah. Come to my house. 
Or he's like, I gotta go have dinner with my wife, Jessica, who's also awesome, by the way. [00:49:40]Swyx: She's like the heart of YC. Yeah. [00:49:42]Michael: Jessica does not get enough credit as an aside for her role. [00:49:46]Swyx: He tries. [00:49:47]Michael: He understands like the technical side and she understands people and together they're just like a phenomenal team. But he's like, yeah, I got to go see Jessica, but you guys are welcome to come with. Do you want to come with? And we're like, we have this friend who's like right now outside of like literally outside the door who like we also promised to get dinner with. It's like, we'd love to, but like, I don't know if we can. He's like, oh, he's welcome to come too. So all of us just like hop in his car and we go to his house and we just like have this like we have dinner and we have this just chat about the future of search. Like I remember him telling Jessica distinctly, like our kids as kids are not going to know what like a search result is. Like they're just going to like have answers. That was really like a mind blowing, like inflection point moment for sure. [00:50:34]Swyx: Wow, that email changed your life. [00:50:35]Michael: Absolutely. [00:50:36]Swyx: And you also just spoiled the booking system for PG because now everyone's just going to go after the last slot. Oh man. [00:50:42]Michael: Yeah. But like, I don't know if he even does that anymore. [00:50:46]Swyx: He does. He does. Yeah. I've met other founders that he did it this year. [00:50:49]Michael: This year. Gotcha. But when we told him about how we did it, he was like, I am like frankly shocked that YC just did like a random like scheduling system. [00:50:55]Alessio: They didn't like do anything else. But, um. [00:50:58]Swyx: Okay. And then he introduces you to Ron Conway. Yes. Who is one of the most legendary angels in Silicon Valley. [00:51:04]Michael: Yes. So after PG invested, the rest of our round came together pretty quickly.
[00:51:10]Swyx: I'm, by the way, I'm surprised. Like it's, it might feel like playing favorites right within the current batch to be like, yo, PG invested in this one. Right. [00:51:17]Alessio: Too bad for the others. [00:51:18]Swyx: Too bad for the others, I guess. [00:51:19]Michael: I think this is a bigger point about YC and like these accelerators in general is like YC gets like a lot of criticism from founders who feel like they didn't get value out of it. But like, in my view, YC is what you make of it. And YC tells you this. They're like, you really got to grab this opportunity, like by the balls and make the most of it. And if you do, then it could be the best thing in the world. And if you don't, and if you're just kind of like a passive, even like an average founder in YC, you're still going to fail. And they tell you that. They're like, if you're average in your batch, you're going to fail. Like you have to just be exceptional in every way. With that in mind, perhaps that's even part of the reason why we asked PG to invest. And so yeah, after PG invested, the rest of our round came together pretty quickly, which I'm very fortunate for. And yeah, he introduced us to Ron. And after he did, I get a call from Ron. And then Ron says like, hey, like PG tells me what you're working on. I'd love to come meet you guys. And I'm like, wait, no way. And then we're just holed up in this like little house in San Mateo, which is a little small, but you know, it had a nice patio. In fact, we had like a monitor set up outside on the deck out there. And so Ron Conway comes over, we go over to the patio where like our workstation is. And Ron Conway, he's known for having like this notebook that he goes around with where he like sits down with the notebook and like takes very, very detailed notes. So he never like forgets anything. So he sits down with his notebook and he asks us like, hey guys, like, what do you need? And we're like, oh, we need GPUs.
Back then, the GPU shortage wasn't even nearly as bad as it is now. But like even then, it was still challenging to get like the quota that we needed. And he's like, okay, no problem. And then like he leaves a couple hours later, we get an email and we're CC'd on an email that Ron wrote to Jensen, the CEO of Nvidia, saying like, hey, these guys need GPUs. [00:53:02]Swyx: You didn't say how much? It was just like, just give them GPUs. [00:53:04]Alessio: Basically, yeah. [00:53:05]Michael: Ron is known for writing these like one-liner emails that are like very short, but very to the point. And I think that's why like everyone responds to Ron. Everyone loves Ron. And so Jensen responds. He responds quickly, like tagging this VP of AI at Nvidia. And we start working with Nvidia, which is great. And something that I love about Nvidia, by the way, is that after that intro, we got matched with like a dedicated team. And at Nvidia, they know that they're going to win regardless. So they don't care where you get the GPUs from. They're like, they're truly neutral, unlike various sales reps that you might encounter at various like clouds and, you know, hardware companies, et cetera. They actually just want to help you because they know they don't care. Like regardless, they know that if you're getting Nvidia GPUs, they're still winning. So I guess that's a tip is that like if you're looking for GPUs, like, Nvidia, they'll help you do it. [00:53:54]Swyx: So just to tie up this thing, because so first of all, that's a fantastic story. And I just wanted to let you tell that because it's special. That is a strategic shift, right? That you already decided to make by the time you met Ron, which is we are going to have our own hardware. We're going to rack them in a data center somewhere. [00:54:11]Michael: Well, not even that we need our own hardware because actually we don't. Right. But we just need GPUs, period.
And like every cloud loves like they have their own sales tactics and like they want to make you commit to long terms and like very non-flexible terms. And like there's a web of different things that you kind of have to navigate. Nvidia will kind of be to the point like, OK, you can do this on this cloud, this on this cloud. Like this is your budget. Maybe you want to consider buying as well. Like they'll help you walk through what the options are. And the reason why they're helpful is because like they look at the full picture. So they'll help you with the hardware. And in terms of software, they actually implemented a custom feature for us in FasterTransformer, which is one of their libraries. Swyx: For you? [00:54:53]Michael: For us. Yeah. Which is wild. I don't think they would have done it otherwise. They implemented streaming generation for T5 based models, which we were running at the time up until we switched to GPT in February, March of this year. So they implemented that just for us, actually, in FasterTransformer. And so like they'll help you like look at the complete picture and then just help you get done what you need to get done. I know one of your interests is also local models, open source models and hardware kind of goes hand in hand. Alessio: Any fun projects, explorations in the space that you want to share with local llamas and stuff? [00:55:27]Michael: Yeah, it's something that we're very interested in because something that kind of we're hearing a lot about is like people want something like Phind, especially comp

Latest podcast episodes about imagenet

Building Human-Centered AI

Teknik - What if AI no longer needed your data? - Because... it's episode 0x735!

Unified Latents (UL): How to train your latents (Teaser for Feb 28th Technical Update)

Episode #524: The 500-Year Prophecy: Why Buddhism and AI Are Colliding Right Now

Episode 153: 2025 Holiday Gift Guide

What Comes After ChatGPT? The Mother of ImageNet Predicts The Future

After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs

The Godmother of AI on jobs, robots & why world models are next | Dr. Fei-Fei Li

The Frontier of Spatial Intelligence with Fei-Fei Li

What's Your Problem: "Teaching Computers to See"

How to Benchmark Your Pricing Like AI Models with Steven Forth

Teaching AI to Understand the Physical World, with Dr. Fei-Fei Li of World Labs

408. Synthetic Text Extruder Hype (ft. Emily Bender, Alex Hanna)

Understanding the Elegant Math Behind Modern Machine Learning

[Editorial] Comparing AIs with one another makes no sense

Prof. Jakob Foerster - ImageNet Moment for Reinforcement Learning?

Fei-Fei Li: Staying curious at the forefront of AI

Why Machines Learn: The Elegant Math Behind AI with Anil Ananthaswamy | SparX by Mukesh Bansal

Fei-Fei Li on spatial intelligence and human-centered AI

NASA's Moon Micro-Mission, AgiBot's Humanoid Robot Training Dataset, and Microsoft's $100B AGI Definition

Classifying Images: Massive Parallelism And Surface Features

2024 in Vision [LS Live @ NeurIPS]

Dr. Olga Russakovsky: Shaping the Next Generation of AI Leaders

Night Talk - Prof. Fei-Fei Li of Stanford University, USA: her AI research journey and recommendations for Vietnam on developing AI

#220 Terry Sejnowski: The Future of AI, ChatGPT & Deep Learning

Some Changes at The Gradient

Vol.048 Godfather of AI and Nobel laureate Hinton: nearly joined Baidu, firm believer in neural networks

The Frontier of Spatial Intelligence with Fei-Fei Li

Jim Fan on Nvidia's Embodied AI Lab and Jensen Huang's Prediction that All Robots will be Autonomous

Adversarial Examples and Data Modelling - Andrew Ilyas (MIT)

AF - The 'strong' feature hypothesis could be wrong by lewis smith

LW - The 'strong' feature hypothesis could be wrong by lsgos

History of AI - EP06 Part 1: The Effortless Podcast

Oral History: Human Centered AI

LW - Rational Animations' intro to mechanistic interpretability by Writer

ICLR 2024 — Best Papers & Talks (ImageGen, Vision, Transformers, State Space Models) ft. Christian Szegedy, Ilya Sutskever, Durk Kingma

205 – ImageNet and Flogistix Team Up for Immersive Experience with Ali Sylvester and Kyle Kempf

LW - Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers by hugofry

Fei-Fei Li: The “Godmother of AI”, Keeping Humanity at the Heart of the AI Revolution | E285

Teaching Computers to See

NeurIPS 2023 Recap — Best Papers

Fei-Fei Li: Exploring the AI Revolution

Dr. Fei-Fei Li, Artificial Intelligence Pioneer, on Creating Human-Centered AI

Babbage: Fei-Fei Li on how to really think about the future of AI

The Worlds She Sees with Godmother of AI, Fei-Fei Li

AI pioneer Fei-Fei Li on the future of humanity

The Worlds She Sees with Fei-Fei Li

Beating GPT-4 with Open Source LLMs — with Michael Royzen of Phind