Podcasts about huggingface

90PODCASTS
183EPISODES
59mAVG DURATION
1EPISODE EVERY OTHER WEEK
Nov 7, 2025LATEST

POPULARITY

20172018201920202021202220232024

Best podcasts about huggingface

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

14 episodes with huggingface

Thinking Elixir Podcast

6 episodes with huggingface

The Nonlinear Library

5 episodes with huggingface

The AI Breakdown: Daily Artificial Intelligence News and Discussions

3 episodes with huggingface

Bittensor Guru

3 episodes with huggingface

Papers Read on AI

5 episodes with huggingface

The Gradient Podcast

3 episodes with huggingface

The Machine Learning Podcast

3 episodes with huggingface

GPT Reviews

5 episodes with huggingface

The top AI news from the past week, every ThursdAI

27 episodes with huggingface

We Decentralize Tech

2 episodes with huggingface

programmier.bar – der Podcast für App- und Webentwicklung

8 episodes with huggingface

The Nonlinear Library: LessWrong

3 episodes with huggingface

Latest podcast episodes about huggingface

The top AI news from the past week, every ThursdAI

Play Episode Listen Later Nov 7, 2025 92:45

Hey, Alex here! Quick note, while preparing for this week, I posted on X that I don't remember such a quiet week in AI since I started doing ThursdAI regularly, but then 45 min before the show started, Kimi dropped a SOTA oss reasoning model, turning a quiet week into an absolute banger. Besides Kimi, we covered the updated MCP thinking from Anthropic, and had Kenton Varda from cloudflare as a guest to talk about Code Mode, chatted about Windsurf and Cursor latest updates and covered OpenAI's insane deals. Also, because it was a quiet week, I figured I'd use the opportunity to create an AI powered automation, and used N8N for that, and shared it on the stream, so if you're interested in automating with AI with relatively low code, this episode is for you. Let's dive inThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.Kimi K2 Thinking is Here and It's a 1 Trillion Parameter Beast! (X, HF, Tech Blog)Let's start with the news that got everyone's energy levels skyrocketing right as we went live. Moonshot AI dropped Kimi K2 Thinking, an open-source, 1 trillion-parameter Mixture-of-Experts (MoE) model, and it's an absolute monster.This isn't just a numbers game; Kimi K2 Thinking is designed from the ground up to be a powerful agent. With just around 32 billion active parameters during inference, a massive 256,000 token context window, and an insane tool-calling capacity. They're claiming it can handle 200-300 sequential tool calls without any human intervention. The benchmarks are just as wild. On the Humanities Last Exam (HLE), they're reporting a score of 44.9%, beating out both GPT-5 and Claude 4.5 Thinking. While it doesn't quite top the charts on SWE-bench verified, it's holding its own against the biggest closed-source models out there. Seeing an open-source model compete at this level is incredibly exciting.During the show, we saw some truly mind-blowing demos, from a beautiful interactive visualization of gradient descent to a simulation of a virus attacking cells, all generated by the model. The model's reasoning traces, which are exposed through the API, also seem qualitatively different from other models, showing a deep and thoughtful process. My co-hosts and I were blown away. The weights and a very detailed technical report are available on Hugging Face, so you can dive in and see for yourself. Shout out to the entire Moonshot AI team for this incredible release!Other open source updates from this week* HuggingFace released an open source “Smol Training Playbook” on training LLMs, it's a 200+ interactive beast with visualizations, deep dives into pretraining, dataset, postraining and more! (HF)* Ai2 launches OlmoEarth — foundation models + open, end-to-end platform for fast, high-resolution Earth intelligence (X, Blog)* LongCat-Flash-Omni — open-source omni-modal system with millisecond E2E spoken interaction, 128K context and a 560B ScMoE backbone (X, HF, Announcement)Big Tech's Big Moves: Apple, Amazon, and OpenAIThe big companies were making waves this week, starting with a blockbuster deal that might finally make Siri smart. Apple is reportedly will be paying Google around $1 billion per year to license a custom 1.2 trillion-parameter version of Gemini to power a revamped Siri.This is a massive move. The Gemini model will run on Apple's Private Cloud Compute, keeping user data walled off from Google, and will handle Siri's complex summarizer and planner functions. After years of waiting for Apple to make a significant move in GenAI, it seems they're outsourcing the heavy lifting for now while they work to catch up with their own in-house models. As a user, I don't really care who builds the model, as long as Siri stops being dumb!In more dramatic news, Perplexity revealed that Amazon sent them a legal threat to block their Comet AI assistant from shopping on Amazon.com. This infuriated me. My browser is my browser, and I should be able to use whatever tools I want to interact with the web. Perplexity took a strong stand with their blog post, “Bullying is Not Innovation,” arguing that user agents are distinct from scrapers and act on behalf of the user with their own credentials. An AI assistant is just that—an assistant. It shouldn't matter if I ask my wife or my AI to buy something for me on Amazon. This feels like a move by Amazon to protect its ad revenue at the expense of user choice and innovation, and I have to give major props to Perplexity for being so transparent and fighting back.Finally, OpenAI continues its quest for infinite compute, announcing a multi-year strategic partnership with AWS. This comes on top of massive deals with NVIDIA, Microsoft, Oracle, and others, bringing their total commitment to compute into the trillions of dollars. It's getting to a point where OpenAI seems “too big to fail,” as any hiccup could have serious repercussions for the entire tech economy, which is now heavily propped up by AI investment. Sam has clarified that they don't think OpenAI wants to be too big to fail in a recent post on X, and that the recent miscommunications around the US government backstopping OpenAI's infrastructure bailouts were taken out of context.

LCC 330 - Nano banana l'AI de Julia

Les Cast Codeurs Podcast

Play Episode Listen Later Sep 15, 2025 108:38

Katia, Emmanuel et Guillaume discutent Java, Kotlin, Quarkus, Hibernate, Spring Boot 4, intelligence artificielle (modèles Nano Banana, VO3, frameworks agentiques, embedding). On discute les vulnerabilités OWASP pour les LLMs, les personalités de codage des différents modèles, Podman vs Docker, comment moderniser des projets legacy. Mais surtout on a passé du temps sur les présentations de Luc Julia et les différents contre points qui ont fait le buzz sur les réseaux. Enregistré le 12 septembre 2025 Téléchargement de l'épisode LesCastCodeurs-Episode-330.mp3 ou en vidéo sur YouTube. News Langages Dans cette vidéo, José détaille les nouveautés de Java entre Java 21 et 25 https://inside.java/2025/08/31/roadto25-java-language/ Aperçu des nouveautés du JDK 25 : Introduction des nouvelles fonctionnalités du langage Java et des changements à venir [00:02]. Programmation orientée données et Pattern Matching [00:43] : Évolution du “pattern matching” pour la déconstruction des “records” [01:22]. Utilisation des “sealed types” dans les expressions switch pour améliorer la lisibilité et la robustesse du code [01:47]. Introduction des “unnamed patterns” (_) pour indiquer qu'une variable n'est pas utilisée [04:47]. Support des types primitifs dans instanceof et switch (en preview) [14:02]. Conception d'applications Java [00:52] : Simplification de la méthode main [21:31]. Exécution directe des fichiers .java sans compilation explicite [22:46]. Amélioration des mécanismes d'importation [23:41]. Utilisation de la syntaxe Markdown dans la Javadoc [27:46]. Immuabilité et valeurs nulles [01:08] : Problème d'observation de champs final à null pendant la construction d'un objet [28:44]. JEP 513 pour contrôler l'appel à super() et restreindre l'usage de this dans les constructeurs [33:29]. JDK 25 sort le 16 septembre https://openjdk.org/projects/jdk/25/ Scoped Values (JEP 505) - alternative plus efficace aux ThreadLocal pour partager des données immutables entre threads Structured Concurrency (JEP 506) - traiter des groupes de tâches concurrentes comme une seule unité de travail, simplifiant la gestion des threads Compact Object Headers (JEP 519) - Fonctionnalité finale qui réduit de 50% la taille des en-têtes d'objets (de 128 à 64 bits), économisant jusqu'à 22% de mémoire heap Flexible Constructor Bodies (JEP 513) - Relaxation des restrictions sur les constructeurs, permettant du code avant l'appel super() ou this() Module Import Declarations (JEP 511) - Import simplifié permettant d'importer tous les éléments publics d'un module en une seule déclaration Compact Source Files (JEP 512) - Simplification des programmes Java basiques avec des méthodes main d'instance sans classe wrapper obligatoire Primitive Types in Patterns (JEP 455) - Troisième preview étendant le pattern matching et instanceof aux types primitifs dans switch et instanceof Generational Shenandoah (JEP 521) - Le garbage collector Shenandoah passe en mode générationnel pour de meilleures performances JFR Method Timing & Tracing (JEP 520) - Nouvel outillage de profilage pour mesurer le temps d'exécution et tracer les appels de méthodes Key Derivation API (JEP 510) - API finale pour les fonctions de dérivation de clés cryptographiques, remplaçant les implémentations tierces Améliorations du traitement des annotations dans Kotlin 2.2 https://blog.jetbrains.com/idea/2025/09/improved-annotation-handling-in-kotlin-2-2-less-boilerplate-fewer-surprises/ Avant Kotlin 2.2, les annotations sur les paramètres de constructeur n'étaient appliquées qu'au paramètre, pas à la propriété ou au champ Cela causait des bugs subtils avec Spring et JPA où la validation ne fonctionnait qu'à la création d'objet, pas lors des mises à jour La solution précédente nécessitait d'utiliser explicitement @field: pour chaque annotation, créant du code verbeux Kotlin 2.2 introduit un nouveau comportement par défaut qui applique les annotations aux paramètres ET aux propriétés/champs automatiquement Le code devient plus propre sans avoir besoin de syntaxe @field: répétitive Pour l'activer, ajouter -Xannotation-default-target=param-property dans les options du compilateur Gradle IntelliJ IDEA propose un quick-fix pour activer ce comportement à l'échelle du projet Cette amélioration rend l'intégration Kotlin plus fluide avec les frameworks majeurs comme Spring et JPA Le comportement peut être configuré pour garder l'ancien mode ou activer un mode transitoire avec avertissements Cette mise à jour fait partie d'une initiative plus large pour améliorer l'expérience Kotlin + Spring Librairies Sortie de Quarkus 3.26 avec mises à jour d'Hibernate et autres fonctionnalités - https://quarkus.io/blog/quarkus-3-26-released/ mettez à jour vers la 3.26.x car il y a eu une regression vert.x Jalon important vers la version LTS 3.27 prévue fin septembre, basée sur cette version Mise à jour vers Hibernate ORM 7.1, Hibernate Search 8.1 et Hibernate Reactive 3.1 Support des unités de persistance nommées et sources de données dans Hibernate Reactive Démarrage hors ligne et configuration de dialecte pour Hibernate ORM même si la base n'est pas accessible Refonte de la console HQL dans Dev UI avec fonctionnalité Hibernate Assistant intégrée Exposition des capacités Dev UI comme fonctions MCP pour pilotage via outils IA Rafraîchissement automatique des tokens OIDC en cas de réponse 401 des clients REST Extension JFR pour capturer les données runtime (nom app, version, extensions actives) Bump de Gradle vers la version 9.0 par défaut, suppression du support des classes config legacy Guide de démarrage avec Quarkus et A2A Java SDK 0.3.0 (pour faire discuter des agents IA avec la dernière version du protocole A2A) https://quarkus.io/blog/quarkus-a2a-java-0-3-0-alpha-release/ Sortie de l'A2A Java SDK 0.3.0.Alpha1, aligné avec la spécification A2A v0.3.0. Protocole A2A : standard ouvert (Linux Foundation), permet la communication inter-agents IA polyglottes. Version 0.3.0 plus stable, introduit le support gRPC. Mises à jour générales : changements significatifs, expérience utilisateur améliorée (côté client et serveur). Agents serveur A2A : Support gRPC ajouté (en plus de JSON-RPC). HTTP+JSON/REST à venir. Implémentations basées sur Quarkus (alternatives Jakarta existent). Dépendances spécifiques pour chaque transport (ex: a2a-java-sdk-reference-jsonrpc, a2a-java-sdk-reference-grpc). AgentCard : décrit les capacités de l'agent. Doit spécifier le point d'accès primaire et tous les transports supportés (additionalInterfaces). Clients A2A : Dépendance principale : a2a-java-sdk-client. Support gRPC ajouté (en plus de JSON-RPC). HTTP+JSON/REST à venir. Dépendance spécifique pour gRPC : a2a-java-sdk-client-transport-grpc. Création de client : via ClientBuilder. Sélectionne automatiquement le transport selon l'AgentCard et la configuration client. Permet de spécifier les transports supportés par le client (withTransport). Comment générer et éditer des images en Java avec Nano Banana, le “photoshop killer” de Google https://glaforge.dev/posts/2025/09/09/calling-nano-banana-from-java/ Objectif : Intégrer le modèle Nano Banana (Gemini 2.5 Flash Image preview) dans des applications Java. SDK utilisé : GenAI Java SDK de Google. Compatibilité : Supporté par ADK for Java ; pas encore par LangChain4j (limitation de multimodalité de sortie). Capacités de Nano Banana : Créer de nouvelles images. Modifier des images existantes. Assembler plusieurs images. Mise en œuvre Java : Quelle dépendance utiliser Comment s'authentifier Comment configurer le modèle Nature du modèle : Nano Banana est un modèle de chat qui peut retourner du texte et une image (pas simplement juste un modèle générateur d'image) Exemples d'utilisation : Création : Via un simple prompt textuel. Modification : En passant l'image existante (tableau de bytes) et les instructions de modification (prompt). Assemblage : En passant plusieurs images (en bytes) et les instructions d'intégration (prompt). Message clé : Toutes ces fonctionnalités sont accessibles en Java, sans nécessiter Python. Générer des vidéos IA avec le modèle Veo 3, mais en Java ! https://glaforge.dev/posts/2025/09/10/generating-videos-in-java-with-veo3/ Génération de vidéos en Java avec Veo 3 (via le GenAI Java SDK de Google). Veo 3: Annoncé comme GA, prix réduits, support du format 9:16, résolution jusqu'à 1080p. Création de vidéos : À partir d'une invite textuelle (prompt). À partir d'une image existante. Deux versions différentes du modèle : veo-3.0-generate-001 (qualité supérieure, plus coûteux, plus lent). veo-3.0-fast-generate-001 (qualité inférieure, moins coûteux, mais plus rapide). Rod Johnson sur ecrire des aplication agentic en Java plus facilement qu'en python avec Embabel https://medium.com/@springrod/you-can-build-better-ai-agents-in-java-than-python-868eaf008493 Rod the papa de Spring réécrit un exemple CrewAI (Python) qui génère un livre en utilisant Embabel (Java) pour démontrer la supériorité de Java L'application utilise plusieurs agents AI spécialisés : un chercheur, un planificateur de livre et des rédacteurs de chapitres Le processus suit trois étapes : recherche du sujet, création du plan, rédaction parallèle des chapitres puis assemblage CrewAI souffre de plusieurs problèmes : configuration lourde, manque de type safety, utilisation de clés magiques dans les prompts La version Embabel nécessite moins de code Java que l'original Python et moins de fichiers de configuration YAML Embabel apporte la type safety complète, éliminant les erreurs de frappe dans les prompts et améliorant l'outillage IDE La gestion de la concurrence est mieux contrôlée en Java pour éviter les limites de débit des APIs LLM L'intégration avec Spring permet une configuration externe simple des modèles LLM et hyperparamètres Le planificateur Embabel détermine automatiquement l'ordre d'exécution des actions basé sur leurs types requis L'argument principal : l'écosystème JVM offre un meilleur modèle de programmation et accès à la logique métier existante que Python Il y a pas mal de nouveaux framework agentic en Java, notamment le dernier LAngchain4j Agentic Spring lance un serie de blog posts sur les nouveautés de Spring Boot 4 https://spring.io/blog/2025/09/02/road_to_ga_introduction baseline JDK 17 mais rebase sur Jakarta 11 Kotlin 2, Jackson 3 et JUnit 6 Fonctionnalités de résilience principales de Spring : @ConcurrencyLimit, @Retryable, RetryTemplate Versioning d'API dans Spring Améliorations du client de service HTTP L'état des clients HTTP dans Spring Introduction du support Jackson 3 dans Spring Consommateur partagé - les queues Kafka dans Spring Kafka Modularisation de Spring Boot Autorisation progressive dans Spring Security Spring gRPC - un nouveau module Spring Boot Applications null-safe avec Spring Boot 4 OpenTelemetry avec Spring Boot Repos Ahead of Time (Partie 2) Web Faire de la recherche sémantique directement dans le navigateur en local, avec EmbeddingGemma et Transformers.js https://glaforge.dev/posts/2025/09/08/in-browser-semantic-search-with-embeddinggemma/ EmbeddingGemma: Nouveau modèle d'embedding (308M paramètres) de Google DeepMind. Objectif: Permettre la recherche sémantique directement dans le navigateur. Avantages clés de l'IA côté client: Confidentialité: Aucune donnée envoyée à un serveur. Coûts réduits: Pas besoin de serveurs coûteux (GPU), hébergement statique. Faible latence: Traitement instantané sans allers-retours réseau. Fonctionnement hors ligne: Possible après le chargement initial du modèle. Technologie principale: Modèle: EmbeddingGemma (petit, performant, multilingue, support MRL pour réduire la taille des vecteurs). Moteur d'inférence: Transformers.js de HuggingFace (exécute les modèles AI en JavaScript dans le navigateur). Déploiement: Site statique avec Vite/React/Tailwind CSS, déployé sur Firebase Hosting via GitHub Actions. Gestion du modèle: Fichiers du modèle trop lourds pour Git; téléchargés depuis HuggingFace Hub pendant le CI/CD. Fonctionnement de l'app: Charge le modèle, génère des embeddings pour requêtes/documents, calcule la similarité sémantique. Conclusion: Démonstration d'une recherche sémantique privée, économique et sans serveur, soulignant le potentiel de l'IA embarquée dans le navigateur. Data et Intelligence Artificielle Docker lance Cagent, une sorte de framework multi-agent IA utilisant des LLMs externes, des modèles de Docker Model Runner, avec le Docker MCP Tookit. Il propose un format YAML pour décrire les agents d'un système multi-agents. https://github.com/docker/cagent des agents “prompt driven” (pas de code) et une structure pour decrire comment ils sont deployés pas clair comment ils sont appelés a part dans la ligne de commande de cagent fait par david gageot L'owasp décrit l'independance excessive des LLM comme une vulnerabilité https://genai.owasp.org/llmrisk2023-24/llm08-excessive-agency/ L'agence excessive désigne la vulnérabilité qui permet aux systèmes LLM d'effectuer des actions dommageables via des sorties inattendues ou ambiguës. Elle résulte de trois causes principales : fonctionnalités excessives, permissions excessives ou autonomie excessive des agents LLM. Les fonctionnalités excessives incluent l'accès à des plugins qui offrent plus de capacités que nécessaire, comme un plugin de lecture qui peut aussi modifier ou supprimer. Les permissions excessives se manifestent quand un plugin accède aux systèmes avec des droits trop élevés, par exemple un accès en lecture qui inclut aussi l'écriture. L'autonomie excessive survient quand le système effectue des actions critiques sans validation humaine préalable. Un scénario d'attaque typique : un assistant personnel avec accès email peut être manipulé par injection de prompt pour envoyer du spam via la boîte de l'utilisateur. La prévention implique de limiter strictement les plugins aux fonctions minimales nécessaires pour l'opération prévue. Il faut éviter les fonctions ouvertes comme “exécuter une commande shell” au profit d'outils plus granulaires et spécifiques. L'application du principe de moindre privilège est cruciale : chaque plugin doit avoir uniquement les permissions minimales requises. Le contrôle humain dans la boucle reste essentiel pour valider les actions à fort impact avant leur exécution. Lancement du MCP registry, une sorte de méta-annuaire officiel pour référencer les serveurs MCP https://www.marktechpost.com/2025/09/09/mcp-team-launches-the-preview-version-of-the-mcp-registry-a-federated-discovery-layer-for-enterprise-ai/ MCP Registry : Couche de découverte fédérée pour l'IA d'entreprise. Fonctionne comme le DNS pour le contexte de l'IA, permettant la découverte de serveurs MCP publics ou privés. Modèle fédéré : Évite les risques de sécurité et de conformité d'un registre monolithique. Permet des sous-registres privés tout en conservant une source de vérité “upstream”. Avantages entreprises : Découverte interne sécurisée. Gouvernance centralisée des serveurs externes. Réduction de la prolifération des contextes. Support pour les agents IA hybrides (données privées/publiques). Projet open source, actuellement en version preview. Blog post officiel : https://blog.modelcontextprotocol.io/posts/2025-09-08-mcp-registry-preview/ Exploration des internals du transaction log SQL Server https://debezium.io/blog/2025/09/08/sqlserver-tx-log/ C'est un article pour les rugeux qui veulent savoir comment SQLServer marche à l'interieur Debezium utilise actuellement les change tables de SQL Server CDC en polling périodique L'article explore la possibilité de parser directement le transaction log pour améliorer les performances Le transaction log est divisé en Virtual Log Files (VLFs) utilisés de manière circulaire Chaque VLF contient des blocs (512B à 60KB) qui contiennent les records de transactions Chaque record a un Log Sequence Number (LSN) unique pour l'identifier précisément Les données sont stockées dans des pages de 8KB avec header de 96 bytes et offset array Les tables sont organisées en partitions et allocation units pour gérer l'espace disque L'utilitaire DBCC permet d'explorer la structure interne des pages et leur contenu Cette compréhension pose les bases pour parser programmatiquement le transaction log dans un prochain article Outillage Les personalités des codeurs des différents LLMs https://www.sonarsource.com/blog/the-coding-personalities-of-leading-llms-gpt-5-update/ GPT-5 minimal ne détrône pas Claude Sonnet 4 comme leader en performance fonctionnelle malgré ses 75% de réussite GPT-5 génère un code extrêmement verbeux avec 490 000 lignes contre 370 000 pour Claude Sonnet 4 sur les mêmes tâches La complexité cyclomatique et cognitive du code GPT-5 est dramatiquement plus élevée que tous les autres modèles GPT-5 introduit 3,90 problèmes par tâche réussie contre seulement 2,11 pour Claude Sonnet 4 Point fort de GPT-5 : sécurité exceptionnelle avec seulement 0,12 vulnérabilité par 1000 lignes de code Faiblesse majeure : densité très élevée de “code smells” (25,28 par 1000 lignes) nuisant à la maintenabilité GPT-5 produit 12% de problèmes liés à la complexité cognitive, le taux le plus élevé de tous les modèles Tendance aux erreurs logiques fondamentales avec 24% de bugs de type “Control-flow mistake” Réapparition de vulnérabilités classiques comme les failles d'injection et de traversée de chemin Nécessité d'une gouvernance renforcée avec analyse statique obligatoire pour gérer la complexité du code généré Pourquoi j'ai abandonné Docker pour Podman https://codesmash.dev/why-i-ditched-docker-for-podman-and-you-should-too Problème Docker : Le daemon dockerd persistant s'exécute avec des privilèges root, posant des risques de sécurité (nombreuses CVEs citées) et consommant des ressources inutilement. Solution Podman : Sans Daemon : Pas de processus d'arrière-plan persistant. Les conteneurs s'exécutent comme des processus enfants de la commande Podman, sous les privilèges de l'utilisateur. Sécurité Renforcée : Réduction de la surface d'attaque. Une évasion de conteneur compromet un utilisateur non privilégié sur l'hôte, pas le système entier. Mode rootless. Fiabilité Accrue : Pas de point de défaillance unique ; le crash d'un conteneur n'affecte pas les autres. Moins de Ressources : Pas de daemon constamment actif, donc moins de mémoire et de CPU. Fonctionnalités Clés de Podman : Intégration Systemd : Génération automatique de fichiers d'unité systemd pour gérer les conteneurs comme des services Linux standards. Alignement Kubernetes : Support natif des pods et capacité à générer des fichiers Kubernetes YAML directement (podman generate kube), facilitant le développement local pour K8s. Philosophie Unix : Se concentre sur l'exécution des conteneurs, délègue les tâches spécialisées à des outils dédiés (ex: Buildah pour la construction d'images, Skopeo pour leur gestion). Migration Facile : CLI compatible Docker : podman utilise les mêmes commandes que docker (alias docker=podman fonctionne). Les Dockerfiles existants sont directement utilisables. Améliorations incluses : Sécurité par défaut (ports privilégiés en mode rootless), meilleure gestion des permissions de volume, API Docker compatible optionnelle. Option de convertir Docker Compose en Kubernetes YAML. Bénéfices en Production : Sécurité améliorée, utilisation plus propre des ressources. Podman représente une évolution plus sécurisée et mieux alignée avec les pratiques modernes de gestion Linux et de déploiement de conteneurs. Guide Pratique (Exemple FastAPI) : Le Dockerfile ne change pas. podman build et podman run remplacent directement les commandes Docker. Déploiement en production via Systemd. Gestion d'applications multi-services avec les “pods” Podman. Compatibilité Docker Compose via podman-compose ou kompose. Détection améliorée des APIs vulnérables dans les IDEs JetBrains et Qodana - https://blog.jetbrains.com/idea/2025/09/enhanced-vulnerable-api-detection-in-jetbrains-ides-and-qodana/ JetBrains s'associe avec Mend.io pour renforcer la sécurité du code dans leurs outils Le plugin Package Checker bénéficie de nouvelles données enrichies sur les APIs vulnérables Analyse des graphes d'appels pour couvrir plus de méthodes publiques des bibliothèques open-source Support de Java, Kotlin, C#, JavaScript, TypeScript et Python pour la détection de vulnérabilités Activation des inspections via Paramètres > Editor > Inspections en recherchant “Vulnerable API” Surlignage automatique des méthodes vulnérables avec détails des failles au survol Action contextuelle pour naviguer directement vers la déclaration de dépendance problématique Mise à jour automatique vers une version non affectée via Alt+Enter sur la dépendance Fenêtre dédiée “Vulnerable Dependencies” pour voir l'état global des vulnérabilités du projet Méthodologies Le retour de du sondage de Stack Overflow sur l'usage de l'IA dans le code https://medium.com/@amareshadak/stack-overflow-just-exposed-the-ugly-truth-about-ai-coding-tools-b4f7b5992191 84% des développeurs utilisent l'IA quotidiennement, mais 46% ne font pas confiance aux résultats. Seulement 3,1% font “hautement confiance” au code généré. 66% sont frustrés par les solutions IA “presque correctes”. 45% disent que déboguer le code IA prend plus de temps que l'écrire soi-même. Les développeurs seniors (10+ ans) font moins confiance à l'IA (2,6%) que les débutants (6,1%), créant un écart de connaissances dangereux. Les pays occidentaux montrent moins de confiance - Allemagne (22%), UK (23%), USA (28%) - que l'Inde (56%). Les créateurs d'outils IA leur font moins confiance. 77% des développeurs professionnels rejettent la programmation en langage naturel, seuls 12% l'utilisent réellement. Quand l'IA échoue, 75% se tournent vers les humains. 35% des visites Stack Overflow concernent maintenant des problèmes liés à l'IA. 69% rapportent des gains de productivité personnels, mais seulement 17% voient une amélioration de la collaboration d'équipe. Coûts cachés : temps de vérification, explication du code IA aux équipes, refactorisation et charge cognitive constante. Les plateformes humaines dominent encore : Stack Overflow (84%), GitHub (67%), YouTube (61%) pour résoudre les problèmes IA. L'avenir suggère un “développement augmenté” où l'IA devient un outil parmi d'autres, nécessitant transparence et gestion de l'incertitude. Mentorat open source et défis communautaires par les gens de Microcks https://microcks.io/blog/beyond-code-open-source-mentorship/ Microcks souffre du syndrome des “utilisateurs silencieux” qui bénéficient du projet sans contribuer Malgré des milliers de téléchargements et une adoption croissante, l'engagement communautaire reste faible Ce manque d'interaction crée des défis de durabilité et limite l'innovation du projet Les mainteneurs développent dans le vide sans feedback des vrais utilisateurs Contribuer ne nécessite pas de coder : documentation, partage d'expérience, signalement de bugs suffisent Parler du project qu'on aime autour de soi est aussi super utile Microcks a aussi des questions specifiques qu'ils ont posé dans le blog, donc si vous l'utilisez, aller voir Le succès de l'open source dépend de la transformation des utilisateurs en véritables partenaires communautaires c'est un point assez commun je trouve, le ratio parlant / silencieux est tres petit et cela encourage les quelques grandes gueules La modernisation du systemes legacy, c'est pas que de la tech https://blog.scottlogic.com/2025/08/27/holistic-approach-successful-legacy-modernisation.html Un artcile qui prend du recul sur la modernisation de systemes legacy Les projets de modernisation legacy nécessitent une vision holistique au-delà du simple focus technologique Les drivers business diffèrent des projets greenfield : réduction des coûts et mitigation des risques plutôt que génération de revenus L'état actuel est plus complexe à cartographier avec de nombreuses dépendances et risques de rupture Collaboration essentielle entre Architectes, Analystes Business et Designers UX dès la phase de découverte Approche tridimensionnelle obligatoire : Personnes, Processus et Technologie (comme un jeu d'échecs 3D) Le leadership doit créer l'espace nécessaire pour la découverte et la planification plutôt que presser l'équipe Communication en termes business plutôt que techniques vers tous les niveaux de l'organisation Planification préalable essentielle contrairement aux idées reçues sur l'agilité Séquencement optimal souvent non-évident et nécessitant une analyse approfondie des interdépendances Phases projet alignées sur les résultats business permettent l'agilité au sein de chaque phase Sécurité Cyber Attaque su Musée Histoire Naturelle https://www.franceinfo.fr/internet/securite-sur-internet/cyberattaques/le-museum-nati[…]e-d-une-cyberattaque-severe-une-plainte-deposee_7430356.html Compromission massive de packages npm populaires par un malware crypto https://www.aikido.dev/blog/npm-debug-and-chalk-packages-compromised 18 packages npm très populaires compromis le 8 septembre 2025, incluant chalk, debug, ansi-styles avec plus de 2 milliards de téléchargements hebdomadaires combinés duckdb s'est rajouté à la liste Code malveillant injecté qui intercepte silencieusement l'activité crypto et web3 dans les navigateurs des utilisateurs Le malware manipule les interactions de wallet et redirige les paiements vers des comptes contrôlés par l'attaquant sans signes évidents Injection dans les fonctions critiques comme fetch, XMLHttpRequest et APIs de wallets (window.ethereum, Solana) pour intercepter le trafic Détection et remplacement automatique des adresses crypto sur multiple blockchains (Ethereum, Bitcoin, Solana, Tron, Litecoin, Bitcoin Cash) Les transactions sont modifiées en arrière-plan même si l'interface utilisateur semble correcte et légitime Utilise des adresses “sosies” via correspondance de chaînes pour rendre les échanges moins évidents à détecter Le mainteneur compromis par email de phishing provenant du faux domaine “mailto:support@npmjs.help|support@npmjs.help” enregistré 3 jours avant l'attaque sur une demande de mise a jour de son autheotnfication a deux facteurs après un an Aikido a alerté le mainteneur via Bluesky qui a confirmé la compromission et commencé le nettoyage des packages Attaque sophistiquée opérant à plusieurs niveaux: contenu web, appels API et manipulation des signatures de transactions Les anti-cheats de jeux vidéo : une faille de sécurité majeure ? - https://tferdinand.net/jeux-video-et-si-votre-anti-cheat-etait-la-plus-grosse-faille/ Les anti-cheats modernes s'installent au Ring 0 (noyau système) avec privilèges maximaux Ils obtiennent le même niveau d'accès que les antivirus professionnels mais sans audit ni certification Certains exploitent Secure Boot pour se charger avant le système d'exploitation Risque de supply chain : le groupe APT41 a déjà compromis des jeux comme League of Legends Un attaquant infiltré pourrait désactiver les solutions de sécurité et rester invisible Menace de stabilité : une erreur peut empêcher le démarrage du système (référence CrowdStrike) Conflits possibles entre différents anti-cheats qui se bloquent mutuellement Surveillance en temps réel des données d'utilisation sous prétexte anti-triche Dérive dangereuse selon l'auteur : des entreprises de jeux accèdent au niveau EDR Alternatives limitées : cloud gaming ou sandboxing avec impact sur performances donc faites gaffe aux jeux que vos gamins installent ! Loi, société et organisation Luc Julia au Sénat - Monsieur Phi réagi et publie la vidéo Luc Julia au Sénat : autopsie d'un grand N'IMPORTE QUOI https://www.youtube.com/watch?v=e5kDHL-nnh4 En format podcast de 20 minutes, sorti au même moment et à propos de sa conf à Devoxx https://www.youtube.com/watch?v=Q0gvaIZz1dM Le lab IA - Jérôme Fortias - Et si Luc Julia avait raison https://www.youtube.com/watch?v=KScI5PkCIaE Luc Julia au Senat https://www.youtube.com/watch?v=UjBZaKcTeIY Luc Julia se défend https://www.youtube.com/watch?v=DZmxa7jJ8sI Intelligence artificielle : catastrophe imminente ? - Luc Julia vs Maxime Fournes https://www.youtube.com/watch?v=sCNqGt7yIjo Tech and Co Monsieur Phi vs Luc Julia (put a click) https://www.youtube.com/watch?v=xKeFsOceT44 La tronche en biais https://www.youtube.com/live/zFwLAOgY0Wc Conférences La liste des conférences provenant de Developers Conferences Agenda/List par Aurélie Vache et contributeurs : 12 septembre 2025 : Agile Pays Basque 2025 - Bidart (France) 15 septembre 2025 : Agile Tour Montpellier - Montpellier (France) 18-19 septembre 2025 : API Platform Conference - Lille (France) & Online 22-24 septembre 2025 : Kernel Recipes - Paris (France) 22-27 septembre 2025 : La Mélée Numérique - Toulouse (France) 23 septembre 2025 : OWASP AppSec France 2025 - Paris (France) 23-24 septembre 2025 : AI Engineer Paris - Paris (France) 25 septembre 2025 : Agile Game Toulouse - Toulouse (France) 25-26 septembre 2025 : Paris Web 2025 - Paris (France) 30 septembre 2025-1 octobre 2025 : PyData Paris 2025 - Paris (France) 2 octobre 2025 : Nantes Craft - Nantes (France) 2-3 octobre 2025 : Volcamp - Clermont-Ferrand (France) 3 octobre 2025 : DevFest Perros-Guirec 2025 - Perros-Guirec (France) 6-7 octobre 2025 : Swift Connection 2025 - Paris (France) 6-10 octobre 2025 : Devoxx Belgium - Antwerp (Belgium) 7 octobre 2025 : BSides Mulhouse - Mulhouse (France) 7-8 octobre 2025 : Agile en Seine - Issy-les-Moulineaux (France) 8-10 octobre 2025 : SIG 2025 - Paris (France) & Online 9 octobre 2025 : DevCon #25 : informatique quantique - Paris (France) 9-10 octobre 2025 : Forum PHP 2025 - Marne-la-Vallée (France) 9-10 octobre 2025 : EuroRust 2025 - Paris (France) 16 octobre 2025 : PlatformCon25 Live Day Paris - Paris (France) 16 octobre 2025 : Power 365 - 2025 - Lille (France) 16-17 octobre 2025 : DevFest Nantes - Nantes (France) 17 octobre 2025 : Sylius Con 2025 - Lyon (France) 17 octobre 2025 : ScalaIO 2025 - Paris (France) 17-19 octobre 2025 : OpenInfra Summit Europe - Paris (France) 20 octobre 2025 : Codeurs en Seine - Rouen (France) 23 octobre 2025 : Cloud Nord - Lille (France) 30-31 octobre 2025 : Agile Tour Bordeaux 2025 - Bordeaux (France) 30-31 octobre 2025 : Agile Tour Nantais 2025 - Nantes (France) 30 octobre 2025-2 novembre 2025 : PyConFR 2025 - Lyon (France) 4-7 novembre 2025 : NewCrafts 2025 - Paris (France) 5-6 novembre 2025 : Tech Show Paris - Paris (France) 5-6 novembre 2025 : Red Hat Summit: Connect Paris 2025 - Paris (France) 6 novembre 2025 : dotAI 2025 - Paris (France) 6 novembre 2025 : Agile Tour Aix-Marseille 2025 - Gardanne (France) 7 novembre 2025 : BDX I/O - Bordeaux (France) 12-14 novembre 2025 : Devoxx Morocco - Marrakech (Morocco) 13 novembre 2025 : DevFest Toulouse - Toulouse (France) 15-16 novembre 2025 : Capitole du Libre - Toulouse (France) 19 novembre 2025 : SREday Paris 2025 Q4 - Paris (France) 19-21 novembre 2025 : Agile Grenoble - Grenoble (France) 20 novembre 2025 : OVHcloud Summit - Paris (France) 21 novembre 2025 : DevFest Paris 2025 - Paris (France) 27 novembre 2025 : DevFest Strasbourg 2025 - Strasbourg (France) 28 novembre 2025 : DevFest Lyon - Lyon (France) 1-2 décembre 2025 : Tech Rocks Summit 2025 - Paris (France) 4-5 décembre 2025 : Agile Tour Rennes - Rennes (France) 5 décembre 2025 : DevFest Dijon 2025 - Dijon (France) 9-11 décembre 2025 : APIdays Paris - Paris (France) 9-11 décembre 2025 : Green IO Paris - Paris (France) 10-11 décembre 2025 : Devops REX - Paris (France) 10-11 décembre 2025 : Open Source Experience - Paris (France) 11 décembre 2025 : Normandie.ai 2025 - Rouen (France) 14-17 janvier 2026 : SnowCamp 2026 - Grenoble (France) 2-6 février 2026 : Web Days Convention - Aix-en-Provence (France) 3 février 2026 : Cloud Native Days France 2026 - Paris (France) 12-13 février 2026 : Touraine Tech #26 - Tours (France) 22-24 avril 2026 : Devoxx France 2026 - Paris (France) 23-25 avril 2026 : Devoxx Greece - Athens (Greece) 17 juin 2026 : Devoxx Poland - Krakow (Poland) 4 septembre 2026 : JUG SUmmer Camp 2026 - La Rochelle (France) Nous contacter Pour réagir à cet épisode, venez discuter sur le groupe Google https://groups.google.com/group/lescastcodeurs Contactez-nous via X/twitter https://twitter.com/lescastcodeurs ou Bluesky https://bsky.app/profile/lescastcodeurs.com Faire un crowdcast ou une crowdquestion Soutenez Les Cast Codeurs sur Patreon https://www.patreon.com/LesCastCodeurs Tous les épisodes et toutes les infos sur https://lescastcodeurs.com/

united states ai power google uk guide france action nature spring data blog ring code bitcoin collaboration ga charge option bananas jos ia exploration pas quand transformers faire rod analyse ils import blue sky api relaxation ethereum surveillance agile technologie phases parler gpt python cela tron toutes bump menace linux java github guillaume activation apis malgr aur jakarta probl num lam mus certains conception javascript mise moins exposition projet nano nouvel kafka llm mod gestion cpu gpu param troisi allemagne dns injection normandie mend solana vall katia personnes risque docker git loi seulement sortie sig lancement fen sdks aikido faible mises attaque simplification enregistr stack overflow approche veo litecoin ci cd utilise processus shenandoah capacit tendance paris france avantages senat permet mcp moteur typescript traitement annonc fonctionne google deepmind utilisation programmation markdown capitole exemples gouvernance linux foundation planification kotlin sql server owasp fonctionnement aper modifier jep jvm lts vache jetbrains hibernate github actions yaml faiblesse grpc a2a k8s contribuer cves mentorat impl devcon architectes spring boot secure boot assembler pattern matching gradle jdk lyon france huggingface podman systemd adk bordeaux france luc julia mrl histoire naturelle jpa rod johnson junit provence france devoxx toulouse france strasbourg france alpha1 oidc lille france codeurs dbcc dijon france devoxx france javadoc

Trust at Scale: Security and Governance for Open Source Models // Hudson Buzby // #338

MLOps.community

Play Episode Listen Later Sep 9, 2025 59:22

Trust at Scale: Security and Governance for Open Source Models // MLOps Podcast #338 with Hudson Buzby, Solutions Architect at JFrog.Appreciate JFrog for their support in bringing this blog to life.Join the Community: https://go.mlops.community/YTJoinInGet the newsletter: https://go.mlops.community/YTNewsletter// AbstractFor better or for worse, machine learning has traditionally escaped the gaze of security and infrastructure teams, operating outside traditional DevOps practices and not always adhering to organizations development or security standards. With the introduction of open source catalogs like HuggingFace and Ollama, a new standard has been established for locating, identifying, and deploying machine learning and AI models. But with this new standard comes a plethora of security, governance, and legal challenges that organizations need to address before they can comfortably allow developers to freely build and deploy ML/AI applications. In this conversation will discuss ways that enterprise scale organizations are addressing these challenges to safely and securely build these development environments. // BioHudson Buzby is a solution engineer with an emphasis on MLOps, LLMOps, Big Data, and Distributed Systems, leveraging his expertise to help organizations optimize their machine learning operations and large language model deployments. His role involves providing technical solutions and guidance to enhance the efficiency and effectiveness of AI-driven projects.// Related Linkshttps://www.youtube.com/channel/UCh2hNg76zo3d1qQqTWIQxDg~~~~~~~~ ✌️Connect With Us ✌️ ~~~~~~~Catch all episodes, blogs, newsletters, and more: https://go.mlops.community/TYExploreJoin our Slack community [https://go.mlops.community/slack]Follow us on X/Twitter [@mlopscommunity](https://x.com/mlopscommunity) or [LinkedIn](https://go.mlops.community/linkedin)] Sign up for the next meetup: [https://go.mlops.community/register]MLOps Swag/Merch: [https://shop.mlops.community/]Connect with Demetrios on LinkedIn: /dpbrinkmConnect with Hudson on LinkedIn: /hudson-buzby/

community trust ai security scale models slack governance big data open source devops solutions architect ml ai distributed systems demetrios huggingface buzby

腾讯混元翻译模型登顶HuggingFace全球热榜

网事头条｜听见新鲜事

Play Episode Listen Later Sep 7, 2025 0:23

huggingface

SANS Stormcast Friday, September 5th, 2025: Cloudflare Response to 1.1.1.1 Certificate; AI Modem Namespace Reuse; macOS Vulnerability Allowed Keychain Decryption

SANS Internet Stormcenter Daily Network/Cyber Security and Information Security Stormcast

Play Episode Listen Later Sep 5, 2025 8:18

Unauthorized Issuance of Certificate for 1.1.1.1 Cloudflare published a blog post with more details regarding the bad 1.1.1.1 certificate that was issued by Fina. https://blog.cloudflare.com/unauthorized-issuance-of-certificates-for-1-1-1-1/ AI Model Namespace Reuse Deleted accounts on Huggingface can be taken over by other entities unrelated to the original owner. https://unit42.paloaltonetworks.com/model-namespace-reuse/ macOS vulnerability allowed Keychain and iOS app decryption without a password Excessive entitlements for the gcore binary facilitated access to key material that was sufficient to access secrets stored in Apple s keychain. https://www.helpnetsecurity.com/2025/09/04/macos-gcore-vulnerability-cve-2025-24204/

Episode 56: DeepMind Just Dropped Gemma 270M... And Here's Why It Matters

Vanishing Gradients

Play Episode Listen Later Aug 14, 2025 45:40

While much of the AI world chases ever-larger models, Ravin Kumar (Google DeepMind) and his team build across the size spectrum, from billions of parameters down to this week's release: Gemma 270M, the smallest member yet of the Gemma 3 open-weight family. At just 270 million parameters, a quarter the size of Gemma 1B, it's designed for speed, efficiency, and fine-tuning. We explore what makes 270M special, where it fits alongside its billion-parameter siblings, and why you might reach for it in production even if you think “small” means “just for experiments.” We talk through: - Where 270M fits into the Gemma 3 lineup — and why it exists - On-device use cases where latency, privacy, and efficiency matter - How smaller models open up rapid, targeted fine-tuning - Running multiple models in parallel without heavyweight hardware - Why “small” models might drive the next big wave of AI adoption If you've ever wondered what you'd do with a model this size (or how to squeeze the most out of it) this episode will show you how small can punch far above its weight. LINKS Introducing Gemma 3 270M: The compact model for hyper-efficient AI (Google Developer Blog) (https://developers.googleblog.com/en/introducing-gemma-3-270m/) Full Model Fine-Tune Guide using Hugging Face Transformers (https://ai.google.dev/gemma/docs/core/huggingface_text_full_finetune) The Gemma 270M model on HuggingFace (https://huggingface.co/google/gemma-3-270m) The Gemma 270M model on Ollama (https://ollama.com/library/gemma3:270m) Building AI Agents with Gemma 3, a workshop with Ravin and Hugo (https://www.youtube.com/live/-IWstEStqok) (Code here (https://github.com/canyon289/ai_agent_basics)) From Images to Agents: Building and Evaluating Multimodal AI Workflows, a workshop with Ravin and Hugo (https://www.youtube.com/live/FNlM7lSt8Uk)(Code here (https://github.com/canyon289/ai_image_agent)) Evaluating AI Agents: From Demos to Dependability, an upcoming workshop with Ravin and Hugo (https://lu.ma/ezgny3dl) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Watch the podcast video on YouTube (https://youtu.be/VZDw6C2A_8E)

ai running machine learning dropped upcoming events software engineers genai data scientists deepmind luma ravin dependability 270m huggingface

921: AI Coding Roadmap for Newbies (And Skeptics)

Syntax - Tasty Web Development Treats

Play Episode Listen Later Jul 21, 2025 48:58

Scott and Wes break down how to code with and for AI; perfect for skeptics, beginners, and curious devs. They cover everything from Ghost Text and CLI agents to building your own AI-powered apps with embeddings, function calling, and multi-model workflows. Show Notes 00:00 Welcome to Syntax! 03:56 How to interface with AI. 04:07 IDE Ghost Text. 05:45 IDE Chat, Agents. 08:00 CLI Agents. Claude Code. Open Code. Gemini. 11:13 MCP Servers. Context7 14:47 GUI apps. v0. Bolt.new. Lovable. Windsurf. 19:07 Existing Chat app like ChatGPT. 22:37 Building things WITH AI. 23:32 Prompting. 26:53 Streaming VS not streaming. 28:14 Embeddings and Rag. 31:09 MCP Server. CJ's MCP Deep Dive. 32:36 Brought to you by Sentry.io. 33:25 Multi-model, multi-provider. 36:27 npm libs to use to code with AI. OpenAI SDK. AI SDK. Cloudflare Agents. Langchain. Local AI Tensorflow. Transformers.js. Huggingface. 44:12 Processes and exploring. Hit us up on Socials! Syntax: X Instagram Tiktok LinkedIn Threads Wes: X Instagram Tiktok LinkedIn Threads Scott: X Instagram Tiktok LinkedIn Threads Randy: X Instagram YouTube Threads

ИИ-МехаГитлер, Вайфу, Тян и Задача Двух Бабушек / Китайский опенсорс снова на коне /AIA Podcast #114

AIA Podcast

Play Episode Listen Later Jul 19, 2025 149:19

The PHP Podcast: 2025.07.17

php[podcast] episodes from php[architect]

Play Episode Listen Later Jul 18, 2025 61:04

This week on the PHP Podcast, Eric and John discuss Spec-driven Development with Kiro, JetBrains on Huggingface, Event Sourcing with Laravel Verbs, Automating your life with n8n, PHP Tek 2026 Website development using vibe coding, and more. Links from the show: Introducing Kiro – Kiro JetBrains (JetBrains) Verbs About Grokability – Snipe-IT Free open […] The post The PHP Podcast: 2025.07.17 appeared first on PHP Architect.

development automating spec kiro jetbrains event sourcing huggingface

Content Consolidation, AI Browsers, and Mixed Economic Signals

This Week Next Week

Play Episode Listen Later Jul 11, 2025 29:08

Hosts Kate and Jeff dive into everything from the U.S. economy to big changes in media and entertainment. They chat about the future of search, how AI is shaking up copyright and content creation, and even affordable robots for learning. It's all about how tech is reshaping industries and what it means for the future.00:00 - Introduction01:34 - US Economic Data - Unemployment trends, AI's role in jobs, and economic data insights.05:38 - Consumer Spending - Credit card debt, tariffs, and shifting consumer habits.07:50 - New Business Trends - Growth in business applications and manufacturing orders.10:14 - A&E Sale - What A&E's potential sale means for the cable industry.12:23 - Disney-ITV Partnership - A unique content-sharing deal between Disney and ITV.16:16 - F1 Media Rights - Apple's bid for F1 rights and changes in sports media.18:28 - AI Browsers and Cloudflare - AI-powered browsers and Cloudflare's move to block crawlers.21:35 - AI Copyright Cases - Court rulings on AI copyright and their impact on creators.24:47 - Robots in Education - Hugging Face's Ricci Mini robot and its potential in coding education.27:58 - What's Next - Upcoming earnings reports, CPI data, and next week's highlights.LinksCloudflare blog: https://www.cloudflare.com/press-releases/2025/cloudflare-just-changed-how-ai-crawlers-scrape-the-internet-at-large/ Huggingface robot: https://huggingface.co/blog/reachy-mini

ai disney robots economic mixed f1 signals itv cpi browsers consolidation cloudflare huggingface

The top AI news from the past week, every ThursdAI

Play Episode Listen Later Jul 11, 2025 109:46

Hey everyone, Alex hereDon't you just love "new top LLM" drop weeks? I sure do! This week, we had a watch party for Grok-4, with over 20K tuning in to watch together, as the folks at XAI unveiled their newest and best model around. Two models in fact, Grok-4 and Grok-4 Heavy. We also had a very big open source week, we had the pleasure to chat with the creators of 3 open source models on the show, first with Elie from HuggingFace who just released SmoLM3, then with our friend Maxime Labonne who together with Liquid released a beautiful series of tiny on device models. Finally we had a chat with folks from Reka AI, and as they were on stage, someone in their org published a new open source Reka Flash model

En fait, les LLMs ne stagnent pas du tout — Grégoire Mialon (Meta) & Clémentine Fourrier (HuggingFace)

Underscore_

Play Episode Listen Later Jul 3, 2025 36:11

Deux chercheurs présentent GAIA, un benchmark qui évalue la capacité des IA à mener des recherches complexes et à raisonner étape par étape. On explore ce que les “thinking models” et l'usage d'outils (web, PDF, images) changent vraiment, et pourquoi cela débloque des résultats concrets. Ils détaillent aussi la “sauce secrète” derrière Deep Research d'OpenAI et comparent ces approches aux autres méthodes du marché.Sources Article central Autre ressource intéressanteEn plateau Michaël de Marliave — animateur Grégoire Mialon — invité Clémentine Fourrier — invité Hébergé par Acast. Visitez acast.com/privacy pour plus d'informations.

acast tout ia ils gaia micha visitez autre huggingface

Эмоции у ИИ и бои РОБОТОВ! / Anthropic против OpenAI, новый Gemini и закрытие Arc / AIA Podcast #112

AIA Podcast

Play Episode Listen Later Jun 8, 2025 142:16

spotify chatgpt reddit windows openai gemini arc playground flux trae anthropic manus conversational ai tts elevenlabs windsurf huggingface

The top AI news from the past week, every ThursdAI

Play Episode Listen Later May 16, 2025 88:56

Hey yall, this is Alex

The Rise of AI for Serious Sellers — Insights from Danny & Dorian I Part 4

Seller Sessions

Play Episode Listen Later May 12, 2025 25:13

In this high-impact episode of Seller Sessions, Danny McMillan is joined by Dorian Gorski for a no-fluff exploration of how AI is shifting the Amazon ecosystem. The conversation orbits around a powerful new tool called "Manus" — an AI-driven platform built to go beyond surface-level product research and tap into rich demographic insights, customer motivations, and actionable listing data.

amazon ai cutting solve sellers gpt poe manus business solutions huggingface danny mcmillan seller sessions

Первый ИИ в суде и в ТАНКЕ! / Gemini 2.5 Pro I/O, Qwen 3 и никакой Атаки Титанов / AIA Podcast #110

AIA Podcast

Play Episode Listen Later May 11, 2025 182:55

spotify ai microsoft model chatgpt openai gemini genie ernie alibaba tencent grok phi deepmind google gemini mistral suno meta ai future house windsurf huggingface

HuggingFace Buys Pollen Robotics, DHH & Bezos Founder Advice & a JCal Origin Story | E2111

This Week in Startups

Play Episode Listen Later Apr 15, 2025 63:11

Today's show: In this episode, Jason, Alex, and Lon dive into Blue Origin's all-female celeb spaceflight (yes, Katy Perry sang on reentry), Hugging Face's unexpected move into robotics, and Jack Dorsey's wild take that we should “delete all IP law.” Plus, they break down Figure AI's eye-popping $39B valuation, the risks of SPVs, and what founders and investors can learn from the SPAC boom. As Jason puts it: “You just have to assume an 80% failure rate.”*Timestamps:(0:00) Jason kicks off the show!(1:34) Blue Origin all-female crew launch and space tourism(7:17) Emerging technologies and tech adoption trends(10:07) Northwest Registered Agent. Form your entire business identity in just 10 clicks and 10 minutes. Get more privacy, more options, and more done—visit https://www.northwestregisteredagent.com/twist today!(12:38) Hugging Face acquires Pollen Robotics; Open AI and robotics debate(19:42) Squarespace - Use offer code TWIST to save 10% off your first purchase of a website or domain at https://www.Squarespace.com/TWIST(20:42) Significance of Hugging Face in generative AI; Jack Dorsey's IP law stance(25:01) U.S. high-tech job market; revisiting IP law discussions(30:03) Lemon.io - Get 15% off your first 4 weeks of developer time at https://Lemon.io/twist(31:03) IP law and American innovation(32:22) Challenges in startup exits and secondary trading platforms(37:09) Figure AI's valuation controversy(46:37) Startup insights and investing perspectives(50:39) Jeff Bezos on risk assessment(57:03) Jason's personal journey and reflections(1:02:06) Developing a samurai mindset; societal systems abstraction*Subscribe to the TWiST500 newsletter: https://ticker.thisweekinstartups.comCheck out the TWIST500: https://www.twist500.comSubscribe to This Week in Startups on Apple: https://rb.gy/v19fcp*Follow Lon:X: https://x.com/lons*Follow Alex:X: https://x.com/alexLinkedIn: ⁠https://www.linkedin.com/in/alexwilhelmFollow Jason:X: https://twitter.com/JasonLinkedIn: https://www.linkedin.com/in/jasoncalacanisThank you to our partners:(10:07) Northwest Registered Agent. Form your entire business identity in just 10 clicks and 10 minutes. Get more privacy, more options, and more done—visit https://www.northwestregisteredagent.com/twist today!(19:42) Squarespace - Use offer code TWIST to save 10% off your first purchase of a website or domain at https://www.Squarespace.com/TWIST(30:03) Lemon.io - Get 15% off your first 4 weeks of developer time at https://Lemon.io/twistGreat TWIST interviews: Will Guidara, Eoghan McCabe, Steve Huffman, Brian Chesky, Bob Moesta, Aaron Levie, Sophia Amoruso, Reid Hoffman, Frank Slootman, Billy McFarlandCheck out Jason's suite of newsletters: https://substack.com/@calacanisFollow TWiST:Twitter: https://twitter.com/TWiStartupsYouTube: https://www.youtube.com/thisweekinInstagram: https://www.instagram.com/thisweekinstartupsTikTok: https://www.tiktok.com/@thisweekinstartupsSubstack: https://twistartups.substack.com*Subscribe to the Founder University Podcast: https://www.youtube.com/@founderuniversity1916

Relocalisation : Nvidia fait le pari américain – 15/04

Tech&Co

Play Episode Listen Later Apr 15, 2025 29:10

Mardi 15 avril, François Sorel a reçu Frédéric Simottel, journaliste BFM Business, Thomas Serval, PDG de Baracoda, ainsi que Christophe Aulnette, senior advisor chez Seven2 et ancien président de Microsoft France et Asie du Sud. Ils ont parlé de Nvidia qui va construire pour 500 milliards de dollars de serveurs aux États-Unis, de l'atmosphère et l'ambiance dans la Silicon Valley actuellement, du projet d'OpenAI de développer un réseau social, et du rachat par Hugging face de Pollen Robotics, dans l'émission Tech & Co, la quotidienne, sur BFM Business. Retrouvez l'émission du lundi au jeudi et réécoutez-la en podcast.

Actionable AI for Marketers – The Human in the Loop With Britney Muller

Up Arrow Podcast

Play Episode Listen Later Apr 1, 2025 74:13

Britney Muller is an AI consultant and keynote speaker advising tech companies on AI strategies, machine learning, and workflow automation. With over 10 years of experience in generative AI, she has developed over a dozen in-house AI applications. Britney was the former Marketing Manager at Hugging Face, where she launched the largest open-source, multilingual model. In this episode… AI is changing the game for marketers, but many don't leverage it to its fullest potential for their businesses. Rather than producing AI-driven content at a faster rate, companies should focus on building a strong brand presence and leveraging AI for intentional automation. How can marketers cut through the noise, avoid common pitfalls, and harness AI to drive measurable results? While AI can be used for pattern recognition, automation, and audience research, AI optimizer Britney Muller warns against relying on it for fact-based decision-making. You can leverage AI without losing the critical human element by transforming website content into vector embeddings. Analyzing these embeddings allows marketers to identify content clusters, uncover gaps in their website's information architecture, and optimize internal linking structures to improve search engine rankings. Britney also recommends utilizing Reddit APIs to extract real-time customer sentiments, uncover trending pain points, and analyze top-performing content in specific communities. In this week's episode of the Up Arrow Podcast, William Harris chats with AI consultant Britney Muller about practical AI strategies for marketers. Britney explains why brand mentions are the new backlinks, how to build AI-powered internal tools, and the ethical concerns marketers should consider when adopting AI.

ai analyzing loop marketers actionable marketing managers muller william harris huggingface

[EN] ByteSized RSE: AI assisted coding - with Liam (Jianliang) Gao

Code for Thought

Play Episode Listen Later Mar 31, 2025 34:24

English Edition: In this last episode for the ByteSized RSE "miniseries" we talk about AI assisted coding - and the (long) history how engineers tried to come up with assisting tools to make our code better and more robust. My guest is Liam Gao from Imperial College, London, UK. Links:https://github.com/features/copilot GitHub Co-Pilothttps://huggingface.co HuggingFace another AI toolhttps://spacelift.io/blog/ai-coding-assistant-tools a summary of current tools (non exhaustive)https://platform.openai.com/docs/guides/prompt-engineering OpenAI's take on prompt engineeringhttps://www.promptingguide.aihttps://web.archive.org/web/20121022091418/http://www.stanford.edu/~learnest/spelling.pdf some of the attempts to come up with spelling checkshttps://en.wikipedia.org/wiki/Code_completionhttps://www.gnu.org/software/emacs/ Good old Emacshttps://en.wikipedia.org/wiki/Vi_(text_editor) vi editor (not for the faint hearted)https://winworldpc.com/product/turbo-pascal/7x Borland's Turbo Pascal with IDEhttps://survey.stackoverflow.co/2024/ Stackoverflow survey from 2024 with ca 65000 respondents And here the YouTube clips mentionedhttps://www.youtube.com/watch?v=MvEXkd3O2ow Cypher musing why he didn't take the "blue pill"https://www.youtube.com/watch?v=L0mRMp2kbQY Star Trek TNG, S3E6 - Geordie LaForge talking to the computerGet in touchThank you for listening! Merci de votre écoute! Vielen Dank für´s Zuhören! Contact Details/ Coordonnées / Kontakt: Email mailto:peter@code4thought.org UK RSE Slack (ukrse.slack.com): @code4thought or @piddie US RSE Slack (usrse.slack.com): @Peter Schmidt Mastodon: https://fosstodon.org/@code4thought or @code4thought@fosstodon.org Bluesky: https://bsky.app/profile/code4thought.bsky.social LinkedIn: https://www.linkedin.com/in/pweschmidt/ (personal Profile)LinkedIn: https://www.linkedin.com/company/codeforthought/ (Code for Thought Profile) This podcast is licensed under the Creative Commons Licence: https://creativecommons.org/licenses/by-sa/4.0/

ai uk code openai blue sky coding vielen dank assisted government accountability office imperial college stack overflow borland huggingface bytesized turbo pascal

208.开源VS闭源：谁将变成AI的主流？DeepSeek安卓时刻后的竞争、套壳和商业化

乱翻书

Play Episode Listen Later Mar 11, 2025 92:05

【本期嘉宾】王铁震、石洪竺、郭炜王铁震（Huggingface高级工程师）石洪竺（魔搭社区运营负责人）郭炜（白鲸开源 CEO Apache基金会成员）主播：潘乱（「乱翻书」主理人）【时间线】开源与闭源02:12 闭源和开源的概念科普04:55 开源模型和开源代码的区别07:40 开源模型，开源了哪些东西，是“真”开源么？08:14 权重开源10:51 开源最后的核心竞争力是什么？12:40 阿里为什么会选择开源路径，阿里的开源之路是如何走来的？14:51 从闭源的OpenAI，到开源的Llama、Qwen崛起，这期间业界经历了什么？19:18 开发者角度看到的开源力量23:01 开源模型与闭源模型在技术迭代速度上有何差异？26:47 最牛的工程师们，更在意技术品牌的什么？“DeepSeek时刻”之后31:40 面对DeepSeek的开源，OpenAI是否会调整自己的开源策略？36:23 如果OpenAI被迫开源O3，DeepSeek的先发优势是否就没有了？39:39 DeepSeek开源之后，AI小公司闭源是否还有生存空间？46:07 「大模型自己还没出现幻觉，大模型使用者先自己出现幻觉了」47:07 开源时代是否也是渠道为王？49:09 DeepSeek出现后，Huggingface社区里有哪些讨论？51:53 魔搭社区关于DeepSeek的讨论55:10 如何界定“模型影响力”这一指标？开源社区和商业化58:36 开源大模型如何平衡"技术普惠"与"商业化盈利"？59:08 DeepSeek可能有哪些比较可行的商业模式？63:57 Huggingface作为开源模型社区，目前的商业化模式是怎样的？64:49 魔搭社区作为阿里系的平台，是否会平等的对待内外部模型？开源生态69:23 DeepSeek开源之后，套壳好像不再是问题了？73:41 当年QQ也是QICQ的套壳75:04 Agent开发门槛已大幅降低，套壳产品护城河又在哪里？78:17 大模型如何重构开源生态？81:36 DeepSeek会构建端侧轻量化的生态吗？85:25 怎么看待开源和闭源的关系，开源和闭源各自的上限会在哪里？哪个更有未来？【开场&结尾音乐】开场音乐：Richard Stallman - 《自由软件之歌》（The Free Software Song）结尾音乐：虞霞/李小龙 - 侠客行（电视剧《武林外传》片尾曲TV Verison）《自由软件之歌》（英语：The Free Software Song）是由自由软件基金会主席理查德·斯托曼（Richard Stallman）作词的自由软件宣传曲，它采用保加利亚民歌《Sadi Moma》的旋律。目的是要鼓励程序员与大众分享其源代码及推动软件自由化。歌词片段：Join us now and share the software;You'll be free, hackers, you'll be free.……【关于「乱翻书」】「乱翻书」是一档关注商业、科技和互联网的圆桌对话节目。关心How和Why，以及少有人注意到的What。内容主要方向是科技考古、行业观察和前沿思考，研究公司的兴衰循环，希望能够为你带来信息增量。「乱翻书」主理人是潘乱，代表作品有《腾讯没有梦想》、字节跳动/快手早期关键节点的系列特写。【关于主播】视频号/即刻/小红书：潘乱公众号/播客：乱翻书【图】直播截图微信公众号：乱翻书视频号：潘乱商业合作：联系微信 tongxing717本期编辑：怀杭

agent huggingface how why what

Ce bras open source à 110 € va bouleverser la robotique — Rémi Cadene (HuggingFace)

Underscore_

Play Episode Listen Later Mar 11, 2025 32:25

Rémi Cadene, ex‑ingénieur impliqué dans l'humanoïde Optimus chez Tesla, explique pourquoi la robotique n'a pas encore vécu son moment « ChatGPT ». Il détaille le projet LeRobot chez Hugging Face et la conception d'un bras robotique open source à 110 €, imprimé en 3D et actionné au fil de pêche, pensé pour produire enfin les données d'entraînement qui manquent aux robots. Démo à l'appui, on voit comment matériel accessible et jeux de données ouverts peuvent accélérer l'apprentissage de tâches du quotidien et bousculer les acteurs établis.Sources Projet LeRobot (GitHub)En plateau Michaël de Marliave — animateur Tiffany Souterre — chroniqueuse Rémi Cadene — invité Hébergé par Acast. Visitez acast.com/privacy pour plus d'informations.

3d chatgpt tesla acast open source bras micha visitez optimus robotique huggingface

EDyO 96 - Fosdem 2025

Entre Dev y Ops Podcast

Play Episode Listen Later Mar 5, 2025

En el episodio 96 del podcast de Entre Dev y Ops hablaremos del veinticinco aniversario de la FOSDEM. Blog Entre Dev y Ops - https://www.entredevyops.es Telegram Entre Dev y Ops - https://t.me/entredevyops Twitter Entre Dev y Ops - https://twitter.com/entredevyops LinkedIn Entre Dev y Ops - https://www.linkedin.com/company/entredevyops/ Patreon Entre Dev y Ops - https://www.patreon.com/edyo Amazon Entre Dev y Ops - https://amzn.to/2HrlmRw Enlaces comentados: Fosdem 2025 - https://fosdem.org/2025/ Fosdem Treasure Hunt - https://fosdem.org/2025/news/2025-01-30-treasure-hunt/ Curl - https://curl.se/ Luanti (formerly Minetest) - https://www.luanti.org/ 0 A.D. - https://play0ad.com/ The Battle for Wesnoth - https://www.wesnoth.org Charla optimización JavaScript - https://fosdem.org/2025/schedule/event/fosdem-2025-4391-how-to-lose-weight-optimising-memory-usage-in-javascript-and-beyond/ Charla DuckDB y graph queries - https://fosdem.org/2025/schedule/event/fosdem-2025-4135-empowering-data-analytics-high-performance-graph-queries-in-duckdb-with-duckpgq/ Charla segundo cerebro - https://fosdem.org/2025/schedule/event/fosdem-2025-6542-building-your-local-llm-second-brain/ Charla ecosistema Huggingface - https://fosdem.org/2025/schedule/event/fosdem-2025-6341-hugging-face-ecosystem-for-local-ai-ml/ DuckDB - https://duckdb.org DuckDB Con en Amsterdam - https://duckdb.org/events/2025/01/31/duckcon6/ Charla Leslie Lamport - https://fosdem.org/2025/schedule/event/fosdem-2025-4941-was-leslie-lamport-right-/ Paper sobre consistencia - https://www.scs.stanford.edu/17au-cs244b/labs/projects/clow_jiang.pdf immich - https://immich.app/ FuriLabs - https://furilabs.com/ TinyGo - https://tinygo.org Gopher Badge - https://gopherbadge.com/ FastHMTL - https://fastht.ml/ Contexto de FastHTML para LLMs - https://docs.fastht.ml/llms-ctx.txt Xwiki - https://www.xwiki.org EL BOLI de la discordia - https://www.amazon.com/Tactical-Multi-Tool-Utility-Screwdriver-Touchscreen/dp/B0BGQXVCFD

battle amsterdam javascript ops contexto curl fosdem huggingface duckdb wesnoth minetest tinygo

S2E6 - Chutes.ai Subnet 64 w/ Namoray and Jon Durbin

Bittensor Guru

Play Episode Listen Later Jan 13, 2025 70:28

What do you get when Bittensor's most talented subnet builder and a prolific miner combine forces? Subnet 64 Chutes. Built for dtao with optimized efficiency on every level, monetization included through micropayments in TAO, a host of groundbreaking security features for miners, a rewrite of validation methodology to lower expenses and a front end so smooth you don't need to know squat to kick the tires on all of Huggingface...this was such a treat. I love these dudes, what assets to have involved in Bittensor. Head to chutes.ai to use the future now. PS. More Chutes apps coming soon. Try out text-to-voice now and EASY Agents deployed with no coding shortly. While you wait you can deploy ANY custom code now, for free. https://chutes.ai/ https://x.com/namoray_dev https://x.com/jon_durbin https://x.com/KeithSingery https://bittensor.guru https://taostats.io/validators/bittensor-guru-podcast/ https://bittensor.com

head ai network built ps ethereum incentives tao btc decentralized eth chutes artifical intelligence rayon huggingface bittensor subnet jon durbin

2024 in Synthetic Data and Smol Models [LS Live @ NeurIPS]

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Dec 24, 2024 28:36

Happy holidays! We'll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all all our LS supporters who helped fund the gorgeous venue and A/V production!For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver. Today, we're proud to share Loubna's highly anticipated talk (slides here)!Synthetic DataWe called out the Synthetic Data debate at last year's NeurIPS, and no surprise that 2024 was dominated by the rise of synthetic data everywhere:* Apple's Rephrasing the Web, Microsoft's Phi 2-4 and Orca/AgentInstruct, Tencent's Billion Persona dataset, DCLM, and HuggingFace's FineWeb-Edu, and Loubna's own Cosmopedia extended the ideas of synthetic textbook and agent generation to improve raw web scrape dataset quality* This year we also talked to the IDEFICS/OBELICS team at HuggingFace who released WebSight this year, the first work on code-vs-images synthetic data.* We called Llama 3.1 the Synthetic Data Model for its extensive use (and documentation!) of synthetic data in its pipeline, as well as its permissive license. * Nemotron CC and Nemotron-4-340B also made a big splash this year for how they used 20k items of human data to synthesize over 98% of the data used for SFT/PFT.* Cohere introduced Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress observing gains of up to 56.5% improvement in win rates comparing multiple teachers vs the single best teacher model* In post training, AI2's Tülu3 (discussed by Luca in our Open Models talk) and Loubna's Smol Talk were also notable open releases this year.This comes in the face of a lot of scrutiny and criticism, with Scale AI as one of the leading voices publishing AI models collapse when trained on recursively generated data in Nature magazine bringing mainstream concerns to the potential downsides of poor quality syndata:Part of the concerns we highlighted last year on low-background tokens are coming to bear: ChatGPT contaminated data is spiking in every possible metric:But perhaps, if Sakana's AI Scientist pans out this year, we will have mostly-AI AI researchers publishing AI research anyway so do we really care as long as the ideas can be verified to be correct?Smol ModelsMeta surprised many folks this year by not just aggressively updating Llama 3 and adding multimodality, but also adding a new series of “small” 1B and 3B “on device” models this year, even working on quantized numerics collaborations with Qualcomm, Mediatek, and Arm. It is near unbelievable that a 1B model today can qualitatively match a 13B model of last year:and the minimum size to hit a given MMLU bar has come down roughly 10x in the last year. We have been tracking this proxied by Lmsys Elo and inference price:The key reads this year are:* MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases* Apple Intelligence Foundation Language Models* Hymba: A Hybrid-head Architecture for Small Language Models* Loubna's SmolLM and SmolLM2: a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters on the pareto efficiency frontier.* and Moondream, which we already covered in the 2024 in Vision talkFull Talk on YouTubeplease like and subscribe!Timestamps* [00:00:05] Loubna Intro* [00:00:33] The Rise of Synthetic Data Everywhere* [00:02:57] Model Collapse* [00:05:14] Phi, FineWeb, Cosmopedia - Synthetic Textbooks* [00:12:36] DCLM, Nemotron-CC* [00:13:28] Post Training - AI2 Tulu, Smol Talk, Cohere Multilingual Arbitrage* [00:16:17] Smol Models* [00:18:24] On Device Models* [00:22:45] Smol Vision Models* [00:25:14] What's NextTranscript2024 in Synthetic Data and Smol Models[00:00:00] [00:00:05] Loubna Intro[00:00:05] Speaker: I'm very happy to be here. Thank you for the invitation. So I'm going to be talking about synthetic data in 2024. And then I'm going to be talking about small on device models. So I think the most interesting thing about synthetic data this year is that like now we have it everywhere in the large language models pipeline.[00:00:33] The Rise of Synthetic Data Everywhere[00:00:33] Speaker: I think initially, synthetic data was mainly used just for post training, because naturally that's the part where we needed human annotators. And then after that, we realized that we don't really have good benchmarks to [00:01:00] measure if models follow instructions well, if they are creative enough, or if they are chatty enough, so we also started using LLMs as judges.[00:01:08] Speaker: Thank you. And I think this year and towards the end of last year, we also went to the pre training parts and we started generating synthetic data for pre training to kind of replace some parts of the web. And the motivation behind that is that you have a lot of control over synthetic data. You can control your prompt and basically also the kind of data that you generate.[00:01:28] Speaker: So instead of just trying to filter the web, you could try to get the LLM to generate what you think the best web pages could look like and then train your models on that. So this is how we went from not having synthetic data at all in the LLM pipeline to having it everywhere. And so the cool thing is like today you can train an LLM with like an entirely synthetic pipeline.[00:01:49] Speaker: For example, you can use our Cosmopedia datasets and you can train a 1B model on like 150 billion tokens that are 100 percent synthetic. And those are also of good quality. And then you can [00:02:00] instruction tune the model on a synthetic SFT dataset. You can also do DPO on a synthetic dataset. And then to evaluate if the model is good, you can use.[00:02:07] Speaker: A benchmark that uses LLMs as a judge, for example, MTBench or AlpacaEvil. So I think this is like a really mind blowing because like just a few years ago, we wouldn't think this is possible. And I think there's a lot of concerns about model collapse, and I'm going to talk about that later. But we'll see that like, if we use synthetic data properly and we curate it carefully, that shouldn't happen.[00:02:29] Speaker: And the reason synthetic data is very popular right now is that we have really strong models, both open and closed. It is really cheap and fast to use compared to human annotations, which cost a lot and take a lot of time. And also for open models right now, we have some really good inference frameworks.[00:02:47] Speaker: So if you have enough GPUs, it's really easy to spawn these GPUs and generate like a lot of synthetic data. Some examples are VLM, TGI, and TensorRT.[00:02:57] Model Collapse[00:02:57] Speaker: Now let's talk about the elephant in the room, model [00:03:00] collapse. Is this the end? If you look at the media and all of like, for example, some papers in nature, it's really scary because there's a lot of synthetic data out there in the web.[00:03:09] Speaker: And naturally we train on the web. So we're going to be training a lot of synthetic data. And if model collapse is going to happen, we should really try to take that seriously. And the other issue is that, as I said, we think, a lot of people think the web is polluted because there's a lot of synthetic data.[00:03:24] Speaker: And for example, when we're building fine web datasets here at Guillerm and Hinek, we're interested in like, how much synthetic data is there in the web? So there isn't really a method to properly measure the amount of synthetic data or to save a webpage synthetic or not. But one thing we can do is to try to look for like proxy words, for example, expressions like as a large language model or words like delve that we know are actually generated by chat GPT.[00:03:49] Speaker: We could try to measure the amount of these words in our data system and compare them to the previous years. For example, here, we measured like a, these words ratio in different dumps of common crawl. [00:04:00] And we can see that like the ratio really increased after chat GPT's release. So if we were to say that synthetic data amount didn't change, you would expect this ratio to stay constant, which is not the case.[00:04:11] Speaker: So there's a lot of synthetic data probably on the web, but does this really make models worse? So what we did is we trained different models on these different dumps. And we then computed their performance on popular, like, NLP benchmarks, and then we computed the aggregated score. And surprisingly, you can see that the latest DOMs are actually even better than the DOMs that are before.[00:04:31] Speaker: So if there's some synthetic data there, at least it did not make the model's worse. Yeah, which is really encouraging. So personally, I wouldn't say the web is positive with Synthetic Data. Maybe it's even making it more rich. And the issue with like model collapse is that, for example, those studies, they were done at like a small scale, and you would ask the model to complete, for example, a Wikipedia paragraph, and then you would train it on these new generations, and you would do that every day.[00:04:56] Speaker: iteratively. I think if you do that approach, it's normal to [00:05:00] observe this kind of behavior because the quality is going to be worse because the model is already small. And then if you train it just on its generations, you shouldn't expect it to become better. But what we're really doing here is that we take a model that is very large and we try to distill its knowledge into a model that is smaller.[00:05:14] Phi, FineWeb, Cosmopedia - Synthetic Textbooks[00:05:14] Speaker: And in this way, you can expect to get like a better performance for your small model. And using synthetic data for pre-training has become really popular. After the textbooks are all you need papers where Microsoft basically trained a series of small models on textbooks that were using a large LLM.[00:05:32] Speaker: And then they found that these models were actually better than models that are much larger. So this was really interesting. It was like first of its time, but it was also met with a lot of skepticism, which is a good thing in research. It pushes you to question things because the dataset that they trained on was not public, so people were not really sure if these models are really good or maybe there's just some data contamination.[00:05:55] Speaker: So it was really hard to check if you just have the weights of the models. [00:06:00] And as Hugging Face, because we like open source, we tried to reproduce what they did. So this is our Cosmopedia dataset. We basically tried to follow a similar approach to what they documented in the paper. And we created a synthetic dataset of textbooks and blog posts and stories that had almost 30 billion tokens.[00:06:16] Speaker: And we tried to train some models on that. And we found that like the key ingredient to getting a good data set that is synthetic is trying as much as possible to keep it diverse. Because if you just throw the same prompts as your model, like generate like a textbook about linear algebra, and even if you change the temperature, the textbooks are going to look alike.[00:06:35] Speaker: So there's no way you could scale to like millions of samples. And the way you do that is by creating prompts that have some seeds that make them diverse. In our case, the prompt, we would ask the model to generate a textbook, but make it related to an extract from a webpage. And also we try to frame it within, to stay within topic.[00:06:55] Speaker: For example, here, we put like an extract about cardiovascular bioimaging, [00:07:00] and then we ask the model to generate a textbook related to medicine that is also related to this webpage. And this is a really nice approach because there's so many webpages out there. So you can. Be sure that your generation is not going to be diverse when you change the seed example.[00:07:16] Speaker: One thing that's challenging with this is that you want the seed samples to be related to your topics. So we use like a search tool to try to go all of fine web datasets. And then we also do a lot of experiments with the type of generations we want the model to generate. For example, we ask it for textbooks for middle school students or textbook for college.[00:07:40] Speaker: And we found that like some generation styles help on some specific benchmarks, while others help on other benchmarks. For example, college textbooks are really good for MMLU, while middle school textbooks are good for benchmarks like OpenBookQA and Pico. This is like a sample from like our search tool.[00:07:56] Speaker: For example, you have a top category, which is a topic, and then you have some [00:08:00] subtopics, and then you have the topic hits, which are basically the web pages in fine web does belong to these topics. And here you can see the comparison between Cosmopedia. We had two versions V1 and V2 in blue and red, and you can see the comparison to fine web, and as you can see throughout the training training on Cosmopedia was consistently better.[00:08:20] Speaker: So we managed to get a data set that was actually good to train these models on. It's of course so much smaller than FineWeb, it's only 30 billion tokens, but that's the scale that Microsoft data sets was, so we kind of managed to reproduce a bit what they did. And the data set is public, so everyone can go there, check if everything is all right.[00:08:38] Speaker: And now this is a recent paper from NVIDIA, Neumatron CC. They took things a bit further, and they generated not a few billion tokens, but 1. 9 trillion tokens, which is huge. And we can see later how they did that. It's more of, like, rephrasing the web. So we can see today that there's, like, some really huge synthetic datasets out there, and they're public, so, [00:09:00] like, you can try to filter them even further if you want to get, like, more high quality corpses.[00:09:04] Speaker: So for this, rephrasing the web this approach was suggested in this paper by Pratyush, where basically in this paper, they take some samples from C4 datasets, and then they use an LLM to rewrite these samples into a better format. For example, they ask an LLM to rewrite the sample into a Wikipedia passage or into a Q& A page.[00:09:25] Speaker: And the interesting thing in this approach is that you can use a model that is Small because it doesn't, rewriting doesn't require knowledge. It's just rewriting a page into a different style. So the model doesn't need to have like knowledge that is like extensive of what is rewriting compared to just asking a model to generate a new textbook and not giving it like ground truth.[00:09:45] Speaker: So here they rewrite some samples from C4 into Q& A, into Wikipedia, and they find that doing this works better than training just on C4. And so what they did in Nemo Trans CC is a similar approach. [00:10:00] They rewrite some pages from Common Crawl for two reasons. One is to, like improve Pages that are low quality, so they rewrite them into, for example, Wikipedia page, so they look better.[00:10:11] Speaker: And another reason is to create more diverse datasets. So they have a dataset that they already heavily filtered, and then they take these pages that are already high quality, and they ask the model to rewrite them in Question and Answer format. into like open ended questions or like multi choice questions.[00:10:27] Speaker: So this way they can reuse the same page multiple times without fearing like having multiple duplicates, because it's the same information, but it's going to be written differently. So I think that's also a really interesting approach for like generating synthetic data just by rephrasing the pages that you already have.[00:10:44] Speaker: There's also this approach called Prox where they try to start from a web page and then they generate a program which finds how to write that page to make it better and less noisy. For example, here you can see that there's some leftover metadata in the web page and you don't necessarily want to keep that for training [00:11:00] your model.[00:11:00] Speaker: So So they train a model that can generate programs that can like normalize and remove lines that are extra. So I think this approach is also interesting, but it's maybe less scalable than the approaches that I presented before. So that was it for like rephrasing and generating new textbooks.[00:11:17] Speaker: Another approach that I think is really good and becoming really popular for using synthetic data for pre training is basically building a better classifiers. For filtering the web for example, here we release the data sets called fine web edu. And the way we built it is by taking Llama3 and asking it to rate the educational content of web pages from zero to five.[00:11:39] Speaker: So for example, if a page is like a really good textbook that could be useful in a school setting, it would get a really high score. And if a page is just like an advertisement or promotional material, it would get a lower score. And then after that, we take these synthetic annotations and we train a classifier on them.[00:11:57] Speaker: It's a classifier like a BERT model. [00:12:00] And then we run this classifier on all of FineWeb, which is a 15 trillion tokens dataset. And then we only keep the pages that have like a score that's higher than 3. So for example, in our case, we went from 15 trillion tokens to 3. to just 1. 5 trillion tokens. Those are really highly educational.[00:12:16] Speaker: And as you can see here, a fine web EDU outperforms all the other public web datasets by a larger margin on a couple of benchmarks here, I show the aggregated score and you can see that this approach is really effective for filtering web datasets to get like better corpuses for training your LLMs.[00:12:36] DCLM, Nemotron-CC[00:12:36] Speaker: Others also try to do this approach. There's, for example, the DCLM datasets where they also train the classifier, but not to detect educational content. Instead, they trained it on OpenHermes dataset, which is a dataset for instruction tuning. And also they explain like IAM5 subreddits, and then they also get really high quality dataset which is like very information dense and can help [00:13:00] you train some really good LLMs.[00:13:01] Speaker: And then Nemotron Common Crawl, they also did this approach, but instead of using one classifier, they used an ensemble of classifiers. So they used, for example, the DCLM classifier, and also classifiers like the ones we used in FineWebEducational, and then they combined these two. Scores into a, with an ensemble method to only retain the best high quality pages, and they get a data set that works even better than the ones we develop.[00:13:25] Speaker: So that was it for like synthetic data for pre-training.[00:13:28] Post Training - AI2 Tulu, Smol Talk, Cohere Multilingual Arbitrage[00:13:28] Speaker: Now we can go back to post training. I think there's a lot of interesting post training data sets out there. One that was released recently, the agent instructs by Microsoft where they basically try to target some specific skills. And improve the performance of models on them.[00:13:43] Speaker: For example, here, you can see code, brain teasers, open domain QA, and they managed to get a dataset that outperforms that's when fine tuning Mistral 7b on it, it outperforms the original instruct model that was released by Mistral. And as I said, to get good synthetic data, you really [00:14:00] have to have a framework to make sure that your data is diverse.[00:14:03] Speaker: So for example, for them, they always. And then they see the generations on either source code or raw text documents, and then they rewrite them to make sure they're easier to generate instructions from, and then they use that for their like instruction data generation. There's also the Tool3SFT mixture, which was released recently by Allen AI.[00:14:23] Speaker: It's also really good quality and it covers a wide range of tasks. And the way they make sure that this dataset is diverse is by using personas from the persona hub datasets. Which is basically a data set of like I think over a million personas. And for example, in the tool mixture to generate like a new code snippet, they would give like the model persona, for example, a machine learning researcher interested in neural networks, and then ask it to generate like a coding problem.[00:14:49] Speaker: This way you make sure that your data set is really diverse, and then you can further filter the data sets, for example, using the reward models. We also released a dataset called Smalltalk, [00:15:00] and we also tried to cover the wide range of tasks, and as you can see here, for example, when fine tuning Mistral 7b on the dataset, we also outperformed the original Mistral instructs on a number of benchmarks, notably on mathematics and instruction following with ifevil.[00:15:18] Speaker: Another paper that's really interesting I wanted to mention is this one called Multilingual Data Arbitrage by Cohere. And basically they want to generate a data set for post training that is multilingual. And they have a really interesting problem. It's the fact that there isn't like one model that's really good at all the languages they wanted.[00:15:36] Speaker: So what they do is that like they use not just one teacher model, but multiple teachers. And then they have a router which basically sends the prompts they have to all these models. And then they get the completions and they have a reward model that traces all these generations and only keeps the best one.[00:15:52] Speaker: And this is like arbitrage and finance. So well, I think what's interesting in this, it shows that like synthetic data, it doesn't have to come from a single model. [00:16:00] And because we have so many good models now, you could like pull these models together and get like a dataset that's really high quality and that's diverse and that's covers all your needs.[00:16:12] Speaker: I was supposed to put a meme there, but. Yeah, so that was it for like a synthetic data.[00:16:17] Smol Models[00:16:17] Speaker: Now we can go to see what's happening in the small models field in 2024. I don't know if you know, but like now we have some really good small models. For example, Lama 3. 2 1B is. It matches Lama 2. 13b from, that was released last year on the LMSYS arena, which is basically the default go to leaderboard for evaluating models using human evaluation.[00:16:39] Speaker: And as you can see here, the scores of the models are really close. So I think we've made like hugely forward in terms of small models. Of course, that's one, just one data point, but there's more. For example, if you look at this chart from the Quint 2. 5 blog post, it shows that today we have some really good models that are only like 3 billion parameters [00:17:00] and 4 billion that score really high on MMLU.[00:17:03] Speaker: Which is a really popular benchmark for evaluating models. And you can see here that the red, the blue dots have more than 65 on MMLU. And the grey ones have less. And for example, Llama33b had less. So now we have a 3b model that outperforms a 33b model that was released earlier. So I think now people are starting to realize that like, we shouldn't just scale and scale models, but we should try to make them more efficient.[00:17:33] Speaker: I don't know if you knew, but you can also chat with a 3B plus model on your iPhone. For example, here, this is an app called PocketPal, where you can go and select a model from Hugging Face. It has a large choice. For example, here we loaded the 5. 3. 5, which is 3. 8 billion parameters on this iPhone. And we can chat with this and you can see that even the latency is also acceptable.[00:17:57] Speaker: For example, here, I asked it to give me a joke about [00:18:00] NeurIPS. So let's see what it has to say.[00:18:06] Speaker: Okay, why did the neural network attend NeurIPS? Because it heard there would be a lot of layers and fun and it wanted to train its sense of humor. So not very funny, but at least it can run on device. Yeah, so I think now we have good small models, but we also have like good frameworks and tools to use these small models.[00:18:24] On Device Models[00:18:24] Speaker: So I think we're really close to having like really on edge and on device models that are really good. And I think for a while we've had this narrative. But just training larger models is better. Of course, this is supported by science scaling laws. As you can see here, for example, when we scale the model size, the loss is lower and obviously you get a better model.[00:18:46] Speaker: But and we can see this, for example, in the GPT family of models, how we went from just a hundred million parameters to more than a trillion. parameters. And of course, we all observed the performance improvement when using the latest model. But [00:19:00] one thing that we shouldn't forget is that when we scale the model, we also scale the inference costs and time.[00:19:05] Speaker: And so the largest models were are going to cost so much more. So I think now instead of just building larger models, we should be focusing on building more efficient models. It's no longer a race for the largest models since these models are really expensive to run and they require like a really good infrastructure to do that and they cannot run on, for example, consumer hardware.[00:19:27] Speaker: And when you try to build more efficient models that match larger models, that's when you can really unlock some really interesting on device use cases. And I think a trend that we're noticing now is the trend of training smaller models longer. For example, if you compare how much, how long LLAMA was trained compared to LLAMA3, there is a huge increase in the pre training length.[00:19:50] Speaker: LLAMA was trained on 1 trillion tokens, but LLAMA3 8b was trained on 15 trillion tokens. So Meta managed to get a model that's the same size, but But it performs so much [00:20:00] better by choosing to like spend the sacrifice during training, because as we know, training is a one time cost, but inference is something that's ongoing.[00:20:08] Speaker: If we want to see what are like the small models reads in 2024, I think this mobile LLM paper by Meta is interesting. They try to study different models that are like have the less than 1 billion parameters and find which architecture makes most sense for these models. For example, they find that depth is more important than width.[00:20:29] Speaker: So it's more important to have models that have like more layers than just one. making them more wide. They also find that GQA helps, that tying the embedding helps. So I think it's a nice study overall for models that are just a few hundred million parameters. There's also the Apple intelligence tech report, which is interesting.[00:20:48] Speaker: So for Apple intelligence, they had two models, one that was like on server and another model that was on device. It had 3 billion parameters. And I think the interesting part is that they trained this model using [00:21:00] pruning. And then distillation. And for example, they have this table where they show that, like, using pruning and distillation works much better than training from scratch.[00:21:08] Speaker: And they also have some interesting insights about, like, how they specialize their models on specific tasks, like, for example, summarization and rewriting. There's also this paper by NVIDIA that was released recently. I think you've already had a talk about, like, hybrid models that was all interesting.[00:21:23] Speaker: And this model, they used, like, a hybrid architecture between state space models and transformers. And they managed to train a 1B model that's really performant without needing to train it on a lot of tokens. And regarding our work, we just recently released SmallM2, so it's a series of three models, which are the best in class in each model size.[00:21:46] Speaker: For example, our 1. 7b model outperforms Lama 1b and also Qt 2. 5. And how we managed to train this model is the following. That's where you spent a lot of time trying to curate the pre training datasets. We did a lot of [00:22:00] ablations, trying to find which datasets are good and also how to mix them. We also created some new math and code datasets that we're releasing soon.[00:22:08] Speaker: But you basically really spent a lot of time trying to find what's the best mixture that you can train these models on. And then we spent some time trying to like we also trained these models for very long. For example, small M1 was trained only on 1 trillion tokens, but this model is trained on 11 trillion tokens.[00:22:24] Speaker: And we saw that the performance kept improving. The models didn't really plateau mid training, which I think is really interesting. It shows that you can train such small models for very long and keep getting performance gains. What's interesting about SmallLM2 is that it's fully open. We also released, like the pre training code base, the fine tuning code, the datasets, and also evaluation in this repository.[00:22:45] Smol Vision Models[00:22:45] Speaker: Also there's, like, really interesting small models for text, but also for vision. For example, here you can see SmallVLM, which is a 2B model that's really efficient. It doesn't consume a lot of RAM, and it also has a good performance. There's also Moondream 0. [00:23:00] 5b, which was released recently. It's like the smallest visual language model.[00:23:04] Speaker: And as you can see, there isn't like a big trade off compared to Moondream 2b. So now I showed you that we have some really good small models. We also have the tools to use them, but why should you consider using small models and when? I think, like, small models are really interesting because of the on device feature.[00:23:23] Speaker: Because these models are small and they can run fast, you can basically run them on your laptop, but also on your mobile phone. And this means that your dataset stays locally. You don't have to send your queries to third parties. And this really enhances privacy. That was, for example, one of the big selling points for Apple Intelligence.[00:23:42] Speaker: Also, right now, we really have a lot of work to do. So many frameworks to do on device inference. For example, there's MLX, MLC, Llama, CPP, Transformers, JS. So we have a lot of options and each of them have like great features. So you have so many options for doing that. Small models are also really powerful if you choose to specialize them.[00:24:00][00:24:00] Speaker: For example, here there's a startup called Numind, which took small LM and then they fine tuned it on text extraction datasets. And they managed to get a model that's not very far from models that are much larger. So I think text extraction is like one use case where small models can be really performant and it makes sense to use them instead of just using larger models.[00:24:19] Speaker: You can also chat with these models in browser. For example, here, you can go there, you can load the model, you can even turn off your internet and just start chatting with the model locally. Speaking of text extraction, if you don't want to fine tune the models, there's a really good method of structure generation.[00:24:36] Speaker: We can basically force the models to follow a JSON schema that you defined. For example, here, we try to force the model to follow a schema for extracting key information from GitHub issues. So you can input free text, which is a complaint about a GitHub repository, something not working. And then you can run it there and the model can extract anything that is relevant for your GitHub issue creation.[00:24:58] Speaker: For example, the [00:25:00] priority, for example, here, priority is high, the type of the issue bug, and then a title and the estimation of how long this will take to fix. And you can just like do this in the browser, you can transform your text into a GitHub issue that's properly formatted.[00:25:14] What's Next[00:25:14] Speaker: So what's next for synthetic data and small models?[00:25:18] Speaker: I think that domain specific synthetic data is going to be, it's already important, it's going to be even more important. For example, generating synthetic data for math. I think this really would help improve the reasoning of a lot of models. And a lot of people are doing it, for example, Quint 2. 12 math, everyone's trying to reproduce a one.[00:25:37] Speaker: And so I think for synthetic data, trying to specialize it on some domains is going to be really important. And then for small models, I think specializing them through fine tuning, it's also going to be really important because I think a lot of companies are just trying to use these large models because they are better.[00:25:53] Speaker: But on some tasks, I think you can already get decent performance with small models. So you don't need to Pay like a [00:26:00] cost that's much larger just to make your model better at your task by a few percent. And this is not just for text. And I think it also applies for other modalities like vision and audio.[00:26:11] Speaker: And I think you should also watch out for on device frameworks and applications. For example, like the app I showed, or lama, all these frameworks are becoming really popular and I'm pretty sure that we're gonna get like more of them in 2025. And users really like that. Maybe for other, I should also say hot take.[00:26:28] Speaker: I think that like in AI, we just started like with fine tuning, for example, trying to make BERT work on some specific use cases, and really struggling to do that. And then we had some models that are much larger. So we just switched to like prompt engineering to get the models And I think we're going back to fine tuning where we realize these models are really costly.[00:26:47] Speaker: It's better to use just a small model or try to specialize it. So I think it's a little bit of a cycle and we're going to start to see like more fine tuning and less of just like a prompt engineering the models. So that was my talk. Thank you for following. And if you have [00:27:00] any questions, we can take them now. Get full access to Latent Space at www.latent.space/subscribe

EP 54. 深度对谈顶尖AI开源项目：大模型开源生态, Agent 与中国力量

OnBoard!

Play Episode Listen Later Dec 16, 2024 199:06

聊到生成式AI的发展，开源绝对是最关键的话题之一。这次的嘉宾，可以说涵盖了大模型开源领域最值得关注的公司，真的是黄金阵容！首先跟大家汇报一下，上周日我们在北京举办的 OnBoard! 第一次线下听友会真是超预期！开放报名4天就250多人报名，周日从上午9点到下午3点，从机器人到AI，创业投资和软件出海，100人的场地，直到最后都几乎座无虚席！真的是非常感谢大家的支持~ Hello World, who is OnBoard!? 回到这一期播客，我们将深入探讨大模型的开源生态。在生成式AI飞速发展的一年多时间里，开源无疑是一个不可忽视的话题。开源模型的迅猛发展，从 Meta 的 Llama 3 到 Mistral 的最新模型，它们对闭源大模型如 GPT4 的追赶，不仅令人惊艳，更加速了 AI 场景下产品的实际应用。而围绕大模型的生态系统，从推理加速到开发工具，再到智能代理，技术栈的丰富程度，虽然已经孕育出了像 Langchain 这样的领军企业，但这一切似乎只是冰山一角。特别值得一提的是，随着阿里千问系列、Deepseek、以及 Yi 等中国团队主导的模型在国际舞台上崭露头角，我们不禁思考，除了模仿和追赶，中国在大模型领域的发展是否还有更多值得我们关注和自豪的成就。今天，Monica 有幸邀请到了几位极具代表性的重磅嘉宾，来自 Huggingface 的开源老兵，有通义千问 Qwen 的开源负责人（他也是 Agent 领域最受关注的项目 OpenDevin 核心成员），还有最具国际影响力的开源项目 vLLM 主导人。真是涵盖了大模型开源生态的各个领域的最一线视角！嘉宾们都太宝藏了，我们的话题延伸到大模型的各个方面，录了近4个小时！我们前半部分聊了很多infra的创新，以及最近很火的、以OpenDevin 为代表的软件开发agent 背后的技术和生态等话题。下半部分，我们回到大模型开源的主题，畅谈了：底层基础大模型的开源闭源生态，未来可能有怎样的演进？开源模型商业化跟过去我们在大数据时代看到的databricks 之类开源商业模式有哪些异同？如何做一个有国际影响力的开源项目？嘉宾介绍 Tiezhen Wang, Huggingface 工程师，他可以说是中国与世界开源 AI 生态的桥梁，更是从 Google TensorFlow 时代到 Huggingface 早期员工，对中国和世界的开源 AI 生态都有极深的洞察。 Junyang Lin, 通义千问开源负责人，作为 Qwen 在全球开源社区的主要代言人，他不仅见证了开源的发展历程，还是目前备受瞩目的 Agent 开源项目 OpenDevin 的核心团队成员。李卓翰，UC Berkeley PhD，他所主导的项目更是大名鼎鼎，就是已经成为行业标准的大模型推理框架 vLLM！他所在的 Sky Lab 被誉为开源基础设施的摇篮，从估值百亿美元的 Databricks 到 Anyscale（开源计算框架 Ray 的商业化公司）。他还深度参与了 Chat Arena, Vicuna 等多个国际知名开源项目，对大模型周边生态和 infra 的不仅有国际一线经验，更是有很多有技术理想的干货！ OnBoard! 主持：Monica：美元VC投资人，前 AWS 硅谷团队+ AI 创业公司打工人，公众号M小姐研习录 (ID: MissMStudy) 主理人 | 即刻：莫妮卡同学还有数据、评测等等大模型领域的核心话题，真的非常全面，又不失一线从业者的深度。索性就不分成两部分了，大家可以对着 show notes 里面的时间戳，直接跳转到你感兴趣的话题（虽然我觉得每个话题都很好！）介绍了这么多，还要声明一下，节目里面重点聊到的开源社区 Huggingface，还有几个开源的项目，包括阿里千问、OpenDevin, Deepseek, 零一万物的 Yi，vLLM 等，都没有收取任何广告，完全是嘉宾走心分享，全程无广！当然，如果你们或者其他AI公司考虑赞助一下我们用爱发电的播客，我们当然也是欢迎的！三小时硬核马拉松开始，enjoy! 嘉宾介绍我们都聊了什么 05:28 嘉宾自我介绍，有意思的开源 AI 项目 18:37 vLLM 如何开始的，如何成为全球顶尖项目，为什么我们需要一个大模型推理框架？ 30:24 Agent framework: OpenDevin 这样的负责 agent 会带来怎样的推理挑战？ 40:37 做好一个编程 Agent，还需要哪些新的工具？多模态会带来怎样的变化？ 56:16 我们需要怎样的 Agent Framework？为什么最适合开源社区来做？Framework 会收敛吗？ 67:46 什么是 Crew AI? 如何看待 Multi-agent 架构？ 73:11 借鉴前端框架的发展历史，如何理解一个框架如何成为行业标准？ 77:54 Huggingface 上开源LLM现状，过去一年多有哪些重要进展？有哪些不同的开源方式？泽娜要给你看待一个开源模型的流行程度？ 94:27 如何理解不同架构的开源大模型生态？Qwen 如何通过架构演进打造更好的开源生态？ 104:59 中国的大模型开源项目有哪些创新？大模型架构有哪些变化？ 112:17 为什么说新的模型架构可能会带来商业化的新机会？我们能从以前的开源商业化中学到什么？ 119:22 我们看到现有大模型架构的天花板了吗？什么是一个新的架构？ 128:03 Zhuohan 从参与最早的开源 LLM 之一 Vicuna 的经历学到什么？学术界和业界在大模型生态上如何分工？ 140:48 用于大模型的数据集领域有哪些值得关注的进展？ 149:42 Mistral 为什么这么快爆火？打造一流国际开源项目有什么可借鉴的经验？vLLM 有什么道和术上的心得？ 166:13 Chatbot Arena 是如何开始的？为什么模型的评测那么重要？还有哪些挑战和可能的进展？ 180:49 Zhuohan 对于 vLLM 商业化方式有什么思考？未来推理成本还有哪些下降空间？ 188:17 快问快答：过去一年生成式AI发展有什么超出预期和不及预期的地方？未来还有什么值得期待？我们提到的公司和重点名词 Qwen⁠, ⁠Qwen-2⁠ OpenDevin: ⁠opendevin.github.io⁠ vLLM: ⁠github.com⁠ ⁠Yi (Github)⁠, ⁠零一万物⁠ Chatbot Arena: ⁠huggingface.co⁠ AutoGPT: ⁠github.com⁠ crew AI: ⁠www.crewai.com⁠ autoAWQ: ⁠github.com⁠ LLM.c: ⁠github.com⁠ Flash attention: ⁠github.com⁠ Continuous batching：一种数据处理技术，用于将连续的数据流分批处理，以提高效率和可扩展性。 KV cache：键值对缓存，一种存储结构，通过键快速访问数据值，常用于提高数据检索速度。 Page attention：页面注意力机制，一种在处理长文本时，使模型集中注意力于当前页面或段落的技术。 Quantization：量化，将数据表示的精度降低到更少的比特数，以减少模型大小和提高计算效率。 ⁠Direct Preference Optimization (DPO)⁠: Your Language Model is Secretly a Reward Model Google Gemini: ⁠deepmind.google⁠ Adept: ⁠www.adept.ai⁠ MetaGPT: ⁠github.com⁠ ⁠Dolphin⁠an open-source and uncensored, and commercially licensed dataset and series of instruct-tuned language models based on Microsoft's Orca paper Common crawl: ⁠commoncrawl.org⁠ Tiezhen 的报告：⁠Booming Open Source Chinese-Speaking LLMs: A Closer Look⁠, ⁠Slides⁠ ⁠通义千问一周年，开源狂飙路上的抉择与思考｜魔搭深度访谈⁠ ⁠阿里林俊旸：大模型对很多人来说不够用，打造多模态Agent是关键 | 中国AIGC产业峰会⁠ 欢迎关注M小姐的微信公众号，了解更多中美软件、AI与创业投资的干货内容！ M小姐研习录 (ID: MissMStudy) 喜欢 OnBoard! 的话，也可以点击打赏，请我们喝一杯咖啡！如果你用 Apple Podcasts 收听，也请给我们一个五星好评，这对我们非常重要。最后！快来加入Onboard！听友群，结识到高质量的听友们，我们还会组织线下主题聚会，开放实时旁听播客录制，嘉宾互动等新的尝试。添加任意一位小助手微信，onboard666, 或者 Nine_tunes,小助手会拉你进群。期待你来！

ai microsoft flash agent framework continuous secretly llm orca hello world onboard kv mistral adept autogpt langchain huggingface aigc vicuna

Bolt.new, Flow Engineering for Code Agents, and >$8m ARR in 2 months as a Claude Wrapper

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Dec 2, 2024 98:39

The full schedule for Latent Space LIVE! at NeurIPS has been announced, featuring Best of 2024 overview talks for the AI Startup Landscape, Computer Vision, Open Models, Transformers Killers, Synthetic Data, Agents, and Scaling, and speakers from Sarah Guo of Conviction, Roboflow, AI2/Meta, Recursal/Together, HuggingFace, OpenHands and SemiAnalysis. Join us for the IRL event/Livestream! Alessio will also be holding a meetup at AWS Re:Invent in Las Vegas this Wednesday. See our new Events page for dates of AI Engineer Summit, Singapore, and World's Fair in 2025. LAST CALL for questions for our big 2024 recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!When we first observed that GPT Wrappers are Good, Actually, we did not even have Bolt on our radar. Since we recorded our Anthropic episode discussing building Agents with the new Claude 3.5 Sonnet, Bolt.new (by Stackblitz) has easily cleared the $8m ARR bar, repeating and accelerating its initial $4m feat.There are very many AI code generators and VS Code forks out there, but Bolt probably broke through initially because of its incredible zero shot low effort app generation:But as we explain in the pod, Bolt also emphasized deploy (Netlify)/ backend (Supabase)/ fullstack capabilities on top of Stackblitz's existing WebContainer full-WASM-powered-developer-environment-in-the-browser tech. Since then, the team has been shipping like mad (with weekly office hours), with bugfixing, full screen, multi-device, long context, diff based edits (using speculative decoding like we covered in Inference, Fast and Slow).All of this has captured the imagination of low/no code builders like Greg Isenberg and many others on YouTube/TikTok/Reddit/X/Linkedin etc:Just as with Fireworks, our relationship with Bolt/Stackblitz goes a bit deeper than normal - swyx advised the launch and got a front row seat to this epic journey, as well as demoed it with Realtime Voice at the recent OpenAI Dev Day. So we are very proud to be the first/closest to tell the full open story of Bolt/Stackblitz!Flow Engineering + Qodo/AlphaCodium UpdateIn year 2 of the pod we have been on a roll getting former guests to return as guest cohosts (Harrison Chase, Aman Sanger, Jon Frankle), and it was a pleasure to catch Itamar Friedman back on the pod, giving us an update on all things Qodo and Testing Agents from our last catchup a year and a half ago:Qodo (they renamed in September) went viral in early January this year with AlphaCodium (paper here, code here) beating DeepMind's AlphaCode with high efficiency:With a simple problem solving code agent:* The first step is to have the model reason about the problem. They describe it using bullet points and focus on the goal, inputs, outputs, rules, constraints, and any other relevant details.* Then, they make the model reason about the public tests and come up with an explanation of why the input leads to that particular output. * The model generates two to three potential solutions in text and ranks them in terms of correctness, simplicity, and robustness. * Then, it generates more diverse tests for the problem, covering cases not part of the original public tests. * Iteratively, pick a solution, generate the code, and run it on a few test cases. * If the tests fail, improve the code and repeat the process until the code passes every test.swyx has previously written similar thoughts on types vs tests for putting bounds on program behavior, but AlphaCodium extends this to AI generated tests and code.More recently, Itamar has also shown that AlphaCodium's techniques also extend well to the o1 models:Making Flow Engineering a useful technique to improve code model performance on every model. This is something we see AI Engineers uniquely well positioned to do compared to ML Engineers/Researchers.Full Video PodcastLike and subscribe!Show Notes* Itamar* Qodo* First episode* Eric* Bolt* StackBlitz* Thinkster* AlphaCodium* WebContainersChapters* 00:00:00 Introductions & Updates* 00:06:01 Generic vs. Specific AI Agents* 00:07:40 Maintaining vs Creating with AI* 00:17:46 Human vs Agent Computer Interfaces* 00:20:15 Why Docker doesn't work for Bolt* 00:24:23 Creating Testing and Code Review Loops* 00:28:07 Bolt's Task Breakdown Flow* 00:31:04 AI in Complex Enterprise Environments* 00:41:43 AlphaCodium* 00:44:39 Strategies for Breaking Down Complex Tasks* 00:45:22 Building in Open Source* 00:50:35 Choosing a product as a founder* 00:59:03 Reflections on Bolt Success* 01:06:07 Building a B2C GTM* 01:18:11 AI Capabilities and Pricing Tiers* 01:20:28 What makes Bolt unique* 01:23:07 Future Growth and Product Development* 01:29:06 Competitive Landscape in AI Engineering* 01:30:01 Advice to Founders and Embracing AI* 01:32:20 Having a baby and completing an Iron ManTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:12]: Hey, and today we're still in our sort of makeshift in-between studio, but we're very delighted to have a former returning guest host, Itamar. Welcome back.Itamar [00:00:21]: Great to be here after a year or more. Yeah, a year and a half.Swyx [00:00:24]: You're one of our earliest guests on Agents. Now you're CEO co-founder of Kodo. Right. Which has just been renamed. You also raised a $40 million Series A, and we can get caught up on everything, but we're also delighted to have our new guest, Eric. Welcome.Eric [00:00:42]: Thank you. Excited to be here. Should I say Bolt or StackBlitz?Swyx [00:00:45]: Like, is it like its own company now or?Eric [00:00:47]: Yeah. Bolt's definitely bolt.new. That's the thing that we're probably the most known for, I imagine, at this point.Swyx [00:00:54]: Which is ridiculous to say because you were working at StackBlitz for so long.Eric [00:00:57]: Yeah. I mean, within a week, we were doing like double the amount of traffic. And StackBlitz had been online for seven years, and we were like, what? But anyways, yeah. So we're StackBlitz, the company behind bolt.new. If you've heard of bolt.new, that's our stuff. Yeah.Swyx [00:01:12]: Yeah.Itamar [00:01:13]: Excellent. I see, by the way, that the founder mode, you need to know to capture opportunities. So kudos on doing that, right? You're working on some technology, and then suddenly you can exploit that to a new world. Yeah.Eric [00:01:24]: Totally. And I think, well, not to jump, but 100%, I mean, a couple of months ago, we had the idea for Bolt earlier this year, but we haven't really shared this too much publicly. But we actually had tried to build it with some of those state-of-the-art models back in January, February, you can kind of imagine which, and they just weren't good enough to actually do the code generation where the code was accurate and it was fast and whatever have you without a ton of like rag, but then there was like issues with that. So we put it on the shelf and then we got kind of a sneak peek of some of the new models that have come out in the past couple of months now. And so once we saw that, once we actually saw the code gen from it, we were like, oh my God, like, okay, we can build a product around this. And so that was really the impetus of us building the thing. But with that, it was StackBlitz, the core StackBlitz product the past seven years has been an IDE for developers. So the entire user experience flow we've built up just didn't make sense. And so when we kind of went out to build Bolt, we just thought, you know, if we were inventing our product today, what would the interface look like given what is now possible with the AI code gen? And so there's definitely a lot of conversations we had internally, but you know, just kind of when we logically laid it out, we were like, yeah, I think it makes sense to just greenfield a new thing and let's see what happens. If it works great, then we'll figure it out. If it doesn't work great, then it'll get deleted at some point. So that's kind of how it actually came to be.Swyx [00:02:49]: I'll mention your background a little bit. You were also founder of Thinkster before you started StackBlitz. So both of you are second time founders. Both of you have sort of re-founded your company recently. Yours was more of a rename. I think a slightly different direction as well. And then we can talk about both. Maybe just chronologically, should we get caught up on where Kodo is first and then you know, just like what people should know since the last pod? Sure.Itamar [00:03:12]: The last pod was two months after we launched and we basically had the vision that we talked about. The idea that software development is about specification, test and code, etc. We are more on the testing part as in essence, we think that if you solve testing, you solve software development. The beautiful chart that we'll put up on screen. And testing is a really big field, like there are many dimensions, unit testing, the level of the component, how big it is, how large it is. And then there is like different type of testing, is it regression or smoke or whatever. So back then we only had like one ID extension with unit tests as in focus. One and a half year later, first ID extension supports more type of testing as context aware. We index local, local repos, but also 10,000s of repos for Fortune 500 companies. We have another agent, another tool that is called, the pure agent is the open source and the commercial one is CodoMerge. And then we have another open source called CoverAgent, which is not yet a commercial product coming very soon. It's very impressive. It could be that already people are approving automated pull requests that they don't even aware in really big open sources. So once we have enough of these, we will also launch another agent. So for the first one and a half year, what we did is grew in our offering and mostly on the side of, does this code actually works, testing, code review, et cetera. And we believe that's the critical milestone that needs to be achieved to actually have the AI engineer for enterprise software. And then like for the first year was everything bottom up, getting to 1 million installation. 2024, that was 2023, 2024 was starting to monetize, to feel like how it is to make the first buck. So we did the teams offering, it went well with a thousand of teams, et cetera. And then we started like just a few months ago to do enterprise with everything you need, which is a lot of things that discussed in the last post that was just released by Codelm. So that's how we call it at Codelm. Just opening the brackets, our company name was Codelm AI, and we renamed to Codo and we call our models Codelm. So back to my point, so we started Enterprise Motion and already have multiple Fortune 100 companies. And then with that, we raised a series of $40 million. And what's exciting about it is that enables us to develop more agents. That's our focus. I think it's very different. We're not coming very soon with an ID or something like that.Swyx [00:06:01]: You don't want to fork this code?Itamar [00:06:03]: Maybe we'll fork JetBrains or something just to be different.Swyx [00:06:08]: I noticed that, you know, I think the promise of general purpose agents has kind of died. Like everyone is doing kind of what you're doing. There's Codogen, Codomerge, and then there's a third one. What's the name of it?Itamar [00:06:17]: Yeah. Codocover. Cover. Which is like a commercial version of a cover agent. It's coming soon.Swyx [00:06:23]: Yeah. It's very similar with factory AI, also doing like droids. They all have special purpose doing things, but people don't really want general purpose agents. Right. The last time you were here, we talked about AutoGBT, the biggest thing of 2023. This year, not really relevant anymore. And I think it's mostly just because when you give me a general purpose agent, I don't know what to do with it.Eric [00:06:42]: Yeah.Itamar [00:06:43]: I totally agree with that. We're seeing it for a while and I think it will stay like that despite the computer use, et cetera, that supposedly can just replace us. You can just like prompt it to be, hey, now be a QA or be a QA person or a developer. I still think that there's a few reasons why you see like a dedicated agent. Again, I'm a bit more focused, like my head is more on complex software for big teams and enterprise, et cetera. And even think about permissions and what are the data sources and just the same way you manage permissions for users. Developers, you probably want to have dedicated guardrails and dedicated approvals for agents. I intentionally like touched a point on not many people think about. And of course, then what you can think of, like maybe there's different tools, tool use, et cetera. But just the first point by itself is a good reason why you want to have different agents.Alessio [00:07:40]: Just to compare that with Bot.new, you're almost focused on like the application is very complex and now you need better tools to kind of manage it and build on top of it. On Bot.new, it's almost like I was using it the other day. There's basically like, hey, look, I'm just trying to get started. You know, I'm not very opinionated on like how you're going to implement this. Like this is what I want to do. And you build a beautiful app with it. What people ask as the next step, you know, going back to like the general versus like specific, have you had people say, hey, you know, this is great to start, but then I want a specific Bot.new dot whatever else to do a more vertical integration and kind of like development or what's the, what do people say?Eric [00:08:18]: Yeah. I think, I think you kind of hit the, hit it head on, which is, you know, kind of the way that we've, we've kind of talked about internally is it's like people are using Bolt to go from like 0.0 to 1.0, like that's like kind of the biggest unlock that Bolt has versus most other things out there. I mean, I think that's kind of what's, what's very unique about Bolt. I think the, you know, the working on like existing enterprise applications is, I mean, it's crazy important because, you know, there's a, you look, when you look at the fortune 500, I mean, these code bases, some of these have been around for 20, 30 plus years. And so it's important to be going from, you know, 101.3 to 101.4, et cetera. I think for us, so what's been actually pretty interesting is we see there's kind of two different users for us that are coming in and it's very distinct. It's like people that are developers already. And then there's people that have never really written software and more if they have, it's been very, very minimal. And so in the first camp, what these developers are doing, like to go from zero to one, they're coming to Bolt and then they're ejecting the thing to get up or just downloading it and, you know, opening cursor, like whatever to, to, you know, keep iterating on the thing. And sometimes they'll bring it back to Bolt to like add in a huge piece of functionality or something. Right. But for the people that don't know how to code, they're actually just, they, they live in this thing. And that was one of the weird things when we launched is, you know, within a day of us being online, one of the most popular YouTube videos, and there's been a ton since, which was, you know, there's like, oh, Bolt is the cursor killer. And I originally saw the headlines and I was like, thanks for the views. I mean, I don't know. This doesn't make sense to me. That's not, that's not what we kind of thought.Swyx [00:09:44]: It's how YouTubers talk to each other. Well, everything kills everything else.Eric [00:09:47]: Totally. But what blew my mind was that there was any comparison because it's like cursor is a, is a local IDE product. But when, when we actually kind of dug into it and we, and we have people that are using our product saying this, I'm not using cursor. And I was like, what? And it turns out there are hundreds of thousands of people that we have seen that we're using cursor and we're trying to build apps with that where they're not traditional software does, but we're heavily leaning on the AI. And as you can imagine, it is very complicated, right? To do that with cursor. So when Bolt came out, they're like, wow, this thing's amazing because it kind of inverts the complexity where it's like, you know, it's not an IDE, it's, it's a, it's a chat-based sort of interface that we have. So that's kind of the split, which is rather interesting. We've had like the first startups now launch off of Bolt entirely where this, you know, tomorrow I'm doing a live stream with this guy named Paul, who he's built an entire CRM using this thing and you know, with backend, et cetera. And people have made their first money on the internet period, you know, launching this with Stripe or whatever have you. So that's, that's kind of the two main, the two main categories of folks that we see using Bolt though.Itamar [00:10:51]: I agree that I don't understand the comparison. It doesn't make sense to me. I think like we have like two type of families of tools. One is like we re-imagine the software development. I think Bolt is there and I think like a cursor is more like a evolution of what we already have. It's like taking the IDE and it's, it's amazing and it's okay, let's, let's adapt the IDE to an era where LLMs can do a lot for us. And Bolt is more like, okay, let's rethink everything totally. And I think we see a few tools there, like maybe Vercel, Veo and maybe Repl.it in that area. And then in the area of let's expedite, let's change, let's, let's progress with what we already have. You can see Cursor and Kodo, but we're different between ourselves, Cursor and Kodo, but definitely I think that comparison doesn't make sense.Alessio [00:11:42]: And just to set the context, this is not a Twitter demo. You've made 4 million of revenue in four weeks. So this is, this is actually working, you know, it's not a, what, what do you think that is? Like, there's been so many people demoing coding agents on Twitter and then it doesn't really work. And then you guys were just like, here you go, it's live, go use it, pay us for it. You know, is there anything in the development that was like interesting and maybe how that compares to building your own agents?Eric [00:12:08]: We had no idea, honestly, like we, we, we've been pretty blown away and, and things have just kind of continued to grow faster since then. We're like, oh, today is week six. So I, I kind of came back to the point you just made, right, where it's, you, you kind of outlined, it's like, there's kind of this new market of like kind of rethinking the software development and then there's heavily augmenting existing developers. I think that, you know, both of which are, you know, AI code gen being extremely good, it's allowed existing developers, it's allowing existing developers to camera out software far faster than they could have ever before, right? It's like the ultimate power tool for an existing developer. But this code gen stuff is now so good. And then, and we saw this over the past, you know, from the beginning of the year when we tried to first build, it's actually lowered the barrier to people that, that aren't traditionally software engineers. But the kind of the key thing is if you kind of think about it from, imagine you've never written software before, right? My co-founder and I, he and I grew up down the street from each other in Chicago. We learned how to code when we were 13 together and we've been building stuff ever since. And this is back in like the mid 2000s or whatever, you know, there was nothing for free to learn from online on the internet and how to code. For our 13th birthdays, we asked our parents for, you know, O'Reilly books cause you couldn't get this at the library, right? And so instead of like an Xbox, we got, you know, programming books. But the hardest part for everyone learning to code is getting an environment set up locally, you know? And so when we built StackBlitz, like kind of the key thesis, like seven years ago, the insight we had was that, Hey, it seems like the browser has a lot of new APIs like WebAssembly and service workers, et cetera, where you could actually write an operating system that ran inside the browser that could boot in milliseconds. And you, you know, basically there's this missing capability of the web. Like the web should be able to build apps for the web, right? You should be able to build the web on the web. Every other platform has that, Visual Studio for Windows, Xcode for Mac. The web has no built in primitive for this. And so just like our built in kind of like nerd instinct on this was like, that seems like a huge hole and it's, you know, it will be very valuable or like, you know, very valuable problem to solve. So if you want to set up that environments, you know, this is what we spent the past seven years doing. And the reality is existing developers have running locally. They already know how to set up that environment. So the problem isn't as acute for them. When we put Bolt online, we took that technology called WebContainer and married it with these, you know, state of the art frontier models. And the people that have the most pain with getting stuff set up locally is people that don't code. I think that's been, you know, really the big explosive reason is no one else has been trying to make dev environments work inside of a browser tab, you know, for the past if since ever, other than basically our company, largely because there wasn't an immediate demand or need. So I think we kind of find ourselves at the right place at the right time. And again, for this market of people that don't know how to write software, you would kind of expect that you should be able to do this without downloading something to your computer in the same way that, hey, I don't have to download Photoshop now to make designs because there's Figma. I don't have to download Word because there's, you know, Google Docs. They're kind of looking at this as that sort of thing, right? Which was kind of the, you know, our impetus and kind of vision from the get-go. But you know, the code gen, the AI code gen stuff that's come out has just been, you know, an order of magnitude multiplier on how magic that is, right? So that's kind of my best distillation of like, what is going on here, you know?Alessio [00:15:21]: And you can deploy too, right?Eric [00:15:22]: Yeah.Alessio [00:15:23]: Yeah.Eric [00:15:24]: And so that's, what's really cool is it's, you know, we have deployment built in with Netlify and this is actually, I think, Sean, you actually built this at Netlify when you were there. Yeah. It's one of the most brilliant integrations actually, because, you know, effectively the API that Sean built, maybe you can speak to it, but like as a provider, we can just effectively give files to Netlify without the user even logging in and they have a live website. And if they want to keep, hold onto it, they can click a link and claim it to their Netlify account. But it basically is just this really magic experience because when you come to Bolt, you say, I want a website. Like my mom, 70, 71 years old, made her first website, you know, on the internet two weeks ago, right? It was about her nursing days.Swyx [00:16:03]: Oh, that's fantastic though. It wouldn't have been made.Eric [00:16:06]: A hundred percent. Cause even in, you know, when we've had a lot of people building personal, like deeply personal stuff, like in the first week we launched this, the sales guy from the East Coast, you know, replied to a tweet of mine and he said, thank you so much for building this to your team. His daughter has a medical condition and so for her to travel, she has to like line up donors or something, you know, so ahead of time. And so he actually used Bolt to make a website to do that, to actually go and send it to folks in the region she was going to travel to ahead of time. I was really touched by it, but I also thought like, why, you know, why didn't he use like Wix or Squarespace? Right? I mean, this is, this is a solved problem, quote unquote, right? And then when I thought, I actually use Squarespace for my, for my, uh, the wedding website for my wife and I, like back in 2021, so I'm familiar, you know, it was, it was faster. I know how to code. I was like, this is faster. Right. And I thought back and I was like, there's a whole interface you have to learn how to use. And it's actually not that simple. There's like a million things you can configure in that thing. When you come to Bolt, there's a, there's a text box. You just say, I need a, I need a wedding website. Here's the date. Here's where it is. And here's a photo of me and my wife, put it somewhere relevant. It's actually the simplest way. And that's what my, when my mom came, she said, uh, I'm Pat Simons. I was a nurse in the seventies, you know, and like, here's the things I did and a website came out. So coming back to why is this such a, I think, why are we seeing this sort of growth? It's, this is the simplest interface I think maybe ever created to actually build it, a deploy a website. And then that website, my mom made, she's like, okay, this looks great. And there's, there's one button, you just click it, deploy, and it's live and you can buy a domain name, attach it to it. And you know, it's as simple as it gets, it's getting even simpler with some of the stuff we're working on. But anyways, so that's, it's, it's, uh, it's been really interesting to see some of the usage like that.Swyx [00:17:46]: I can offer my perspective. So I, you know, I probably should have disclosed a little bit that, uh, I'm a, uh, stack list investor.Alessio [00:17:53]: Canceled the episode. I know, I know. Don't play it now. Pause.Eric actually reached out to ShowMeBolt before the launch. And we, you know, we talked a lot about, like, the framing of, of what we're going to talk about how we marketed the thing, but also, like, what we're So that's what Bolt was going to need, like a whole sort of infrastructure.swyx: Netlify, I was a maintainer but I won't take claim for the anonymous upload. That's actually the origin story of Netlify. We can have Matt Billman talk about it, but that was [00:18:00] how Netlify started. You could drag and drop your zip file or folder from your desktop onto a website, it would have a live URL with no sign in.swyx: And so that was the origin story of Netlify. And it just persists to today. And it's just like it's really nice, interesting that both Bolt and CognitionDevIn and a bunch of other sort of agent type startups, they all use Netlify to deploy because of this one feature. They don't really care about the other features.swyx: But, but just because it's easy for computers to use and talk to it, like if you build an interface for computers specifically, that it's easy for them to Navigate, then they will be used in agents. And I think that's a learning that a lot of developer tools companies are having. That's my bolt launch story and now if I say all that stuff.swyx: And I just wanted to come back to, like, the Webcontainers things, right? Like, I think you put a lot of weight on the technical modes. I think you also are just like, very good at product. So you've, you've like, built a better agent than a lot of people, the rest of us, including myself, who have tried to build these things, and we didn't get as far as you did.swyx: Don't shortchange yourself on products. But I think specifically [00:19:00] on, on infra, on like the sandboxing, like this is a thing that people really want. Alessio has Bax E2B, which we'll have on at some point, talking about like the sort of the server full side. But yours is, you know, inside of the browser, serverless.swyx: It doesn't cost you anything to serve one person versus a million people. It doesn't, doesn't cost you anything. I think that's interesting. I think in theory, we should be able to like run tests because you can run the full backend. Like, you can run Git, you can run Node, you can run maybe Python someday.swyx: We talked about this. But ideally, you should be able to have a fully gentic loop, running code, seeing the errors, correcting code, and just kind of self healing, right? Like, I mean, isn't that the dream?Eric: Totally.swyx: Yeah,Eric: totally. At least in bold, we've got, we've got a good amount of that today. I mean, there's a lot more for us to do, but one of the nice things, because like in web container, you know, there's a lot of kind of stuff you go Google like, you know, turn docker container into wasm.Eric: You'll find a lot of stuff out there that will do that. The problem is it's very big, it's slow, and that ruins the experience. And so what we ended up doing is just writing an operating system from [00:20:00] scratch that was just purpose built to, you know, run in a browser tab. And the reason being is, you know, Docker 2 awesome things will give you an image that's like out 60 to 100 megabits, you know, maybe more, you know, and our, our OS, you know, kind of clocks in, I think, I think we're in like a, maybe, maybe a megabyte or less or something like that.Eric: I mean, it's, it's, you know, really, really, you know, stripped down.swyx: This is basically the task involved is I understand that it's. Mapping every single, single Linux call to some kind of web, web assembly implementation,Eric: but more or less, and, and then there's a lot of things actually, like when you're looking at a dev environment, there's a lot of things that you don't need that a traditional OS is gonna have, right?Eric: Like, you know audio drivers or you like, there's just like, there's just tons of things. Oh, yeah. Right. Yeah. That goes . Yeah. You can just kind, you can, you can kind of tos them. Or alternatively, what you can do is you can actually be the nice thing. And this is, this kind of comes back to the origins of browsers, which is, you know, they're, they're at the beginning of the web and, you know, the late nineties, there was two very different kind of visions for the web where Alan Kay vehemently [00:21:00] disagree with the idea that should be document based, which is, you know, Tim Berners Lee, you know, that, and that's kind of what ended up winning, winning was this document based kind of browsing documents on the web thing.Eric: Alan Kay, he's got this like very famous quote where he said, you know, you want web browsers to be mini operating systems. They should download little mini binaries and execute with like a little mini virtualized operating system in there. And what's kind of interesting about the history, not to geek out on this aspect, what's kind of interesting about the history is both of those folks ended up being right.Eric: Documents were actually the pragmatic way that the web worked. Was, you know, became the most ubiquitous platform in the world to the degree now that this is why WebAssembly has been invented is that we're doing, we need to do more low level things in a browser, same thing with WebGPU, et cetera. And so all these APIs, you know, to build an operating system came to the browser.Eric: And that was actually the realization we had in 2017 was, holy heck, like you can actually, you know, service workers, which were designed for allowing your app to work offline. That was the kind of the key one where it was like, wait a second, you can actually now run. Web servers within a [00:22:00] browser, like you can run a server that you open up.Eric: That's wild. Like full Node. js. Full Node. js. Like that capability. Like, I can have a URL that's programmatically controlled. By a web application itself, boom. Like the web can build the web. The primitive is there. Everyone at the time, like we talked to people that like worked on, you know Chrome and V8 and they were like, uhhhh.Eric: You know, like I don't know. But it's one of those things you just kind of have to go do it to find out. So we spent a couple of years, you know, working on it and yeah. And, and, and got to work in back in 2021 is when we kind of put the first like data of web container online. Butswyx: in partnership with Google, right?swyx: Like Google actually had to help you get over the finish line with stuff.Eric: A hundred percent, because well, you know, over the years of when we were doing the R and D on the thing. Kind of the biggest challenge, the two ways that you can kind of test how powerful and capable a platform are, the two types of applications are one, video games, right, because they're just very compute intensive, a lot of calculations that have to happen, right?Eric: The second one are IDEs, because you're talking about actually virtualizing the actual [00:23:00] runtime environment you are in to actually build apps on top of it, which requires sophisticated capabilities, a lot of access to data. You know, a good amount of compute power, right, to effectively, you know, building app in app sort of thing.Eric: So those, those are the stress tests. So if your platform is missing stuff, those are the things where you find out. Those are, those are the people building games and IDEs. They're the ones filing bugs on operating system level stuff. And for us, browser level stuff.Eric [00:23:47]: yeah, what ended up happening is we were just hammering, you know, the Chromium bug tracker, and they're like, who are these guys? Yeah. And, and they were amazing because I mean, just making Chrome DevTools be able to debug, I mean, it's, it's not, it wasn't originally built right for debugging an operating system, right? They've been phenomenal working with us and just kind of really pushing the limits, but that it's a rising tide that's kind of lifted all boats because now there's a lot of different types of applications that you can debug with Chrome Dev Tools that are running a browser that runs more reliably because just the stress testing that, that we and, you know, games that are coming to the web are kind of pushing as well, but.Itamar [00:24:23]: That's awesome. About the testing, I think like most, let's say coding assistant from different kinds will need this loop of testing. And even I would add code review to some, to some extent that you mentioned. How is testing different from code review? Code review could be, for example, PR review, like a code review that is done at the point of when you want to merge branches. But I would say that code review, for example, checks best practices, maintainability, and so on. It's not just like CI, but more than CI. And testing is like a more like checking functionality, et cetera. So it's different. We call, by the way, all of these together code integrity, but that's a different story. Just to go back to the, to the testing and specifically. Yeah. It's, it's, it's since the first slide. Yeah. We're consistent. So if we go back to the testing, I think like, it's not surprising that for us testing is important and for Bolt it's testing important, but I want to shed some light on a different perspective of it. Like let's think about autonomous driving. Those startups that are doing autonomous driving for highway and autonomous driving for the city. And I think like we saw the autonomous of the highway much faster and reaching to a level, I don't know, four or so much faster than those in the city. Now, in both cases, you need testing and quote unquote testing, you know, verifying validation that you're doing the right thing on the road and you're reading and et cetera. But it's probably like so different in the city that it could be like actually different technology. And I claim that we're seeing something similar here. So when you're building the next Wix, and if I was them, I was like looking at you and being a bit scared. That's what you're disrupting, what you just said. Then basically, I would say that, for example, the UX UI is freaking important. And because you're you're more aiming for the end user. In this case, maybe it's an end user that doesn't know how to develop for developers. It's also important. But let alone those that do not know to develop, they need a slick UI UX. And I think like that's one reason, for example, I think Cursor have like really good technology. I don't know the underlying what's under the hood, but at least what they're saying. But I think also their UX UI is great. It's a lot because they did their own ID. While if you're aiming for the city AI, suddenly like there's a lot of testing and code review technology that it's not necessarily like that important. For example, let's talk about integration tests. Probably like a lot of what you're building involved at the moment is isolated applications. Maybe the vision or the end game is maybe like having one solution for everything. It could be that eventually the highway companies will go into the city and the other way around. But at the beginning, there is a difference. And integration tests are a good example. I guess they're a bit less important. And when you think about enterprise software, they're really important. So to recap, like I think like the idea of looping and verifying your test and verifying your code in different ways, testing or code review, et cetera, seems to be important in the highway AI and the city AI, but in different ways and different like critical for the city, even more and more variety. Actually, I was looking to ask you like what kind of loops you guys are doing. For example, when I'm using Bolt and I'm enjoying it a lot, then I do see like sometimes you're trying to catch the errors and fix them. And also, I noticed that you're breaking down tasks into smaller ones and then et cetera, which is already a common notion for a year ago. But it seems like you're doing it really well. So if you're willing to share anything about it.Eric [00:28:07]: Yeah, yeah. I realized I never actually hit the punchline of what I was saying before. I mentioned the point about us kind of writing an operating system from scratch because what ended up being important about that is that to your point, it's actually a very, like compared to like a, you know, if you're like running cursor on anyone's machine, you kind of don't know what you're dealing with, with the OS you're running on. There could be an error happens. It could be like a million different things, right? There could be some config. There could be, it could be God knows what, right? The thing with WebConnect is because we wrote the entire thing from scratch. It's actually a unified image basically. And we can instrument it at any level that we think is going to be useful, which is exactly what we did when we started building Bolt is we instrumented stuff at like the process level, at the runtime level, you know, et cetera, et cetera, et cetera. Stuff that would just be not impossible to do on local, but to do that in a way that works across any operating system, whatever is, I mean, would just be insanely, you know, insanely difficult to do right and reliably. And that's what you saw when you've used Bolt is that when an error actually will occur, whether it's in the build process or the actual web application itself is failing or anything kind of in between, you can actually capture those errors. And today it's a very primitive way of how we've implemented it largely because the product just didn't exist 90 days ago. So we're like, we got some work ahead of us and we got to hire some more a little bit, but basically we present and we say, Hey, this is, here's kind of the things that went wrong. There's a fix it button and then a ignore button, and then you can just hit fix it. And then we take all that telemetry through our agent, you run it through our agent and say, kind of, here's the state of the application. Here's kind of the errors that we got from Node.js or the browser or whatever, and like dah, dah, dah, dah. And it can take a crack at actually solving it. And it's actually pretty darn good at being able to do that. That's kind of been a, you know, closing the loop and having it be a reliable kind of base has seemed to be a pretty big upgrade over doing stuff locally, just because I think that's a pretty key ingredient of it. And yeah, I think breaking things down into smaller tasks, like that's, that's kind of a key part of our agent. I think like Claude did a really good job with artifacts. I think, you know, us and kind of everyone else has, has kind of taken their approach of like actually breaking out certain tasks in a certain order into, you know, kind of a concrete way. And, and so actually the core of Bolt, I know we actually made open source. So you can actually go and check out like the system prompts and et cetera, and you can run it locally and whatever have you. So anyone that's interested in this stuff, I'd highly recommend taking a look at. There's not a lot of like stuff that's like open source in this realm. It's, that was one of the fun things that we've we thought would be cool to do. And people, people seem to like it. I mean, there's a lot of forks and people adding different models and stuff. So it's been cool to see.Swyx [00:30:41]: Yeah. I'm happy to add, I added real-time voice for my opening day demo and it was really fun to hack with. So thank you for doing that. Yeah. Thank you. I'm going to steal your code.Eric [00:30:52]: Because I want that.Swyx [00:30:52]: It's funny because I built on top of the fork of Bolt.new that already has the multi LLM thing. And so you just told me you're going to merge that in. So then you're going to merge two layers of forks down into this thing. So it'll be fun.Eric [00:31:03]: Heck yeah.Alessio [00:31:04]: Just to touch on like the environment, Itamar, you maybe go into the most complicated environments that even the people that work there don't know how to run. How much of an impact does that have on your performance? Like, you know, it's most of the work you're doing actually figuring out environment and like the libraries, because I'm sure they're using outdated version of languages, they're using outdated libraries, they're using forks that have not been on the public internet before. How much of the work that you're doing is like there versus like at the LLM level?Itamar [00:31:32]: One of the reasons I was asking about, you know, what are the steps to break things down, because it really matters. Like, what's the tech stack? How complicated the software is? It's hard to figure it out when you're dealing with the real world, any environment of enterprise as a city, when I'm like, while maybe sometimes like, I think you do enable like in Bolt, like to install stuff, but it's quite a like controlled environment. And that's a good thing to do, because then you narrow down and it's easier to make things work. So definitely, there are two dimensions, I think, actually spaces. One is the fact just like installing our software without yet like doing anything, making it work, just installing it because we work with enterprise and Fortune 500, etc. Many of them want on prem solution.Swyx [00:32:22]: So you have how many deployment options?Itamar [00:32:24]: Basically, we had, we did a metric metrics, say 96 options, because, you know, they're different dimensions. Like, for example, one dimension, we connect to your code management system to your Git. So are you having like GitHub, GitLab? Subversion? Is it like on cloud or deployed on prem? Just an example. Which model agree to use its APIs or ours? Like we have our Is it TestGPT? Yeah, when we started with TestGPT, it was a huge mistake name. It was cool back then, but I don't think it's a good idea to name a model after someone else's model. Anyway, that's my opinion. So we gotSwyx [00:33:02]: I'm interested in these learnings, like things that you change your mind on.Itamar [00:33:06]: Eventually, when you're building a company, you're building a brand and you want to create your own brand. By the way, when I thought about Bolt.new, I also thought about if it's not a problem, because when I think about Bolt, I do think about like a couple of companies that are already called this way.Swyx [00:33:19]: Curse companies. You could call it Codium just to...Itamar [00:33:24]: Okay, thank you. Touche. Touche.Eric [00:33:27]: Yeah, you got to imagine the board meeting before we launched Bolt, one of our investors, you can imagine they're like, are you sure? Because from the investment side, it's kind of a famous, very notorious Bolt. And they're like, are you sure you want to go with that name? Oh, yeah. Yeah, absolutely.Itamar [00:33:43]: At this point, we have actually four models. There is a model for autocomplete. There's a model for the chat. There is a model dedicated for more for code review. And there is a model that is for code embedding. Actually, you might notice that there isn't a good code embedding model out there. Can you name one? Like dedicated for code?Swyx [00:34:04]: There's code indexing, and then you can do sort of like the hide for code. And then you can embed the descriptions of the code.Itamar [00:34:12]: Yeah, but you do see a lot of type of models that are dedicated for embedding and for different spaces, different fields, etc. And I'm not aware. And I know that if you go to the bedrock, try to find like there's a few code embedding models, but none of them are specialized for code.Swyx [00:34:31]: Is there a benchmark that you would tell us to pay attention to?Itamar [00:34:34]: Yeah, so it's coming. Wait for that. Anyway, we have our models. And just to go back to the 96 option of deployment. So I'm closing the brackets for us. So one is like dimensional, like what Git deployment you have, like what models do you agree to use? Dotter could be like if it's air-gapped completely, or you want VPC, and then you have Azure, GCP, and AWS, which is different. Do you use Kubernetes or do not? Because we want to exploit that. There are companies that do not do that, etc. I guess you know what I mean. So that's one thing. And considering that we are dealing with one of all four enterprises, we needed to deal with that. So you asked me about how complicated it is to solve that complex code. I said, it's just a deployment part. And then now to the software, we see a lot of different challenges. For example, some companies, they did actually a good job to build a lot of microservices. Let's not get to if it's good or not, but let's first assume that it is a good thing. A lot of microservices, each one of them has their own repo. And now you have tens of thousands of repos. And you as a developer want to develop something. And I remember me coming to a corporate for the first time. I don't know where to look at, like where to find things. So just doing a good indexing for that is like a challenge. And moreover, the regular indexing, the one that you can find, we wrote a few blogs on that. By the way, we also have some open source, different than yours, but actually three and growing. Then it doesn't work. You need to let the tech leads and the companies influence your indexing. For example, Mark with different repos with different colors. This is a high quality repo. This is a lower quality repo. This is a repo that we want to deprecate. This is a repo we want to grow, etc. And let that be part of your indexing. And only then things actually work for enterprise and they don't get to a fatigue of, oh, this is awesome. Oh, but I'm starting, it's annoying me. I think Copilot is an amazing tool, but I'm quoting others, meaning GitHub Copilot, that they see not so good retention of GitHub Copilot and enterprise. Ooh, spicy. Yeah. I saw snapshots of people and we have customers that are Copilot users as well. And also I saw research, some of them is public by the way, between 38 to 50% retention for users using Copilot and enterprise. So it's not so good. By the way, I don't think it's that bad, but it's not so good. So I think that's a reason because, yeah, it helps you auto-complete, but then, and especially if you're working on your repo alone, but if it's need that context of remote repos that you're code-based, that's hard. So to make things work, there's a lot of work on that, like giving the controllability for the tech leads, for the developer platform or developer experience department in the organization to influence how things are working. A short example, because if you have like really old legacy code, probably some of it is not so good anymore. If you just fine tune on these code base, then there is a bias to repeat those mistakes or old practices, etc. So you need, for example, as I mentioned, to influence that. For example, in Coda, you can have a markdown of best practices by the tech leads and Coda will include that and relate to that and will not offer suggestions that are not according to the best practices, just as an example. So that's just a short list of things that you need to do in order to deal with, like you mentioned, the 100.1 to 100.2 version of software. I just want to say what you're doing is extremelyEric [00:38:32]: impressive because it's very difficult. I mean, the business of Stackplus, kind of before bulk came online, we sold a version of our IDE that went on-prem. So I understand what you're saying about the difficulty of getting stuff just working on-prem. Holy heck. I mean, that is extremely hard. I guess the question I have for you is, I mean, we were just doing that with kind of Kubernetes-based stuff, but the spread of Fortune 500 companies that you're working with, how are they doing the inference for this? Are you kind of plugging into Azure's OpenAI stuff and AWS's Bedrock, you know, Cloud stuff? Or are they just like running stuff on GPUs? Like, what is that? How are these folks approaching that? Because, man, what we saw on the enterprise side, I mean, I got to imagine that that's a huge challenge. Everything you said and more, like,Itamar [00:39:15]: for example, like someone could be, and I don't think any of these is bad. Like, they made their decision. Like, for example, some people, they're, I want only AWS and VPC on AWS, no matter what. And then they, some of them, like there is a subset, I will say, I'm willing to take models only for from Bedrock and not ours. And we have a problem because there is no good code embedding model on Bedrock. And that's part of what we're doing now with AWS to solve that. We solve it in a different way. But if you are willing to run on AWS VPC, but run your run models on GPUs or inferentia, like the new version of the more coming out, then our models can run on that. But everything you said is right. Like, we see like on-prem deployment where they have their own GPUs. We see Azure where you're using OpenAI Azure. We see cases where you're running on GCP and they want OpenAI. Like this cross, like a case, although there is Gemini or even Sonnet, I think is available on GCP, just an example. So all the options, that's part of the challenge. I admit that we thought about it, but it was even more complicated. And it took us a few months to actually, that metrics that I mentioned, to start clicking each one of the blocks there. A few months is impressive. I mean,Eric [00:40:35]: honestly, just that's okay. Every one of these enterprises is, their networking is different. Just everything's different. Every single one is different. I see you understand. Yeah. So that just cannot be understated. That it is, that's extremely impressive. Hats off.Itamar [00:40:50]: It could be, by the way, like, for example, oh, we're only AWS, but our GitHub enterprise is on-prem. Oh, we forgot. So we need like a private link or whatever, like every time like that. It's not, and you do need to think about it if you want to work with an enterprise. And it's important. Like I understand like their, I respect their point of view.Swyx [00:41:10]: And this primarily impacts your architecture, your tech choices. Like you have to, you can't choose some vendors because...Itamar [00:41:15]: Yeah, definitely. To be frank, it makes us hard for a startup because it means that we want, we want everyone to enjoy all the variety of models. By the way, it was hard for us with our technology. I want to open a bracket, like a window. I guess you're familiar with our Alpha Codium, which is an open source.Eric [00:41:33]: We got to go over that. Yeah. So I'll do that quickly.Itamar [00:41:36]: Yeah. A pin in that. Yeah. Actually, we didn't have it in the last episode. So, so, okay.Swyx [00:41:41]: Okay. We'll come back to that later, but let's talk about...Itamar [00:41:43]: Yeah. So, so just like shortly, and then we can double click on Alpha Codium. But Alpha Codium is a open source tool. You can go and try it and lets you compete on CodeForce. This is a website and a competition and actually reach a master level level, like 95% with a click of a button. You don't need to do anything. And part of what we did there is taking a problem and breaking it to different, like smaller blocks. And then the models are doing a much better job. Like we all know it by now that taking small tasks and solving them, by the way, even O1, which is supposed to be able to do system two thinking like Greg from OpenAI like hinted, is doing better on these kinds of problems. But still, it's very useful to break it down for O1, despite O1 being able to think by itself. And that's what we presented like just a month ago, OpenAI released that now they are doing 93 percentile with O1 IOI left and International Olympiad of Formation. Sorry, I forgot. Exactly. I told you I forgot. And we took their O1 preview with Alpha Codium and did better. Like it just shows like, and there is a big difference between the preview and the IOI. It shows like that these models are not still system two thinkers, and there is a big difference. So maybe they're not complete system two. Yeah, they need some guidance. I call them system 1.5. We can, we can have it. I thought about it. Like, you know, I care about this philosophy stuff. And I think like we didn't see it even close to a system two thinking. I can elaborate later. But closing the brackets, like we take Alpha Codium and as our principle of thinking, we take tasks and break them down to smaller tasks. And then we want to exploit the best model to solve them. So I want to enable anyone to enjoy O1 and SONET and Gemini 1.5, etc. But at the same time, I need to develop my own models as well, because some of the Fortune 500 want to have all air gapped or whatever. So that's a challenge. Now you need to support so many models. And to some extent, I would say that the flow engineering, the breaking down to two different blocks is a necessity for us. Why? Because when you take a big block, a big problem, you need a very different prompt for each one of the models to actually work. But when you take a big problem and break it into small tasks, we can talk how we do that, then the prompt matters less. What I want to say, like all this, like as a startup trying to do different deployment, getting all the juice that you can get from models, etc. is a big problem. And one need to think about it. And one of our mitigation is that process of taking tasks and breaking them down. That's why I'm really interested to know how you guys are doing it. And part of what we do is also open source. So you can see.Swyx [00:44:39]: There's a lot in there. But yeah, flow over prompt. I do believe that that does make sense. I feel like there's a lot that both of you can sort of exchange notes on breaking down problems. And I just want you guys to just go for it. This is fun to watch.Eric [00:44:55]: Yeah. I mean, what's super interesting is the context you're working in is, because for us too with Bolt, we've started thinking because our kind of existing business line was going behind the firewall, right? We were like, how do we do this? Adding the inference aspect on, we're like, okay, how does... Because I mean, there's not a lot of prior art, right? I mean, this is all new. This is all new. So I definitely am going to have a lot of questions for you.Itamar [00:45:17]: I'm here. We're very open, by the way. We have a paper on a blog or like whatever.Swyx [00:45:22]: The Alphacodeum, GitHub, and we'll put all this in the show notes.Itamar [00:45:25]: Yeah. And even the new results of O1, we published it.Eric [00:45:29]: I love that. And I also just, I think spiritually, I like your approach of being transparent. Because I think there's a lot of hype-ium around AI stuff. And a lot of it is, it's just like, you have these companies that are just kind of keep their stuff closed source and then just max hype it, but then it's kind of nothing. And I think it kind of gives a bad rep to the incredible stuff that's actually happening here. And so I think it's stuff like what you're doing where, I mean, true merit and you're cracking open actual code for others to learn from and use. That strikes me as the right approach. And it's great to hear that you're making such incredible progress.Itamar [00:46:02]: I have something to share about the open source. Most of our tools are, we have an open source version and then a premium pro version. But it's not an easy decision to do that. I actually wanted to ask you about your strategy, but I think in your case, there is, in my opinion, relatively a good strategy where a lot of parts of open source, but then you have the deployment and the environment, which is not right if I get it correctly. And then there's a clear, almost hugging face model. Yeah, you can do that, but why should you try to deploy it yourself, deploy it with us? But in our case, and I'm not sure you're not going to hit also some competitors, and I guess you are. I wanted to ask you, for example, on some of them. In our case, one day we looked on one of our competitors that is doing code review. We're a platform. We have the code review, the testing, et cetera, spread over the ID to get. And in each agent, we have a few startups or a big incumbents that are doing only that. So we noticed one of our competitors having not only a very similar UI of our open source, but actually even our typo. And you sit there and you're kind of like, yeah, we're not that good. We don't use enough Grammarly or whatever. And we had a couple of these and we saw it there. And then it's a challenge. And I want to ask you, Bald is doing so well, and then you open source it. So I think I know what my answer was. I gave it before, but still interestingEric [00:47:29]: to hear what you think. GeoHot said back, I don't know who he was up to at this exact moment, but I think on comma AI, all that stuff's open source. And someone had asked him, why is this open source? And he's like, if you're not actually confident that you can go and crush it and build the best thing, then yeah, you should probably keep your stuff closed source. He said something akin to that. I'm probably kind of butchering it, but I thought it was kind of a really good point. And that's not to say that you should just open source everything, because for obvious reasons, there's kind of strategic things you have to kind of take in mind. But I actually think a pretty liberal approach, as liberal as you kind of can be, it can really make a lot of sense. Because that is so validating that one of your competitors is taking your stuff and they're like, yeah, let's just kind of tweak the styles. I mean, clearly, right? I think it's kind of healthy because it keeps, I'm sure back at HQ that day when you saw that, you're like, oh, all right, well, we have to grind even harder to make sure we stay ahead. And so I think it's actually a very useful, motivating thing for the teams. Because you might feel this period of comfort. I think a lot of companies will have this period of comfort where they're not feeling the competition and one day they get disrupted. So kind of putting stuff out there and letting people push it forces you to face reality soon, right? And actually feel that incrementally so you can kind of adjust course. And that's for us, the open source version of Bolt has had a lot of features people have been begging us for, like persisting chat messages and checkpoints and stuff. Within the first week, that stuff was landed in the open source versions. And they're like, why can't you ship this? It's in the open, so people have forked it. And we're like, we're trying to keep our servers and GPUs online. But it's been great because the folks in the community did a great job, kept us on our toes. And we've got to know most of these folks too at this point that have been building these things. And so it actually was very instructive. Like, okay, well, if we're going to go kind of land this, there's some UX patterns we can kind of look at and the code is open source to this stuff. What's great about these, what's not. So anyways, NetNet, I think it's awesome. I think from a competitive point of view for us, I think in particular, what's interesting is the core technology of WebContainer going. And I think that right now, there's really nothing that's kind of on par with that. And we also, we have a business of, because WebContainer runs in your browser, but to make it work, you have to install stuff from NPM. You have to make cores bypass requests, like connected databases, which all require server-side proxying or acceleration. And so we actually sell WebContainer as a service. One of the core reasons we open-sourced kind of the core components of Bolt when we launched was that we think that there's going to be a lot more of these AI, in-your-browser AI co-gen experiences, kind of like what Anthropic did with Artifacts and Clod. By the way, Artifacts uses WebContainers. Not yet. No, yeah. Should I strike that? I think that they've got their own thing at the moment, but there's been a lot of interest in WebContainers from folks doing things in that sort of realm and in the AI labs and startups and everything in between. So I think there'll be, I imagine, over the coming months, there'll be lots of things being announced to folks kind of adopting it. But yeah, I think effectively...Swyx [00:50:35]: Okay, I'll say this. If you're a large model lab and you want to build sandbox environments inside of your chat app, you should call Eric.Itamar [00:50:43]: But wait, wait, wait, wait, wait, wait. I have a question about that. I think OpenAI, they felt that people are not using their model as they would want to. So they built ChatGPT. But I would say that ChatGPT now defines OpenAI. I know they're doing a lot of business from their APIs, but still, is this how you think? Isn't Bolt.new your business now? Why don't you focus on that instead of the...Swyx [00:51:16]: What's your advice as a founder?Eric [00:51:18]: You're right. And so going into it, we, candidly, we were like, Bolt.new, this thing is super cool. We think people are stoked. We think people will be stoked. But we were like, maybe that's allowed. Best case scenario, after month one, we'd be mind blown if we added a couple hundred K of error or something. And we were like, but we think there's probably going to be an immediate huge business. Because there was some early poll on folks wanting to put WebContainer into their product offerings, kind of similar to what Bolt is doing or whatever. We were actually prepared for the inverse outcome here. But I mean, well, I guess we've seen poll on both. But I mean, what's happened with Bolt, and you're right, it's actually the same strategy as like OpenAI or Anthropic, where we have our ChatGPT to OpenAI's APIs is Bolt to WebContainer. And so we've kind of taken that same approach. And we're seeing, I guess, some of the similar results, except right now, the revenue side is extremely lopsided to Bolt.Itamar [00:52:16]: I think if you ask me what's my advice, I think you have three options. One is to focus on Bolt. The other is to focus on the WebContainer. The third is to raise one billion dollars and do them both. I'm serious. I think otherwise, you need to choose. And if you raise enough money, and I think it's big bucks, because you're going to be chased by competitors. And I think it will be challenging to do both. And maybe you can. I don't know. We do see these numbers right now, raising above $100 million, even without havingEric [00:52:49]: a product. You can see these. It's excellent advice. And I think what's been amazing, but also kind of challenging is we're trying to forecast, okay, well, where are these things going? I mean, in the initial weeks, I think us and all the investors in the company that we're sharing this with, it was like, this is cool. Okay, we added 500k. Wow, that's crazy. Wow, we're at a million now. Most things, you have this kind of the tech crunch launch of initiation and then the thing of sorrow. And if there's going to be a downtrend, it's just not coming yet. Now that we're kind of looking ahead, we're six weeks in. So now we're getting enough confidence in our convictions to go, okay, this se

god ceo california founders tiktok world ai chicago google strategy las vegas pr advice germany building ukraine microsoft holy events fortune reflections chatgpt code human fall in love web curse os thailand engineering cloud iron man singapore id mac maintaining xbox windows bc navigate excited scaling east coast dom livestream saas heck developers cto conviction crm bots fireworks openai gemini formation salesforce correct sf bald mapping ux api canceled irl hats b2c chrome open source hq gpt python ui rsvp aws photoshop ml linux github bolt apis coda admin reasonable product development stripe sia qa 10x azure javascript last call arr copilot llm google docs squarespace upwork agi generic km artifacts php dns icp ides ide wix docker node git bedrock kubernetes anthropic gpus figma sonnets v8 deepmind subversion touche grammarly alessio wp ui ux gitlab ux ui computer vision veo speakpipe luma trl chromium tim berners lee embracing ai cursor gcp vms github copilot inference vs code itamar npm visual studio future growth webassembly xcode pmf firebase wasm jetbrains chatgbt dotter netlify smol wrapper competitive landscape codo kodo vpc ioi o1 alan kay repl neurips cogen clod huggingface supabase sonet greg isenberg alphacode stackblitz chrome devtools google colab full node webgpu latent space geohot eric you thinkster javascript node pat simons

In the Arena: How LMSys changed LLM Benchmarking Forever

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Nov 1, 2024 41:02

Apologies for lower audio quality; we lost recordings and had to use backup tracks. Our guests today are Anastasios Angelopoulos and Wei-Lin Chiang, leads of Chatbot Arena, fka LMSYS, the crowdsourced AI evaluation platform developed by the LMSys student club at Berkeley, which became the de facto standard for comparing language models. Arena ELO is often more cited than MMLU scores to many folks, and they have attracted >1,000,000 people to cast votes since its launch, leading top model trainers to cite them over their own formal academic benchmarks:The Limits of Static BenchmarksWe've done two benchmarks episodes: Benchmarks 101 and Benchmarks 201. One issue we've always brought up with static benchmarks is that 1) many are getting saturated, with models scoring almost perfectly on them 2) they often don't reflect production use cases, making it hard for developers and users to use them as guidance. The fundamental challenge in AI evaluation isn't technical - it's philosophical. How do you measure something that increasingly resembles human intelligence? Rather than trying to define intelligence upfront, Arena let users interact naturally with models and collect comparative feedback. It's messy and subjective, but that's precisely the point - it captures the full spectrum of what people actually care about when using AI.The Pareto Frontier of Cost vs IntelligenceBecause the Elo scores are remarkably stable over time, we can put all the chat models on a map against their respective cost to gain a view of at least 3 orders of magnitude of model sizes/costs and observe the remarkable shift in intelligence per dollar over the past year:This frontier stood remarkably firm through the recent releases of o1-preview and price cuts of Gemini 1.5:The Statistics of SubjectivityIn our Benchmarks 201 episode, Clémentine Fourrier from HuggingFace thought this design choice was one of shortcomings of arenas: they aren't reproducible. You don't know who ranked what and what exactly the outcome was at the time of ranking. That same person might rank the same pair of outputs differently on a different day, or might ask harder questions to better models compared to smaller ones, making it imbalanced. Another argument that people have brought up is confirmation bias. We know humans prefer longer responses and are swayed by formatting - Rob Mulla from Dreadnode had found some interesting data on this in May:The approach LMArena is taking is to use logistic regression to decompose human preferences into constituent factors. As Anastasios explains: "We can say what components of style contribute to human preference and how they contribute." By adding these style components as parameters, they can mathematically "suck out" their influence and isolate the core model capabilities.This extends beyond just style - they can control for any measurable factor: "What if I want to look at the cost adjusted performance? Parameter count? We can ex post facto measure that." This is one of the most interesting things about Arena: You have a data generation engine which you can clean and turn into leaderboards later. If you wanted to create a leaderboard for poetry writing, you could get existing data from Arena, normalize it by identifying these style components. Whether or not it's possible to really understand WHAT bias the voters have, that's a different question.Private EvalsOne of the most delicate challenges LMSYS faces is maintaining trust while collaborating with AI labs. The concern is that labs could game the system by testing multiple variants privately and only releasing the best performer. This was brought up when 4o-mini released and it ranked as the second best model on the leaderboard:But this fear misunderstands how Arena works. Unlike static benchmarks where selection bias is a major issue, Arena's live nature means any initial bias gets washed out by ongoing evaluation. As Anastasios explains: "In the long run, there's way more fresh data than there is data that was used to compare these five models." The other big question is WHAT model is actually being tested; as people often talk about on X / Discord, the same endpoint will randomly feel “nerfed” like it happened for “Claude European summer” and corresponding conspiracy theories:It's hard to keep track of these performance changes in Arena as these changes (if real…?) are not observable.The Future of EvaluationThe team's latest work on RouteLLM points to an interesting future where evaluation becomes more granular and task-specific. But they maintain that even simple routing strategies can be powerful - like directing complex queries to larger models while handling simple tasks with smaller ones.Arena is now going to expand beyond text into multimodal evaluation and specialized domains like code execution and red teaming. But their core insight remains: the best way to evaluate intelligence isn't to simplify it into metrics, but to embrace its complexity and find rigorous ways to analyze it. To go after this vision, they are spinning out Arena from LMSys, which will stay as an academia-driven group at Berkeley.Full Video PodcastChapters* 00:00:00 - Introductions* 00:01:16 - Origin and development of Chatbot Arena* 00:05:41 - Static benchmarks vs. Arenas* 00:09:03 - Community building* 00:13:32 - Biases in human preference evaluation* 00:18:27 - Style Control and Model Categories* 00:26:06 - Impact of o1* 00:29:15 - Collaborating with AI labs* 00:34:51 - RouteLLM and router models* 00:38:09 - Future of LMSys / ArenaShow Notes* Anastasios Angelopoulos* Anastasios' NeurIPS Paper Conformal Risk Control* Wei-Lin Chiang* Chatbot Arena* LMSys* MTBench* ShareGPT dataset* Stanford's Alpaca project* LLMRouter* E2B* DreadnodeTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:14]: Hey, and today we're very happy and excited to welcome Anastasios and Wei Lin from LMSys. Welcome guys.Wei Lin [00:00:21]: Hey, how's it going? Nice to see you.Anastasios [00:00:23]: Thanks for having us.Swyx [00:00:24]: Anastasios, I actually saw you, I think at last year's NeurIPS. You were presenting a paper, which I don't really super understand, but it was some theory paper about how your method was very dominating over other sort of search methods. I don't remember what it was, but I remember that you were a very confident speaker.Anastasios [00:00:40]: Oh, I totally remember you. Didn't ever connect that, but yes, that's definitely true. Yeah. Nice to see you again.Swyx [00:00:46]: Yeah. I was frantically looking for the name of your paper and I couldn't find it. Basically I had to cut it because I didn't understand it.Anastasios [00:00:51]: Is this conformal PID control or was this the online control?Wei Lin [00:00:55]: Blast from the past, man.Swyx [00:00:57]: Blast from the past. It's always interesting how NeurIPS and all these academic conferences are sort of six months behind what people are actually doing, but conformal risk control, I would recommend people check it out. I have the recording. I just never published it just because I was like, I don't understand this enough to explain it.Anastasios [00:01:14]: People won't be interested.Wei Lin [00:01:15]: It's all good.Swyx [00:01:16]: But ELO scores, ELO scores are very easy to understand. You guys are responsible for the biggest revolution in language model benchmarking in the last few years. Maybe you guys want to introduce yourselves and maybe tell a little bit of the brief history of LMSysWei Lin [00:01:32]: Hey, I'm Wei Lin. I'm a fifth year PhD student at UC Berkeley, working on Chatbot Arena these days, doing crowdsourcing AI benchmarking.Anastasios [00:01:43]: I'm Anastasios. I'm a sixth year PhD student here at Berkeley. I did most of my PhD on like theoretical statistics and sort of foundations of model evaluation and testing. And now I'm working 150% on this Chatbot Arena stuff. It's great.Alessio [00:02:00]: And what was the origin of it? How did you come up with the idea? How did you get people to buy in? And then maybe what were one or two of the pivotal moments early on that kind of made it the standard for these things?Wei Lin [00:02:12]: Yeah, yeah. Chatbot Arena project was started last year in April, May, around that. Before that, we were basically experimenting in a lab how to fine tune a chatbot open source based on the Llama 1 model that I released. At that time, Lama 1 was like a base model and people didn't really know how to fine tune it. So we were doing some explorations. We were inspired by Stanford's Alpaca project. So we basically, yeah, grow a data set from the internet, which is called ShareGPT data set, which is like a dialogue data set between user and chat GPT conversation. It turns out to be like pretty high quality data, dialogue data. So we fine tune on it and then we train it and release the model called V2. And people were very excited about it because it kind of like demonstrate open way model can reach this conversation capability similar to chat GPT. And then we basically release the model with and also build a demo website for the model. People were very excited about it. But during the development, the biggest challenge to us at the time was like, how do we even evaluate it? How do we even argue this model we trained is better than others? And then what's the gap between this open source model that other proprietary offering? At that time, it was like GPT-4 was just announced and it's like Cloud One. What's the difference between them? And then after that, like every week, there's a new model being fine tuned, released. So even until still now, right? And then we have that demo website for V2 now. And then we thought like, okay, maybe we can add a few more of the model as well, like API model as well. And then we quickly realized that people need a tool to compare between different models. So we have like a side by side UI implemented on the website to that people choose, you know, compare. And we quickly realized that maybe we can do something like, like a battle on top of ECLMs, like just anonymize it, anonymize the identity, and that people vote which one is better. So the community decides which one is better, not us, not us arguing, you know, our model is better or what. And that turns out to be like, people are very excited about this idea. And then we tweet, we launch, and that's, yeah, that's April, May. And then it was like first two, three weeks, like just a few hundred thousand views tweet on our launch tweets. And then we have regularly double update weekly, beginning at a time, adding new model GPT-4 as well. So it was like, that was the, you know, the initial.Anastasios [00:04:58]: Another pivotal moment, just to jump in, would be private models, like the GPT, I'm a little,Wei Lin [00:05:04]: I'm a little chatty. That was this year. That was this year.Anastasios [00:05:07]: Huge.Wei Lin [00:05:08]: That was also huge.Alessio [00:05:09]: In the beginning, I saw the initial release was May 3rd of the beta board. On April 6, we did a benchmarks 101 episode for a podcast, just kind of talking about, you know, how so much of the data is like in the pre-training corpus and blah, blah, blah. And like the benchmarks are really not what we need to evaluate whether or not a model is good. Why did you not make a benchmark? Maybe at the time, you know, it was just like, Hey, let's just put together a whole bunch of data again, run a, make a score that seems much easier than coming out with a whole website where like users need to vote. Any thoughts behind that?Wei Lin [00:05:41]: I think it's more like fundamentally, we don't know how to automate this kind of benchmarks when it's more like, you know, conversational, multi-turn, and more open-ended task that may not come with a ground truth. So let's say if you ask a model to help you write an email for you for whatever purpose, there's no ground truth. How do you score them? Or write a story or a creative story or many other things like how we use ChatterBee these days. It's more open-ended. You know, we need human in the loop to give us feedback, which one is better. And I think nuance here is like, sometimes it's also hard for human to give the absolute rating. So that's why we have this kind of pairwise comparison, easier for people to choose which one is better. So from that, we use these pairwise comparison, those to calculate the leaderboard. Yeah. You can add more about this methodology.Anastasios [00:06:40]: Yeah. I think the point is that, and you guys probably also talked about this at some point, but static benchmarks are intrinsically, to some extent, unable to measure generative model performance. And the reason is because you cannot pre-annotate all the outputs of a generative model. You change the model, it's like the distribution of your data is changing. New labels to deal with that. New labels are great automated labeling, right? Which is why people are pursuing both. And yeah, static benchmarks, they allow you to zoom in to particular types of information like factuality, historical facts. We can build the best benchmark of historical facts, and we will then know that the model is great at historical facts. But ultimately, that's not the only axis, right? And we can build 50 of them, and we can evaluate 50 axes. But it's just so, the problem of generative model evaluation is just so expansive, and it's so subjective, that it's just maybe non-intrinsically impossible, but at least we don't see a way. We didn't see a way of encoding that into a fixed benchmark.Wei Lin [00:07:47]: But on the other hand, I think there's a challenge where this kind of online dynamic benchmark is more expensive than static benchmark, offline benchmark, where people still need it. Like when they build models, they need static benchmark to track where they are.Anastasios [00:08:03]: It's not like our benchmark is uniformly better than all other benchmarks, right? It just measures a different kind of performance that has proved to be useful.Swyx [00:08:14]: You guys also published MTBench as well, which is a static version, let's say, of Chatbot Arena, right? That people can actually use in their development of models.Wei Lin [00:08:25]: Right. I think one of the reasons we still do this static benchmark, we still wanted to explore, experiment whether we can automate this, because people, eventually, model developers need it to fast iterate their model. So that's why we explored LM as a judge, and ArenaHard, trying to filter, select high-quality data we collected from Chatbot Arena, the high-quality subset, and use that as a question and then automate the judge pipeline, so that people can quickly get high-quality signal, benchmark signals, using this online benchmark.Swyx [00:09:03]: As a community builder, I'm curious about just the initial early days. Obviously when you offer effectively free A-B testing inference for people, people will come and use your arena. What do you think were the key unlocks for you? Was it funding for this arena? Was it marketing? When people came in, do you see a noticeable skew in the data? Which obviously now you have enough data sets, you can separate things out, like coding and hard prompts, but in the early days, it was just all sorts of things.Anastasios [00:09:31]: Yeah, maybe one thing to establish at first is that our philosophy has always been to maximize organic use. I think that really does speak to your point, which is, yeah, why do people come? They came to use free LLM inference, right? And also, a lot of users just come to the website to use direct chat, because you can chat with the model for free. And then you could think about it like, hey, let's just be kind of like more on the selfish or conservative or protectionist side and say, no, we're only giving credits for people that battle or so on and so forth. Strategy wouldn't work, right? Because what we're trying to build is like a big funnel, a big funnel that can direct people. And some people are passionate and interested and they battle. And yes, the distribution of the people that do that is different. It's like, as you're pointing out, it's like, that's not as they're enthusiastic.Wei Lin [00:10:24]: They're early adopters of this technology.Anastasios [00:10:27]: Or they like games, you know, people like this. And we've run a couple of surveys that indicate this as well, of our user base.Wei Lin [00:10:36]: We do see a lot of developers come to the site asking polling questions, 20-30%. Yeah, 20-30%.Anastasios [00:10:42]: It's obviously not reflective of the general population, but it's reflective of some corner of the world of people that really care. And to some extent, maybe that's all right, because those are like the power users. And you know, we're not trying to claim that we represent the world, right? We represent the people that come and vote.Swyx [00:11:02]: Did you have to do anything marketing-wise? Was anything effective? Did you struggle at all? Was it success from day one?Wei Lin [00:11:09]: At some point, almost done. Okay. Because as you can imagine, this leaderboard depends on community engagement participation. If no one comes to vote tomorrow, then no leaderboard.Anastasios [00:11:23]: So we had some period of time when the number of users was just, after the initial launch, it went lower. Yeah. And, you know, at some point, it did not look promising. Actually, I joined the project a couple months in to do the statistical aspects, right? As you can imagine, that's how it kind of hooked into my previous work. At that time, it wasn't like, you know, it definitely wasn't clear that this was like going to be the eval or something. It was just like, oh, this is a cool project. Like Wayland seems awesome, you know, and that's it.Wei Lin [00:11:56]: Definitely. There's in the beginning, because people don't know us, people don't know what this is for. So we had a hard time. But I think we were lucky enough that we have some initial momentum. And as well as the competition between model providers just becoming, you know, became very intense. Intense. And then that makes the eval onto us, right? Because always number one is number one.Anastasios [00:12:23]: There's also an element of trust. Our main priority in everything we do is trust. We want to make sure we're doing everything like all the I's are dotted and the T's are crossed and nobody gets unfair treatment and people can see from our profiles and from our previous work and from whatever, you know, we're trustworthy people. We're not like trying to make a buck and we're not trying to become famous off of this or that. It's just, we're trying to provide a great public leaderboard community venture project.Wei Lin [00:12:51]: Yeah.Swyx [00:12:52]: Yes. I mean, you are kind of famous now, you know, that's fine. Just to dive in more into biases and, you know, some of this is like statistical control. The classic one for human preference evaluation is humans demonstrably prefer longer contexts or longer outputs, which is actually something that we don't necessarily want. You guys, I think maybe two months ago put out some length control studies. Apart from that, there are just other documented biases. Like, I'd just be interested in your review of what you've learned about biases and maybe a little bit about how you've controlled for them.Anastasios [00:13:32]: At a very high level, yeah. Humans are biased. Totally agree. Like in various ways. It's not clear whether that's good or bad, you know, we try not to make value judgments about these things. We just try to describe them as they are. And our approach is always as follows. We collect organic data and then we take that data and we mine it to get whatever insights we can get. And, you know, we have many millions of data points that we can now use to extract insights from. Now, one of those insights is to ask the question, what is the effect of style, right? You have a bunch of data, you have votes, people are voting either which way. We have all the conversations. We can say what components of style contribute to human preference and how do they contribute? Now, that's an important question. Why is that an important question? It's important because some people want to see which model would be better if the lengths of the responses were the same, were to be the same, right? People want to see the causal effect of the model's identity controlled for length or controlled for markdown, number of headers, bulleted lists, is the text bold? Some people don't, they just don't care about that. The idea is not to impose the judgment that this is not important, but rather to say ex post facto, can we analyze our data in a way that decouples all the different factors that go into human preference? Now, the way we do this is via statistical regression. That is to say the arena score that we show on our leaderboard is a particular type of linear model, right? It's a linear model that takes, it's a logistic regression that takes model identities and fits them against human preference, right? So it regresses human preference against model identity. What you get at the end of that logistic regression is a parameter vector of coefficients. And when the coefficient is large, it tells you that GPT 4.0 or whatever, very large coefficient, that means it's strong. And that's exactly what we report in the table. It's just the predictive effect of the model identity on the vote. The other thing that you can do is you can take that vector, let's say we have M models, that is an M dimensional vector of coefficients. What you can do is you say, hey, I also want to understand what the effect of length is. So I'll add another entry to that vector, which is trying to predict the vote, right? That tells me the difference in length between two model responses. So we have that for all of our data. We can compute it ex post facto. We added it into the regression and we look at that predictive effect. And then the idea, and this is formally true under certain conditions, not always verifiable ones, but the idea is that adding that extra coefficient to this vector will kind of suck out the predictive power of length and put it into that M plus first coefficient and quote, unquote, de-bias the rest so that the effect of length is not included. And that's what we do in style control. Now we don't just do it for M plus one. We have, you know, five, six different style components that have to do with markdown headers and bulleted lists and so on that we add here. Now, where is this going? You guys see the idea. It's a general methodology. If you have something that's sort of like a nuisance parameter, something that exists and provides predictive value, but you really don't want to estimate that. You want to remove its effect. In causal inference, these things are called like confounders often. What you can do is you can model the effect. You can put them into your model and try to adjust for them. So another one of those things might be cost. You know, what if I want to look at the cost adjusted performance of my model, which models are punching above their weight, parameter count, which models are punching above their weight in terms of parameter count, we can ex post facto measure that. We can do it without introducing anything that compromises the organic nature of theWei Lin [00:17:17]: data that we collect.Anastasios [00:17:18]: Hopefully that answers the question.Wei Lin [00:17:20]: It does.Swyx [00:17:21]: So I guess with a background in econometrics, this is super familiar.Anastasios [00:17:25]: You're probably better at this than me for sure.Swyx [00:17:27]: Well, I mean, so I used to be, you know, a quantitative trader and so, you know, controlling for multiple effects on stock price is effectively the job. So it's interesting. Obviously the problem is proving causation, which is hard, but you don't have to do that.Anastasios [00:17:45]: Yes. Yes, that's right. And causal inference is a hard problem and it goes beyond statistics, right? It's like you have to build the right causal model and so on and so forth. But we think that this is a good first step and we're sort of looking forward to learning from more people. You know, there's some good people at Berkeley that work on causal inference for the learning from them on like, what are the really most contemporary techniques that we can use in order to estimate true causal effects if possible.Swyx [00:18:10]: Maybe we could take a step through the other categories. So style control is a category. It is not a default. I have thought that when you wrote that blog post, actually, I thought it would be the new default because it seems like the most obvious thing to control for. But you also have other categories, you have coding, you have hard prompts. We consider that.Anastasios [00:18:27]: We're still actively considering it. It's just, you know, once you make that step, once you take that step, you're introducing your opinion and I'm not, you know, why should our opinion be the one? That's kind of a community choice. We could put it to a vote.Wei Lin [00:18:39]: We could pass.Anastasios [00:18:40]: Yeah, maybe do a poll. Maybe do a poll.Swyx [00:18:42]: I don't know. No opinion is an opinion.Wei Lin [00:18:44]: You know what I mean?Swyx [00:18:45]: Yeah.Wei Lin [00:18:46]: There's no neutral choice here.Swyx [00:18:47]: Yeah. You have all these others. You have instruction following too. What are your favorite categories that you like to talk about? Maybe you tell a little bit of the stories, tell a little bit of like the hard choices that you had to make.Wei Lin [00:18:57]: Yeah. Yeah. Yeah. I think the, uh, initially the reason why we want to add these new categories is essentially to answer some of the questions from our community, which is we won't have a single leaderboard for everything. So these models behave very differently in different domains. Let's say this model is trend for coding, this model trend for more technical questions and so on. On the other hand, to answer people's question about like, okay, what if all these low quality, you know, because we crowdsource data from the internet, there will be noise. So how do we de-noise? How do we filter out these low quality data effectively? So that was like, you know, some questions we want to answer. So basically we spent a few months, like really diving into these questions to understand how do we filter all these data because these are like medias of data points. And then if you want to re-label yourself, it's possible, but we need to kind of like to automate this kind of data classification pipeline for us to effectively categorize them to different categories, say coding, math, structure, and also harder problems. So that was like, the hope is when we slice the data into these meaningful categories to give people more like better signals, more direct signals, and that's also to clarify what we are actually measuring for, because I think that's the core part of the benchmark. That was the initial motivation. Does that make sense?Anastasios [00:20:27]: Yeah. Also, I'll just say, this does like get back to the point that the philosophy is to like mine organic, to take organic data and then mine it x plus factor.Alessio [00:20:35]: Is the data cage-free too, or just organic?Anastasios [00:20:39]: It's cage-free.Wei Lin [00:20:40]: No GMO. Yeah. And all of these efforts are like open source, like we open source all of the data cleaning pipeline, filtering pipeline. Yeah.Swyx [00:20:50]: I love the notebooks you guys publish. Actually really good just for learning statistics.Wei Lin [00:20:54]: Yeah. I'll share this insights with everyone.Alessio [00:20:59]: I agree on the initial premise of, Hey, writing an email, writing a story, there's like no ground truth. But I think as you move into like coding and like red teaming, some of these things, there's like kind of like skill levels. So I'm curious how you think about the distribution of skill of the users. Like maybe the top 1% of red teamers is just not participating in the arena. So how do you guys think about adjusting for it? And like feels like this where there's kind of like big differences between the average and the top. Yeah.Anastasios [00:21:29]: Red teaming, of course, red teaming is quite challenging. So, okay. Moving back. There's definitely like some tasks that are not as subjective that like pairwise human preference feedback is not the only signal that you would want to measure. And to some extent, maybe it's useful, but it may be more useful if you give people better tools. For example, it'd be great if we could execute code with an arena, be fantastic.Wei Lin [00:21:52]: We want to do it.Anastasios [00:21:53]: There's also this idea of constructing a user leaderboard. What does that mean? That means some users are better than others. And how do we measure that? How do we quantify that? Hard in chatbot arena, but where it is easier is in red teaming, because in red teaming, there's an explicit game. You're trying to break the model, you either win or you lose. So what you can do is you can say, Hey, what's really happening here is that the models and humans are playing a game against one another. And then you can use the same sort of Bradley Terry methodology with some, some extensions that we came up with in one of you can read one of our recent blog posts for, for the sort of theoretical extensions. You can attribute like strength back to individual players and jointly attribute strength to like the models that are in this jailbreaking game, along with the target tasks, like what types of jailbreaks you want.Wei Lin [00:22:44]: So yeah.Anastasios [00:22:45]: And I think that this is, this is a hugely important and interesting avenue that we want to continue researching. We have some initial ideas, but you know, all thoughts are welcome.Wei Lin [00:22:54]: Yeah.Alessio [00:22:55]: So first of all, on the code execution, the E2B guys, I'm sure they'll be happy to helpWei Lin [00:22:59]: you.Alessio [00:23:00]: I'll please set that up. They're big fans. We're investors in a company called Dreadnought, which we do a lot in AI red teaming. I think to me, the most interesting thing has been, how do you do sure? Like the model jailbreak is one side. We also had Nicola Scarlini from DeepMind on the podcast, and he was talking about, for example, like, you know, context stealing and like a weight stealing. So there's kind of like a lot more that goes around it. I'm curious just how you think about the model and then maybe like the broader system, even with Red Team Arena, you're just focused on like jailbreaking of the model, right? You're not doing kind of like any testing on the more system level thing of the model where like, maybe you can get the training data back, you're going to exfiltrate some of the layers and the weights and things like that.Wei Lin [00:23:43]: So right now, as you can see, the Red Team Arena is at a very early stage and we are still exploring what could be the potential new games we can introduce to the platform. So the idea is still the same, right? And we build a community driven project platform for people. They can have fun with this website, for sure. That's one thing, and then help everyone to test these models. So one of the aspects you mentioned is stealing secrets, stealing training sets. That could be one, you know, it could be designed as a game. Say, can you still use their credential, you know, we hide, maybe we can hide the credential into system prompts and so on. So there are like a few potential ideas we want to explore for sure. Do you want to add more?Anastasios [00:24:28]: I think that this is great. This idea is a great one. There's a lot of great ideas in the Red Teaming space. You know, I'm not personally like a Red Teamer. I don't like go around and Red Team models, but there are people that do that and they're awesome. They're super skilled. When I think about the Red Team arena, I think those are really the people that we're building it for. Like, we want to make them excited and happy, build tools that they like. And just like chatbot arena, we'll trust that this will end up being useful for the world. And all these people are, you know, I won't say all these people in this community are actually good hearted, right? They're not doing it because they want to like see the world burn. They're doing it because they like, think it's fun and cool. And yeah. Okay. Maybe they want to see, maybe they want a little bit.Wei Lin [00:25:13]: I don't know. Majority.Anastasios [00:25:15]: Yeah.Wei Lin [00:25:16]: You know what I'm saying.Anastasios [00:25:17]: So, you know, trying to figure out how to serve them best, I think, I don't know where that fits. I just, I'm not expressing. And give them credits, right?Wei Lin [00:25:24]: And give them credit.Anastasios [00:25:25]: Yeah. Yeah. So I'm not trying to express any particular value judgment here as to whether that's the right next step. It's just, that's sort of the way that I think we would think about it.Swyx [00:25:35]: Yeah. We also talked to Sander Schulhoff of the HackerPrompt competition, and he's pretty interested in Red Teaming at scale. Let's just call it that. You guys maybe want to talk with him.Wei Lin [00:25:45]: Oh, nice.Swyx [00:25:46]: We wanted to cover a little, a few topical things and then go into the other stuff that your group is doing. You know, you're not just running Chatbot Arena. We can also talk about the new website and your future plans, but I just wanted to briefly focus on O1. It is the hottest, latest model. Obviously, you guys already have it on the leaderboard. What is the impact of O1 on your evals?Wei Lin [00:26:06]: Made our interface slower.Anastasios [00:26:07]: It made it slower.Swyx [00:26:08]: Yeah.Wei Lin [00:26:10]: Because it needs like 30, 60 seconds, sometimes even more to, the latency is like higher. So that's one. Sure. But I think we observe very interesting things from this model as well. Like we observe like significant improvement in certain categories, like more technical or math. Yeah.Anastasios [00:26:32]: I think actually like one takeaway that was encouraging is that I think a lot of people before the O1 release were thinking, oh, like this benchmark is saturated. And why were they thinking that? They were thinking that because there was a bunch of models that were kind of at the same level. They were just kind of like incrementally competing and it sort of wasn't immediately obvious that any of them were any better. Nobody, including any individual person, it's hard to tell. But what O1 did is it was, it's clearly a better model for certain tasks. I mean, I used it for like proving some theorems and you know, there's some theorems that like only I know because I still do a little bit of theory. Right. So it's like, I can go in there and ask like, oh, how would you prove this exact thing? Which I can tell you has never been in the public domain. It'll do it. It's like, what?Wei Lin [00:27:19]: Okay.Anastasios [00:27:20]: So there's this model and it crushed the benchmark. You know, it's just like really like a big gap. And what that's telling us is that it's not saturated yet. It's still measuring some signal. That was encouraging. The point, the takeaway is that the benchmark is comparative. There's no absolute number. There's no maximum ELO. It's just like, if you're better than the rest, then you win. I think that was actually quite helpful to us.Swyx [00:27:46]: I think people were criticizing, I saw some of the academics criticizing it as not apples to apples. Right. Like, because it can take more time to reason, it's basically doing some search, doing some chain of thought that if you actually let the other models do that same thing, they might do better.Wei Lin [00:28:03]: Absolutely.Anastasios [00:28:04]: To be clear, none of the leaderboard currently is apples to apples because you have like Gemini Flash, you have, you know, all sorts of tiny models like Lama 8B, like 8B and 405B are not apples to apples.Wei Lin [00:28:19]: Totally agree. They have different latencies.Anastasios [00:28:21]: Different latencies.Wei Lin [00:28:22]: Control for latency. Yeah.Anastasios [00:28:24]: Latency control. That's another thing. We can do style control, but latency control. You know, things like this are important if you want to understand the trade-offs involved in using AI.Swyx [00:28:34]: O1 is a developing story. We still haven't seen the full model yet, but it's definitely a very exciting new paradigm. I think one community controversy I just wanted to give you guys space to address is the collaboration between you and the large model labs. People have been suspicious, let's just say, about how they choose to A-B test on you. I'll state the argument and let you respond, which is basically they run like five anonymous models and basically argmax their Elo on LMSYS or chatbot arena, and they release the best one. Right? What has been your end of the controversy? How have you decided to clarify your policy going forward?Wei Lin [00:29:15]: On a high level, I think our goal here is to build a fast eval for everyone, and including everyone in the community can see the data board and understand, compare the models. More importantly, I think we want to build the best eval also for model builders, like all these frontier labs building models. They're also internally facing a challenge, which is how do they eval the model? That's the reason why we want to partner with all the frontier lab people, and then to help them testing. That's one of the... We want to solve this technical challenge, which is eval. Yeah.Anastasios [00:29:54]: I mean, ideally, it benefits everyone, right?Wei Lin [00:29:56]: Yeah.Anastasios [00:29:57]: And people also are interested in seeing the leading edge of the models. People in the community seem to like that. Oh, there's a new model up. Is this strawberry? People are excited. People are interested. Yeah. And then there's this question that you bring up of, is it actually causing harm?Wei Lin [00:30:15]: Right?Anastasios [00:30:16]: Is it causing harm to the benchmark that we are allowing this private testing to happen? Maybe stepping back, why do you have that instinct? The reason why you and others in the community have that instinct is because when you look at something like a benchmark, like an image net, a static benchmark, what happens is that if I give you a million different models that are all slightly different, and I pick the best one, there's something called selection bias that plays in, which is that the performance of the winning model is overstated. This is also sometimes called the winner's curse. And that's because statistical fluctuations in the evaluation, they're driving which model gets selected as the top. So this selection bias can be a problem. Now there's a couple of things that make this benchmark slightly different. So first of all, the selection bias that you include when you're only testing five models is normally empirically small.Wei Lin [00:31:12]: And that's why we have these confidence intervals constructed.Anastasios [00:31:16]: That's right. Yeah. Our confidence intervals are actually not multiplicity adjusted. One thing that we could do immediately tomorrow in order to address this concern is if a model provider is testing five models and they want to release one, and we're constructing the models at level one minus alpha, we can just construct the intervals instead at level one minus alpha divided by five. That's called Bonferroni correction. What that'll tell you is that the final performance of the model, the interval that gets constructed, is actually formally correct. We don't do that right now, partially because we know from simulations that the amount of selection bias you incur with these five things is just not huge. It's not huge in comparison to the variability that you get from just regular human voters. So that's one thing. But then the second thing is the benchmark is live, right? So what ends up happening is it'll be a small magnitude, but even if you suffer from the winner's curse after testing these five models, what'll happen is that over time, because we're getting new data, it'll get adjusted down. So if there's any bias that gets introduced at that stage, in the long run, it actually doesn't matter. Because asymptotically, basically in the long run, there's way more fresh data than there is data that was used to compare these five models against these private models.Swyx [00:32:35]: The announcement effect is only just the first phase and it has a long tail.Anastasios [00:32:39]: Yeah, that's right. And it sort of like automatically corrects itself for this selection adjustment.Swyx [00:32:45]: Every month, I do a little chart of Ellim's ELO versus cost, just to track the price per dollar, the amount of like, how much money do I have to pay for one incremental point in ELO? And so I actually observe an interesting stability in most of the ELO numbers, except for some of them. For example, GPT-4-O August has fallen from 12.90

Building the AI Engineer Nation — with Josephine Teo, Minister of Digital Development and Information, Singapore

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Oct 19, 2024 56:39

Singapore's GovTech is hosting an AI CTF challenge with ~$15,000 in prizes, starting October 26th, open to both local and virtual hackers. It will be hosted on Dreadnode's Crucible platform; signup here!It is common to say if you want to work in AI, you should come to San Francisco. Not everyone can. Not everyone should. If you can only do meaningful AI work in one city, then AI has failed to generalize meaningfully.As non-Americans working in the US, we know what it's like to see AI progress so rapidly here, and yet be at a loss for what our home countries can do. Through Latent Space we've tried to tell the story of AI outside of the Bay Area bubble; we talked to Notion in New York and Humanloop and Wondercraft in London and HuggingFace in Paris and ICLR in Vienna, and the Reka, RWKV, and Winds of AI Winter episodes were taped in Singapore (the World's Fair also had Latin America representation and we intend to at least add China, Japan, and India next year).The Role of Government with AIAs an intentionally technical resource, we've mostly steered clear of regulation and safety debates on the podcast; whether it is safety bills or technoalarmism, often at the cost of our engagement numbers or ability to book big name guests with a political agenda. When SOTA shifts 3x faster than it takes to pass a law, when nobody agrees on definitions of important things, when you can elicit never-before-seen behavior by slightly different prompting or sampling, it is hard enough to simply keep up to speed, so we are happy limiting our role to that. The story of AI progress has more often been achieved in the private sector, usually in spite of, rather than with thanks to, government intervention.But industrial policy is inextricably linked to the business of AI, which we do very much care about, has an explicitly accelerationist intent if not impact, and has a track record of success in correcting for legitimate market failures in private sector investment, particularly outside of the US. It is with this lens we approach today's episode and special guest, our first with a sitting Cabinet member.Singapore's National AI StrategyIt is well understood that much of Singapore's economic success is attributable to industrial policy, from direct efforts like the Jurong Town Corporation industrialization to indirect ones like going all in on English as national first language. Singapore's National AI Strategy grew out of its 2014 Smart Nation initiative, first launched in 2019 and then refreshed in 2023 by Minister Josephine Teo, our guest today.While Singapore is not often thought of as an AI leader, the National University ranks in the top 10 in publications (above Oxford/Harvard!), and many overseas Singaporeans work at the leading AI companies and institutions in the US (and some of us even run leading AI Substacks?). OpenAI has often publicly named the Singapore government as their model example of government collaborator and is opening an office in Singapore in time for DevDay 2024.AI Engineer NationsSwyx first pitched the AI Engineer Nation concept at a private Sovereign AI summit featuring Dr. He Ruimin, Chief AI Officer of Singapore, which eventually led to an invitation to discuss the concept with Minister Teo, the country's de-facto minister for tech (she calls it Digital Development, for good reasons she explains in the pod).This chat happened (with thanks to Jing Long, Joyce, and other folks from MDDI)!The central pitch for any country, not just Singapore, to emphasize and concentrate bets on AI Engineers, compared with other valuable efforts like training more researchers, releasing more government-approved data, or offering more AI funding, is a calculated one, based on the fact that: * GPU clusters and researchers have massive returns to scale and colocation, mostly concentrated in the US, that are irresponsibly expensive to replicate* Even if research stopped today and there was no progress for the next 30 years, there are far more capabilities to unlock and productize from existing foundation models and we

#418 - Clément Delangue - Hugging Face - 4,5 milliards de valo avec un produit gratuit à 99%

Génération Do It Yourself

Play Episode Listen Later Sep 18, 2024 146:25

La première plateforme dédiée à l'IA est complètement ignorée.Hugging Face, licorne franco-américaine, est la référence mondiale pour tous les acteurs de l'IA, aussi bien pour les amateurs que les développeurs et même les Big Tech comme Apple, Amazon, Microsoft, Meta, Nvidia etc.Son fondateur Clément Delangue revient sur le développement fulgurant d'Hugging Face depuis son premier passage sur GDIY (#238) avec une valorisation passée de 440 millions à 4,5 milliards de dollars.Son objectif : démocratiser l'accès à l'intelligence artificielle, permettre la collaboration et l'avancée de la recherche scientifique.La plateforme recueille aujourd'hui plus de 2,5 millions modèles et jeux de données dont la moitié est en open source.Sur les 5 millions d'utilisateurs, seul 1% utilisent la version payante : et pourtant, Hugging Face est rentable.“C'est la période la plus importante de ces 50 dernières années, on est vraiment au début d'un nouveau paradigme. Ce qui veut dire que toutes les cartes restent à distribuer.”Au coeur du réacteur du développement de l'IA, Clément partage les dessous de la révolution en cours :L'importance de l'Open Source dans la rechercheLa Chine en passe de devenir leader mondialL'enjeu réel derrière les cas d'usages de l'IALa réalité derrière les fantasmes de l'Intelligence Artificielle Générale (AGI)Pourquoi le LLM n'est que la partie émergée de l'icebergLes prochaines mutations du monde du travailComment développer une entreprise en restant aligné sur ses valeurs ?Un épisode clé pour comprendre la prochaine vague de l'IA et ses enjeux socio-économiques. Accessible à tous peu importe le niveau de connaissances techniques, une mine d'or pour saisir les plus belles opportunités de notre époque.TIMELINE:00:00:00 : L'équilibre vie pro et perso00:13:02 : La mission d'Hugging Face00:21:47 : La technologie doit se mettre au service de l'usage : sur quoi se former ?00:29:25 : Faut-il craindre l'IA générale ?00:40:36 : Comprendre l'évolution des modèles sur les différentes modalités (texte, son et image)00:50:54 : Vers une fin de l'hégémonie d'Open AI ?00:58:23 : Les cas d'usage révolutionnaires et la consommation électrique01:07:11 : Une IA qui tourne en local sur un smartphone01:13:16 : L'inconvénient de la valorisation et l'ambition d'Hugging Face01:27:24 : La Chine va devenir leader sur le secteur de l'IA01:40:20 : La mutation du monde du travail01:49:38 : Racheter des entreprises et direction l'IPO02:00:20 : Comment tenir sur le temps long et rester aligné avec ses valeurs02:07:24 : Comment tirer parti de ce nouveau paradigme02:13:19 : Le paradoxe de la performanceLes anciens épisodes de GDIY mentionnés :#238 - Clément Delangue - Hugging Face - Démocratiser le machine learning pour impacter des milliards d'individus#372 - Alexandre Jenny - Pixfield - L'incroyable histoire du geek de Chambéry derrière la GoPro 360#414 - Florian Douetteau - Dataiku - La prochaine grande vague de l'IA : l'adopter ou périr ?#397 - Yann Le Cun - Chief AI Scientist chez Meta - L'Intelligence Artificielle Générale ne viendra pas de Chat GPT#409 - Alexandre Jardin - Auteur, yourscrib.ai - Peut-on laisser la folie gouverner sa vie ?#28 Pierre Valade SUNRISE - comment se faire racheter 100 millions par Microsoft#353 - Stanislas Polu - Dust - La vérité sur ce que l'IA nous réserveNous avons parlé de :Hugging FaceMistral AIOpenAIAnthropicOcusFinegrainHuggingChatHuggingChat (Apple Store)HuggingChat (Play Store)DeepMindNablaArgillaLLM : large language model (modèles de langage à grande échelle)Manifestations de la place Tian'anmenLes recommandations de lecture : Why Greatness Cannot Be Planned de Kenneth O. O. Stanley et Joel LehmanVous pouvez contacter Clément sur Linkedin, X.La musique du générique vous plaît ? C'est à Morgan Prudhomme que je la dois ! Contactez-le sur : https://studio-module.com. Vous souhaitez sponsoriser Génération Do It Yourself ou nous proposer un partenariat ? Contactez mon label Orso Media via ce formulaire.

LW - LLM Applications I Want To See by sarahconstantin

The Nonlinear Library

Play Episode Listen Later Aug 20, 2024 12:37

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLM Applications I Want To See, published by sarahconstantin on August 20, 2024 on LessWrong. I'm convinced that people who are interested in large language models (LLMs) are overwhelmingly focused on general-purpose "performance" at the expense of exploring useful (or fun) applications. As I'm working on a personal project, I've been learning my way around HuggingFace, which is a hosting platform, set of libraries, and almost-social-network for the open-source AI community. It's fascinating, and worth exploring even if you're not going to be developing foundation models from scratch yourself; if you simply want to use the latest models, build apps around them, or adapt them slightly to your own purposes, HuggingFace seems like the clear place to go. You can look at trending models, and trending public "spaces", aka cloud-hosted instances of models that users can test out, and get a sense of where the "energy" is. And what I see is that almost all the "energy" in LLMs is on general-purpose models, competing on general-purpose question-answering benchmarks, sometimes specialized to particular languages, or to math or coding. "How can I get something that behaves basically like ChatGPT or Claude or Gemini, but gets fewer things wrong, and ideally requires less computing power and and gets the answer faster?" is an important question, but it's far from the only interesting one! If I really search I can find "interesting" specialized applications like "predicts a writer's OCEAN personality scores based on a text sample" or "uses abliteration to produce a wholly uncensored chatbot that will indeed tell you how to make a pipe bomb" but mostly…it's general-purpose models. Not applications for specific uses that I might actually try. And some applications seem to be eager to go to the most creepy and inhumane use cases. No, I don't want little kids talking to a chatbot toy, especially. No, I don't want a necklace or pair of glasses with a chatbot I can talk to. (In public? Imagine the noise pollution!) No, I certainly don't want a bot writing emails for me! Even the stuff I found potentially cool (an AI diary that analyzes your writing and gives you personalized advice) ended up being, in practice, so preachy that I canceled my subscription. In the short term, of course, the most economically valuable thing to do with LLMs is duplicating human labor, so it makes sense that the priority application is autogenerated code. But the most creative and interesting potential applications go beyond "doing things humans can already do, but cheaper" to do things that humans can't do at all on comparable scale. A Personalized Information Environment To some extent, social media, search, and recommendation engines were supposed to enable us to get the "content" we want. And mostly, to the extent that's turned out to be a disappointment, people complain that getting exactly what you want is counterproductive - filter bubbles, superstimuli, etc. But I find that we actually have incredibly crude tools for getting what we want. We can follow or unfollow, block or mute people; we can upvote and downvote pieces of content and hope "the algorithm" feeds us similar results; we can mute particular words or tags. But what we can't do, yet, is define a "quality" we're looking for, or a "genre" or a "vibe", and filter by that criterion. The old tagging systems (on Tumblr or AO3 or Delicious, or back when hashtags were used unironically on Twitter) were the closest approximation to customizable selectivity, and they're still pretty crude. We can do a lot better now. Personalized Content Filter This is a browser extension. You teach the LLM, by highlighting and saving examples, what you consider to be "unwanted" content that you'd prefer not to see. The model learns a classifier to sort all text in yo...

ai chatgpt ocean speech applications ea gemini delicious tumblr llm ao3 rationalist huggingface lesswrong

AI Magic: Shipping 1000s of successful products with no managers and a team of 12 — Jeremy Howard of Answer.ai

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Aug 16, 2024 58:56

Disclaimer: We recorded this episode ~1.5 months ago, timing for the FastHTML release. It then got bottlenecked by Llama3.1, Winds of AI Winter, and SAM2 episodes, so we're a little late. Since then FastHTML was released, swyx is building an app in it for AINews, and Anthropic has also released their prompt caching API. Remember when Dylan Patel of SemiAnalysis coined the GPU Rich vs GPU Poor war? (if not, see our pod with him). The idea was that if you're GPU poor you shouldn't waste your time trying to solve GPU rich problems (i.e. pre-training large models) and are better off working on fine-tuning, optimized inference, etc. Jeremy Howard (see our “End of Finetuning” episode to catchup on his background) and Eric Ries founded Answer.AI to do exactly that: “Practical AI R&D”, which is very in-line with the GPU poor needs. For example, one of their first releases was a system based on FSDP + QLoRA that let anyone train a 70B model on two NVIDIA 4090s. Since then, they have come out with a long list of super useful projects (in no particular order, and non-exhaustive):* FSDP QDoRA: this is just as memory efficient and scalable as FSDP/QLoRA, and critically is also as accurate for continued pre-training as full weight training.* Cold Compress: a KV cache compression toolkit that lets you scale sequence length without impacting speed.* colbert-small: state of the art retriever at only 33M params* JaColBERTv2.5: a new state-of-the-art retrievers on all Japanese benchmarks.* gpu.cpp: portable GPU compute for C++ with WebGPU.* Claudette: a better Anthropic API SDK. They also recently released FastHTML, a new way to create modern interactive web apps. Jeremy recently released a 1 hour “Getting started” tutorial on YouTube; while this isn't AI related per se, but it's close to home for any AI Engineer who are looking to iterate quickly on new products: In this episode we broke down 1) how they recruit 2) how they organize what to research 3) and how the community comes together. At the end, Jeremy gave us a sneak peek at something new that he's working on that he calls dialogue engineering: So I've created a new approach. It's not called prompt engineering. I'm creating a system for doing dialogue engineering. It's currently called AI magic. I'm doing most of my work in this system and it's making me much more productive than I was before I used it.He explains it a bit more ~44:53 in the pod, but we'll just have to wait for the public release to figure out exactly what he means.Timestamps* [00:00:00] Intro by Suno AI* [00:03:02] Continuous Pre-Training is Here* [00:06:07] Schedule-Free Optimizers and Learning Rate Schedules* [00:07:08] Governance and Structural Issues within OpenAI and Other AI Labs* [00:13:01] How Answer.ai works* [00:23:40] How to Recruit Productive Researchers* [00:27:45] Building a new BERT* [00:31:57] FSDP, QLoRA, and QDoRA: Innovations in Fine-Tuning Large Models* [00:36:36] Research and Development on Model Inference Optimization* [00:39:49] FastHTML for Web Application Development* [00:46:53] AI Magic & Dialogue Engineering* [00:52:19] AI wishlist & predictionsShow Notes* Jeremy Howard* Previously on Latent Space: The End of Finetuning, NeurIPS Startups* Answer.ai* Fast.ai* FastHTML* answerai-colbert-small-v1* gpu.cpp* Eric Ries* Aaron DeFazio* Yi Tai* Less Wright* Benjamin Warner* Benjamin Clavié* Jono Whitaker* Austin Huang* Eric Gilliam* Tim Dettmers* Colin Raffel* Sebastian Raschka* Carson Gross* Simon Willison* Sepp Hochreiter* Llama3.1 episode* Snowflake Arctic* Ranger Optimizer* Gemma.cpp* HTMX* UL2* BERT* DeBERTa* Efficient finetuning of Llama 3 with FSDP QDoRA* xLSTMTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:14]: And today we're back with Jeremy Howard, I think your third appearance on Latent Space. Welcome.Jeremy [00:00:19]: Wait, third? Second?Swyx [00:00:21]: Well, I grabbed you at NeurIPS.Jeremy [00:00:23]: I see.Swyx [00:00:24]: Very fun, standing outside street episode.Jeremy [00:00:27]: I never heard that, by the way. You've got to send me a link. I've got to hear what it sounded like.Swyx [00:00:30]: Yeah. Yeah, it's a NeurIPS podcast.Alessio [00:00:32]: I think the two episodes are six hours, so there's plenty to listen, we'll make sure to send it over.Swyx [00:00:37]: Yeah, we're trying this thing where at the major ML conferences, we, you know, do a little audio tour of, give people a sense of what it's like. But the last time you were on, you declared the end of fine tuning. I hope that I sort of editorialized the title a little bit, and I know you were slightly uncomfortable with it, but you just own it anyway. I think you're very good at the hot takes. And we were just discussing in our pre-show that it's really happening, that the continued pre-training is really happening.Jeremy [00:01:02]: Yeah, absolutely. I think people are starting to understand that treating the three ULM FIT steps of like pre-training, you know, and then the kind of like what people now call instruction tuning, and then, I don't know if we've got a general term for this, DPO, RLHFE step, you know, or the task training, they're not actually as separate as we originally suggested they were in our paper, and when you treat it more as a continuum, and that you make sure that you have, you know, more of kind of the original data set incorporated into the later stages, and that, you know, we've also seen with LLAMA3, this idea that those later stages can be done for a lot longer. These are all of the things I was kind of trying to describe there. It wasn't the end of fine tuning, but more that we should treat it as a continuum, and we should have much higher expectations of how much you can do with an already trained model. You can really add a lot of behavior to it, you can change its behavior, you can do a lot. So a lot of our research has been around trying to figure out how to modify the model by a larger amount rather than starting from random weights, because I get very offended at the idea of starting from random weights.Swyx [00:02:14]: Yeah, I saw that in ICLR in Vienna, there was an outstanding paper about starting transformers from data-driven piers. I don't know if you saw that one, they called it sort of never trained from scratch, and I think it was kind of rebelling against like the sort of random initialization.Jeremy [00:02:28]: Yeah, I've, you know, that's been our kind of continuous message since we started Fast AI, is if you're training for random weights, you better have a really good reason, you know, because it seems so unlikely to me that nobody has ever trained on data that has any similarity whatsoever to the general class of data you're working with, and that's the only situation in which I think starting from random weights makes sense.Swyx [00:02:51]: The other trends since our last pod that I would point people to is I'm seeing a rise in multi-phase pre-training. So Snowflake released a large model called Snowflake Arctic, where they detailed three phases of training where they had like a different mixture of like, there was like 75% web in the first instance, and then they reduced the percentage of the web text by 10% each time and increased the amount of code in each phase. And I feel like multi-phase is being called out in papers more. I feel like it's always been a thing, like changing data mix is not something new, but calling it a distinct phase is new, and I wonder if there's something that you're seeingJeremy [00:03:32]: on your end. Well, so they're getting there, right? So the point at which they're doing proper continued pre-training is the point at which that becomes a continuum rather than a phase. So the only difference with what I was describing last time is to say like, oh, there's a function or whatever, which is happening every batch. It's not a huge difference. You know, I always used to get offended when people had learning rates that like jumped. And so one of the things I started doing early on in Fast.ai was to say to people like, no, you should actually have your learning rate schedule should be a function, not a list of numbers. So now I'm trying to give the same idea about training mix.Swyx [00:04:07]: There's been pretty public work from Meta on schedule-free optimizers. I don't know if you've been following Aaron DeFazio and what he's doing, just because you mentioned learning rate schedules, you know, what if you didn't have a schedule?Jeremy [00:04:18]: I don't care very much, honestly. I don't think that schedule-free optimizer is that exciting. It's fine. We've had non-scheduled optimizers for ages, like Less Wright, who's now at Meta, who was part of the Fast.ai community there, created something called the Ranger optimizer. I actually like having more hyperparameters. You know, as soon as you say schedule-free, then like, well, now I don't get to choose. And there isn't really a mathematically correct way of, like, I actually try to schedule more parameters rather than less. So like, I like scheduling my epsilon in my atom, for example. I schedule all the things. But then the other thing we always did with the Fast.ai library was make it so you don't have to set any schedules. So Fast.ai always supported, like, you didn't even have to pass a learning rate. Like, it would always just try to have good defaults and do the right thing. But to me, I like to have more parameters I can play with if I want to, but you don't have to.Alessio [00:05:08]: And then the more less technical side, I guess, of your issue, I guess, with the market was some of the large research labs taking all this innovation kind of behind closed doors and whether or not that's good, which it isn't. And now we could maybe make it more available to people. And then a month after we released the episode, there was the whole Sam Altman drama and like all the OpenAI governance issues. And maybe people started to think more, okay, what happens if some of these kind of labs, you know, start to break from within, so to speak? And the alignment of the humans is probably going to fall before the alignment of the models. So I'm curious, like, if you have any new thoughts and maybe we can also tie in some of the way that we've been building Answer as like a public benefit corp and some of those aspects.Jeremy [00:05:51]: Sure. So, yeah, I mean, it was kind of uncomfortable because two days before Altman got fired, I did a small public video interview in which I said, I'm quite sure that OpenAI's current governance structure can't continue and that it was definitely going to fall apart. And then it fell apart two days later and a bunch of people were like, what did you know, Jeremy?Alessio [00:06:13]: What did Jeremy see?Jeremy [00:06:15]: I didn't see anything. It's just obviously true. Yeah. So my friend Eric Ries and I spoke a lot before that about, you know, Eric's, I think probably most people would agree, the top expert in the world on startup and AI governance. And you know, we could both clearly see that this didn't make sense to have like a so-called non-profit where then there are people working at a company, a commercial company that's owned by or controlled nominally by the non-profit, where the people in the company are being given the equivalent of stock options, like everybody there was working there with expecting to make money largely from their equity. So the idea that then a board could exercise control by saying like, oh, we're worried about safety issues and so we're going to do something that decreases the profit of the company, when every stakeholder in the company, their remuneration pretty much is tied to their profit, it obviously couldn't work. So I mean, that was a huge oversight there by someone. I guess part of the problem is that the kind of people who work at non-profits and in this case the board, you know, who are kind of academics and, you know, people who are kind of true believers. I think it's hard for them to realize that 99.999% of the world is driven very heavily by money, especially huge amounts of money. So yeah, Eric and I had been talking for a long time before that about what could be done differently, because also companies are sociopathic by design and so the alignment problem as it relates to companies has not been solved. Like, companies become huge, they devour their founders, they devour their communities and they do things where even the CEOs, you know, often of big companies tell me like, I wish our company didn't do that thing. You know, I know that if I didn't do it, then I would just get fired and the board would put in somebody else and the board knows if they don't do it, then their shareholders can sue them because they're not maximizing profitability or whatever. So what Eric's spent a lot of time doing is trying to think about how do we make companies less sociopathic, you know, how to, or more, you know, maybe a better way to think of it is like, how do we make it so that the founders of companies can ensure that their companies continue to actually do the things they want them to do? You know, when we started a company, hey, we very explicitly decided we got to start a company, not a academic lab, not a nonprofit, you know, we created a Delaware Seacorp, you know, the most company kind of company. But when we did so, we told everybody, you know, including our first investors, which was you Alessio. They sound great. We are going to run this company on the basis of maximizing long-term value. And in fact, so when we did our second round, which was an angel round, we had everybody invest through a long-term SPV, which we set up where everybody had to agree to vote in line with long-term value principles. So like never enough just to say to people, okay, we're trying to create long-term value here for society as well as for ourselves and everybody's like, oh, yeah, yeah, I totally agree with that. But when it comes to like, okay, well, here's a specific decision we have to make, which will not maximize short-term value, people suddenly change their mind. So you know, it has to be written into the legal documents of everybody so that no question that that's the way the company has to be managed. So then you mentioned the PBC aspect, Public Benefit Corporation, which I never quite understood previously. And turns out it's incredibly simple, like it took, you know, like one paragraph added to our corporate documents to become a PBC. It was cheap, it was easy, but it's got this huge benefit, which is if you're not a public benefit corporation, then somebody can come along and offer to buy you with a stated description of like turning your company into the thing you most hate, right? And if they offer you more than the market value of your company and you don't accept it, then you are not necessarily meeting the kind of your fiduciary responsibilities. So the way like Eric always described it to me is like, if Philip Morris came along and said that you've got great technology for marketing cigarettes to children, so we're going to pivot your company to do that entirely, and we're going to pay you 50% more than the market value, you're going to have to say yes. If you have a PBC, then you are more than welcome to say no, if that offer is not in line with your stated public benefit. So our stated public benefit is to maximize the benefit to society through using AI. So given that more children smoking doesn't do that, then we can say like, no, we're not selling to you.Alessio [00:11:01]: I was looking back at some of our emails. You sent me an email on November 13th about talking and then on the 14th, I sent you an email working together to free AI was the subject line. And then that was kind of the start of the C round. And then two days later, someone got fired. So you know, you were having these thoughts even before we had like a public example of like why some of the current structures didn't work. So yeah, you were very ahead of the curve, so to speak. You know, people can read your awesome introduction blog and answer and the idea of having a R&D lab versus our lab and then a D lab somewhere else. I think to me, the most interesting thing has been hiring and some of the awesome people that you've been bringing on that maybe don't fit the central casting of Silicon Valley, so to speak. Like sometimes I got it like playing baseball cards, you know, people are like, oh, what teams was this person on, where did they work versus focusing on ability. So I would love for you to give a shout out to some of the awesome folks that you have on the team.Jeremy [00:11:58]: So, you know, there's like a graphic going around describing like the people at XAI, you know, Elon Musk thing. And like they are all connected to like multiple of Stanford, Meta, DeepMind, OpenAI, Berkeley, Oxford. Look, these are all great institutions and they have good people. And I'm definitely not at all against that, but damn, there's so many other people. And one of the things I found really interesting is almost any time I see something which I think like this is really high quality work and it's something I don't think would have been built if that person hadn't built the thing right now, I nearly always reach out to them and ask to chat. And I tend to dig in to find out like, okay, you know, why did you do that thing? Everybody else has done this other thing, your thing's much better, but it's not what other people are working on. And like 80% of the time, I find out the person has a really unusual background. So like often they'll have like, either they like came from poverty and didn't get an opportunity to go to a good school or had dyslexia and, you know, got kicked out of school in year 11, or they had a health issue that meant they couldn't go to university or something happened in their past and they ended up out of the mainstream. And then they kind of succeeded anyway. Those are the people that throughout my career, I've tended to kind of accidentally hire more of, but it's not exactly accidentally. It's like when I see somebody who's done, two people who have done extremely well, one of them did extremely well in exactly the normal way from the background entirely pointing in that direction and they achieved all the hurdles to get there. And like, okay, that's quite impressive, you know, but another person who did just as well, despite lots of constraints and doing things in really unusual ways and came up with different approaches. That's normally the person I'm likely to find useful to work with because they're often like risk-takers, they're often creative, they're often extremely tenacious, they're often very open-minded. So that's the kind of folks I tend to find myself hiring. So now at Answer.ai, it's a group of people that are strong enough that nearly every one of them has independently come to me in the past few weeks and told me that they have imposter syndrome and they're not convinced that they're good enough to be here. And I kind of heard it at the point where I was like, okay, I don't think it's possible that all of you are so far behind your peers that you shouldn't get to be here. But I think part of the problem is as an R&D lab, the great developers look at the great researchers and they're like, wow, these big-brained, crazy research people with all their math and s**t, they're too cool for me, oh my God. And then the researchers look at the developers and they're like, oh, they're killing it, making all this stuff with all these people using it and talking on Twitter about how great it is. I think they're both a bit intimidated by each other, you know. And so I have to kind of remind them like, okay, there are lots of things in this world where you suck compared to lots of other people in this company, but also vice versa, you know, for all things. And the reason you came here is because you wanted to learn about those other things from those other people and have an opportunity to like bring them all together into a single unit. You know, it's not reasonable to expect you're going to be better at everything than everybody else. I guess the other part of it is for nearly all of the people in the company, to be honest, they have nearly always been better than everybody else at nearly everything they're doing nearly everywhere they've been. So it's kind of weird to be in this situation now where it's like, gee, I can clearly see that I suck at this thing that I'm meant to be able to do compared to these other people where I'm like the worst in the company at this thing for some things. So I think that's a healthy place to be, you know, as long as you keep reminding each other about that's actually why we're here. And like, it's all a bit of an experiment, like we don't have any managers. We don't have any hierarchy from that point of view. So for example, I'm not a manager, which means I don't get to tell people what to do or how to do it or when to do it. Yeah, it's been a bit of an experiment to see how that would work out. And it's been great. So for instance, Ben Clavier, who you might have come across, he's the author of Ragatouille, he's the author of Rerankers, super strong information retrieval guy. And a few weeks ago, you know, this additional channel appeared on Discord, on our private Discord called Bert24. And these people started appearing, as in our collab sections, we have a collab section for like collaborating with outsiders. And these people started appearing, there are all these names that I recognize, like Bert24, and they're all talking about like the next generation of Bert. And I start following along, it's like, okay, Ben decided that I think, quite rightly, we need a new Bert. Because everybody, like so many people are still using Bert, and it's still the best at so many things, but it actually doesn't take advantage of lots of best practices. And so he just went out and found basically everybody who's created better Berts in the last four or five years, brought them all together, suddenly there's this huge collaboration going on. So yeah, I didn't tell him to do that. He didn't ask my permission to do that. And then, like, Benjamin Warner dived in, and he's like, oh, I created a whole transformers from scratch implementation designed to be maximally hackable. He originally did it largely as a teaching exercise to show other people, but he was like, I could, you know, use that to create a really hackable BERT implementation. In fact, he didn't say that. He said, I just did do that, you know, and I created a repo, and then everybody's like starts using it. They're like, oh my god, this is amazing. I can now implement all these other BERT things. And it's not just answer AI guys there, you know, there's lots of folks, you know, who have like contributed new data set mixes and blah, blah, blah. So, I mean, I can help in the same way that other people can help. So like, then Ben Clavier reached out to me at one point and said, can you help me, like, what have you learned over time about how to manage intimidatingly capable and large groups of people who you're nominally meant to be leading? And so, you know, I like to try to help, but I don't direct. Another great example was Kerem, who, after our FSTP QLORA work, decided quite correctly that it didn't really make sense to use LoRa in today's world. You want to use the normalized version, which is called Dora. Like two or three weeks after we did FSTP QLORA, he just popped up and said, okay, I've just converted the whole thing to Dora, and I've also created these VLLM extensions, and I've got all these benchmarks, and, you know, now I've got training of quantized models with adapters that are as fast as LoRa, and as actually better than, weirdly, fine tuning. Just like, okay, that's great, you know. And yeah, so the things we've done to try to help make these things happen as well is we don't have any required meetings, you know, but we do have a meeting for each pair of major time zones that everybody's invited to, and, you know, people see their colleagues doing stuff that looks really cool and say, like, oh, how can I help, you know, or how can I learn or whatever. So another example is Austin, who, you know, amazing background. He ran AI at Fidelity, he ran AI at Pfizer, he ran browsing and retrieval for Google's DeepMind stuff, created Jemma.cpp, and he's been working on a new system to make it easier to do web GPU programming, because, again, he quite correctly identified, yeah, so I said to him, like, okay, I want to learn about that. Not an area that I have much expertise in, so, you know, he's going to show me what he's working on and teach me a bit about it, and hopefully I can help contribute. I think one of the key things that's happened in all of these is everybody understands what Eric Gilliam, who wrote the second blog post in our series, the R&D historian, describes as a large yard with narrow fences. Everybody has total flexibility to do what they want. We all understand kind of roughly why we're here, you know, we agree with the premises around, like, everything's too expensive, everything's too complicated, people are building too many vanity foundation models rather than taking better advantage of fine-tuning, like, there's this kind of general, like, sense of we're all on the same wavelength about, you know, all the ways in which current research is fucked up, and, you know, all the ways in which we're worried about centralization. We all care a lot about not just research for the point of citations, but research that actually wouldn't have happened otherwise, and actually is going to lead to real-world outcomes. And so, yeah, with this kind of, like, shared vision, people understand, like, you know, so when I say, like, oh, well, you know, tell me, Ben, about BERT 24, what's that about? And he's like, you know, like, oh, well, you know, you can see from an accessibility point of view, or you can see from a kind of a actual practical impact point of view, there's far too much focus on decoder-only models, and, you know, like, BERT's used in all of these different places and industry, and so I can see, like, in terms of our basic principles, what we're trying to achieve, this seems like something important. And so I think that's, like, a really helpful that we have that kind of shared perspective, you know?Alessio [00:21:14]: Yeah. And before we maybe talk about some of the specific research, when you're, like, reaching out to people, interviewing them, what are some of the traits, like, how do these things come out, you know, usually? Is it working on side projects that you, you know, you're already familiar with? Is there anything, like, in the interview process that, like, helps you screen for people that are less pragmatic and more research-driven versus some of these folks that are just gonna do it, you know? They're not waiting for, like, the perfect process.Jeremy [00:21:40]: Everybody who comes through the recruiting is interviewed by everybody in the company. You know, our goal is 12 people, so it's not an unreasonable amount. So the other thing to say is everybody so far who's come into the recruiting pipeline, everybody bar one, has been hired. So which is to say our original curation has been good. And that's actually pretty easy, because nearly everybody who's come in through the recruiting pipeline are people I know pretty well. So Jono Whitaker and I, you know, he worked on the stable diffusion course we did. He's outrageously creative and talented, and he's super, like, enthusiastic tinkerer, just likes making things. Benjamin was one of the strongest parts of the fast.ai community, which is now the alumni. It's, like, hundreds of thousands of people. And you know, again, like, they're not people who a normal interview process would pick up, right? So Benjamin doesn't have any qualifications in math or computer science. Jono was living in Zimbabwe, you know, he was working on, like, helping some African startups, you know, but not FAANG kind of credentials. But yeah, I mean, when you actually see people doing real work and they stand out above, you know, we've got lots of Stanford graduates and open AI people and whatever in our alumni community as well. You know, when you stand out above all of those people anyway, obviously you've got something going for you. You know, Austin, him and I worked together on the masks study we did in the proceeding at the National Academy of Science. You know, we had worked together, and again, that was a group of, like, basically the 18 or 19 top experts in the world on public health and epidemiology and research design and so forth. And Austin, you know, one of the strongest people in that collaboration. So yeah, you know, like, I've been lucky enough to have had opportunities to work with some people who are great and, you know, I'm a very open-minded person, so I kind of am always happy to try working with pretty much anybody and some people stand out. You know, there have been some exceptions, people I haven't previously known, like Ben Clavier, actually, I didn't know before. But you know, with him, you just read his code, and I'm like, oh, that's really well-written code. And like, it's not written exactly the same way as everybody else's code, and it's not written to do exactly the same thing as everybody else's code. So yeah, and then when I chatted to him, it's just like, I don't know, I felt like we'd known each other for years, like we just were on the same wavelength, but I could pretty much tell that was going to happen just by reading his code. I think you express a lot in the code you choose to write and how you choose to write it, I guess. You know, or another example, a guy named Vic, who was previously the CEO of DataQuest, and like, in that case, you know, he's created a really successful startup. He won the first, basically, Kaggle NLP competition, which was automatic essay grading. He's got the current state-of-the-art OCR system, Surya. Again, he's just a guy who obviously just builds stuff, you know, he doesn't ask for permission, he doesn't need any, like, external resources. Actually, Karim's another great example of this, I mean, I already knew Karim very well because he was my best ever master's student, but it wasn't a surprise to me then when he then went off to create the world's state-of-the-art language model in Turkish on his own, in his spare time, with no budget, from scratch. This is not fine-tuning or whatever, he, like, went back to Common Crawl and did everything. Yeah, it's kind of, I don't know what I'd describe that process as, but it's not at all based on credentials.Swyx [00:25:17]: Assemble based on talent, yeah. We wanted to dive in a little bit more on, you know, turning from the people side of things into the technical bets that you're making. Just a little bit more on Bert. I was actually, we just did an interview with Yi Tay from Reka, I don't know if you're familiar with his work, but also another encoder-decoder bet, and one of his arguments was actually people kind of over-index on the decoder-only GPT-3 type paradigm. I wonder if you have thoughts there that is maybe non-consensus as well. Yeah, no, absolutely.Jeremy [00:25:45]: So I think it's a great example. So one of the people we're collaborating with a little bit with BERT24 is Colin Raffle, who is the guy behind, yeah, most of that stuff, you know, between that and UL2, there's a lot of really interesting work. And so one of the things I've been encouraging the BERT group to do, Colin has as well, is to consider using a T5 pre-trained encoder backbone as a thing you fine-tune, which I think would be really cool. You know, Colin was also saying actually just use encoder-decoder as your Bert, you know, why don't you like use that as a baseline, which I also think is a good idea. Yeah, look.Swyx [00:26:25]: What technical arguments are people under-weighting?Jeremy [00:26:27]: I mean, Colin would be able to describe this much better than I can, but I'll give my slightly non-expert attempt. Look, I mean, think about like diffusion models, right? Like in stable diffusion, like we use things like UNet. You have this kind of downward path and then in the upward path you have the cross connections, which it's not a tension, but it's like a similar idea, right? You're inputting the original encoding path into your decoding path. It's critical to make it work, right? Because otherwise in the decoding part, the model has to do so much kind of from scratch. So like if you're doing translation, like that's a classic kind of encoder-decoder example. If it's decoder only, you never get the opportunity to find the right, you know, feature engineering, the right feature encoding for the original sentence. And it kind of means then on every token that you generate, you have to recreate the whole thing, you know? So if you have an encoder, it's basically saying like, okay, this is your opportunity model to create a really useful feature representation for your input information. So I think there's really strong arguments for encoder-decoder models anywhere that there is this kind of like context or source thing. And then why encoder only? Well, because so much of the time what we actually care about is a classification, you know? It's like an output. It's like generating an arbitrary length sequence of tokens. So anytime you're not generating an arbitrary length sequence of tokens, decoder models don't seem to make much sense. Now the interesting thing is, you see on like Kaggle competitions, that decoder models still are at least competitive with things like Deberta v3. They have to be way bigger to be competitive with things like Deberta v3. And the only reason they are competitive is because people have put a lot more time and money and effort into training the decoder only ones, you know? There isn't a recent Deberta. There isn't a recent Bert. Yeah, it's a whole part of the world that people have slept on a little bit. And this is just what happens. This is how trends happen rather than like, to me, everybody should be like, oh, let's look at the thing that has shown signs of being useful in the past, but nobody really followed up with properly. That's the more interesting path, you know, where people tend to be like, oh, I need to get citations. So what's everybody else doing? Can I make it 0.1% better, you know, or 0.1% faster? That's what everybody tends to do. Yeah. So I think it's like, Itay's work commercially now is interesting because here's like a whole, here's a whole model that's been trained in a different way. So there's probably a whole lot of tasks it's probably better at than GPT and Gemini and Claude. So that should be a good commercial opportunity for them if they can figure out what those tasks are.Swyx [00:29:07]: Well, if rumors are to be believed, and he didn't comment on this, but, you know, Snowflake may figure out the commercialization for them. So we'll see.Jeremy [00:29:14]: Good.Alessio [00:29:16]: Let's talk about FSDP, Qlora, Qdora, and all of that awesome stuff. One of the things we talked about last time, some of these models are meant to run on systems that nobody can really own, no single person. And then you were like, well, what if you could fine tune a 70B model on like a 4090? And I was like, no, that sounds great, Jeremy, but like, can we actually do it? And then obviously you all figured it out. Can you maybe tell us some of the worst stories behind that, like the idea behind FSDP, which is kind of taking sharded data, parallel computation, and then Qlora, which is do not touch all the weights, just go quantize some of the model, and then within the quantized model only do certain layers instead of doing everything.Jeremy [00:29:57]: Well, do the adapters. Yeah.Alessio [00:29:59]: Yeah. Yeah. Do the adapters. Yeah. I will leave the floor to you. I think before you published it, nobody thought this was like a short term thing that we're just going to have. And now it's like, oh, obviously you can do it, but it's not that easy.Jeremy [00:30:12]: Yeah. I mean, to be honest, it was extremely unpleasant work to do. It's like not at all enjoyable. I kind of did version 0.1 of it myself before we had launched the company, or at least the kind of like the pieces. They're all pieces that are difficult to work with, right? So for the quantization, you know, I chatted to Tim Detmers quite a bit and, you know, he very much encouraged me by saying like, yeah, it's possible. He actually thought it'd be easy. It probably would be easy for him, but I'm not Tim Detmers. And, you know, so he wrote bits and bytes, which is his quantization library. You know, he wrote that for a paper. He didn't write that to be production like code. It's now like everybody's using it, at least the CUDA bits. So like, it's not particularly well structured. There's lots of code paths that never get used. There's multiple versions of the same thing. You have to try to figure it out. So trying to get my head around that was hard. And you know, because the interesting bits are all written in CUDA, it's hard to like to step through it and see what's happening. And then, you know, FSTP is this very complicated library and PyTorch, which not particularly well documented. So the only really, really way to understand it properly is again, just read the code and step through the code. And then like bits and bytes doesn't really work in practice unless it's used with PEF, the HuggingFace library and PEF doesn't really work in practice unless you use it with other things. And there's a lot of coupling in the HuggingFace ecosystem where like none of it works separately. You have to use it all together, which I don't love. So yeah, trying to just get a minimal example that I can play with was really hard. And so I ended up having to rewrite a lot of it myself to kind of create this like minimal script. One thing that helped a lot was Medec had this LlamaRecipes repo that came out just a little bit before I started working on that. And like they had a kind of role model example of like, here's how to train FSTP, LoRa, didn't work with QLoRa on Llama. A lot of the stuff I discovered, the interesting stuff would be put together by Les Wright, who's, he was actually the guy in the Fast.ai community I mentioned who created the Ranger Optimizer. So he's doing a lot of great stuff at Meta now. So yeah, I kind of, that helped get some minimum stuff going and then it was great once Benjamin and Jono joined full time. And so we basically hacked at that together and then Kerim joined like a month later or something. And it was like, gee, it was just a lot of like fiddly detailed engineering on like barely documented bits of obscure internals. So my focus was to see if it kind of could work and I kind of got a bit of a proof of concept working and then the rest of the guys actually did all the work to make it work properly. And, you know, every time we thought we had something, you know, we needed to have good benchmarks, right? So we'd like, it's very easy to convince yourself you've done the work when you haven't, you know, so then we'd actually try lots of things and be like, oh, and these like really important cases, the memory use is higher, you know, or it's actually slower. And we'd go in and we just find like all these things that were nothing to do with our library that just didn't work properly. And nobody had noticed they hadn't worked properly because nobody had really benchmarked it properly. So we ended up, you know, trying to fix a whole lot of different things. And even as we did so, new regressions were appearing in like transformers and stuff that Benjamin then had to go away and figure out like, oh, how come flash attention doesn't work in this version of transformers anymore with this set of models and like, oh, it turns out they accidentally changed this thing, so it doesn't work. You know, there's just, there's not a lot of really good performance type evals going on in the open source ecosystem. So there's an extraordinary amount of like things where people say like, oh, we built this thing and it has this result. And when you actually check it, so yeah, there's a shitload of war stories from getting that thing to work. And it did require a particularly like tenacious group of people and a group of people who don't mind doing a whole lot of kind of like really janitorial work, to be honest, to get the details right, to check them. Yeah.Alessio [00:34:09]: We had a trade out on the podcast and we talked about how a lot of it is like systems work to make some of these things work. It's not just like beautiful, pure math that you do on a blackboard. It's like, how do you get into the nitty gritty?Jeremy [00:34:22]: I mean, flash attention is a great example of that. Like it's, it basically is just like, oh, let's just take the attention and just do the tiled version of it, which sounds simple enough, you know, but then implementing that is challenging at lots of levels.Alessio [00:34:36]: Yeah. What about inference? You know, obviously you've done all this amazing work on fine tuning. Do you have any research you've been doing on the inference side, how to make local inference really fast on these models too?Jeremy [00:34:47]: We're doing quite a bit on that at the moment. We haven't released too much there yet. But one of the things I've been trying to do is also just to help other people. And one of the nice things that's happened is that a couple of folks at Meta, including Mark Seraphim, have done a nice job of creating this CUDA mode community of people working on like CUDA kernels or learning about that. And I tried to help get that going well as well and did some lessons to help people get into it. So there's a lot going on in both inference and fine tuning performance. And a lot of it's actually happening kind of related to that. So PyTorch team have created this Torch AO project on quantization. And so there's a big overlap now between kind of the FastAI and AnswerAI and CUDA mode communities of people working on stuff for both inference and fine tuning. But we're getting close now. You know, our goal is that nobody should be merging models, nobody should be downloading merged models, everybody should be using basically quantized plus adapters for almost everything and just downloading the adapters. And that should be much faster. So that's kind of the place we're trying to get to. It's difficult, you know, because like Karim's been doing a lot of work with VLM, for example. These inference engines are pretty complex bits of code. They have a whole lot of custom kernel stuff going on as well, as do the quantization libraries. So we've been working on, we're also quite a bit of collaborating with the folks who do HQQ, which is a really great quantization library and works super well. So yeah, there's a lot of other people outside AnswerAI that we're working with a lot who are really helping on all this performance optimization stuff, open source.Swyx [00:36:27]: Just to follow up on merging models, I picked up there that you said nobody should be merging models. That's interesting because obviously a lot of people are experimenting with this and finding interesting results. I would say in defense of merging models, you can do it without data. That's probably the only thing that's going for it.Jeremy [00:36:45]: To explain, it's not that you shouldn't merge models. You shouldn't be distributing a merged model. You should distribute a merged adapter 99% of the time. And actually often one of the best things happening in the model merging world is actually that often merging adapters works better anyway. The point is, Sean, that once you've got your new model, if you distribute it as an adapter that sits on top of a quantized model that somebody's already downloaded, then it's a much smaller download for them. And also the inference should be much faster because you're not having to transfer FB16 weights from HPM memory at all or ever load them off disk. You know, all the main weights are quantized and the only floating point weights are in the adapters. So that should make both inference and fine tuning faster. Okay, perfect.Swyx [00:37:33]: We're moving on a little bit to the rest of the fast universe. I would have thought that, you know, once you started Answer.ai, that the sort of fast universe would be kind of on hold. And then today you just dropped Fastlight and it looks like, you know, there's more activity going on in sort of Fastland.Jeremy [00:37:49]: Yeah. So Fastland and Answerland are not really distinct things. Answerland is kind of like the Fastland grown up and funded. They both have the same mission, which is to maximize the societal benefit of AI broadly. We want to create thousands of commercially successful products at Answer.ai. And we want to do that with like 12 people. So that means we need a pretty efficient stack, you know, like quite a few orders of magnitude more efficient, not just for creation, but for deployment and maintenance than anything that currently exists. People often forget about the D part of our R&D firm. So we've got to be extremely good at creating, deploying and maintaining applications, not just models. Much to my horror, the story around creating web applications is much worse now than it was 10 or 15 years ago in terms of, if I say to a data scientist, here's how to create and deploy a web application, you know, either you have to learn JavaScript or TypeScript and about all the complex libraries like React and stuff, and all the complex like details around security and web protocol stuff around how you then talk to a backend and then all the details about creating the backend. You know, if that's your job and, you know, you have specialists who work in just one of those areas, it is possible for that to all work. But compared to like, oh, write a PHP script and put it in the home directory that you get when you sign up to this shell provider, which is what it was like in the nineties, you know, here are those 25 lines of code and you're done and now you can pass that URL around to all your friends, or put this, you know, .pl file inside the CGI bin directory that you got when you signed up to this web host. So yeah, the thing I've been mainly working on the last few weeks is fixing all that. And I think I fixed it. I don't know if this is an announcement, but I tell you guys, so yeah, there's this thing called fastHTML, which basically lets you create a complete web application in a single Python file. Unlike excellent projects like Streamlit and Gradio, you're not working on top of a highly abstracted thing. That's got nothing to do with web foundations. You're working with web foundations directly, but you're able to do it by using pure Python. There's no template, there's no ginger, there's no separate like CSS and JavaScript files. It looks and behaves like a modern SPA web application. And you can create components for like daisy UI, or bootstrap, or shoelace, or whatever fancy JavaScript and or CSS tailwind etc library you like, but you can write it all in Python. You can pip install somebody else's set of components and use them entirely from Python. You can develop and prototype it all in a Jupyter notebook if you want to. It all displays correctly, so you can like interactively do that. And then you mentioned Fastlight, so specifically now if you're using SQLite in particular, it's like ridiculously easy to have that persistence, and all of your handlers will be passed database ready objects automatically, that you can just call dot delete dot update dot insert on. Yeah, you get session, you get security, you get all that. So again, like with most everything I do, it's very little code. It's mainly tying together really cool stuff that other people have written. You don't have to use it, but a lot of the best stuff comes from its incorporation of HTMX, which to me is basically the thing that changes your browser to make it work the way it always should have. So it just does four small things, but those four small things are the things that are basically unnecessary constraints that HTML should never have had, so it removes the constraints. It sits on top of Starlet, which is a very nice kind of lower level platform for building these kind of web applications. The actual interface matches as closely as possible to FastAPI, which is a really nice system for creating the kind of classic JavaScript type applications. And Sebastian, who wrote FastAPI, has been kind enough to help me think through some of these design decisions, and so forth. I mean, everybody involved has been super helpful. Actually, I chatted to Carson, who created HTMX, you know, so about it. Some of the folks involved in Django, like everybody in the community I've spoken to definitely realizes there's a big gap to be filled around, like, highly scalable, web foundation-based, pure Python framework with a minimum of fuss. So yeah, I'm getting a lot of support and trying to make sure that FastHTML works well for people.Swyx [00:42:38]: I would say, when I heard about this, I texted Alexio. I think this is going to be pretty huge. People consider Streamlit and Gradio to be the state of the art, but I think there's so much to improve, and having what you call web foundations and web fundamentals at the core of it, I think, would be really helpful.Jeremy [00:42:54]: I mean, it's based on 25 years of thinking and work for me. So like, FastML was built on a system much like this one, but that was of hell. And so I spent, you know, 10 years working on that. We had millions of people using that every day, really pushing it hard. And I really always enjoyed working in that. Yeah. So, you know, and obviously lots of other people have done like great stuff, and particularly HTMX. So I've been thinking about like, yeah, how do I pull together the best of the web framework I created for FastML with HTMX? There's also things like PicoCSS, which is the CSS system, which by default, FastHTML comes with. Although, as I say, you can pip install anything you want to, but it makes it like super easy to, you know, so we try to make it so that just out of the box, you don't have any choices to make. Yeah. You can make choices, but for most people, you just, you know, it's like the PHP in your home directory thing. You just start typing and just by default, you'll get something which looks and feels, you know, pretty okay. And if you want to then write a version of Gradio or Streamlit on top of that, you totally can. And then the nice thing is if you then write it in kind of the Gradio equivalent, which will be, you know, I imagine we'll create some kind of pip installable thing for that. Once you've outgrown, or if you outgrow that, it's not like, okay, throw that all away and start again. And this like whole separate language that it's like this kind of smooth, gentle path that you can take step-by-step because it's all just standard web foundations all the way, you know.Swyx [00:44:29]: Just to wrap up the sort of open source work that you're doing, you're aiming to create thousands of projects with a very, very small team. I haven't heard you mention once AI agents or AI developer tooling or AI code maintenance. I know you're very productive, but you know, what is the role of AI in your own work?Jeremy [00:44:47]: So I'm making something. I'm not sure how much I want to say just yet.Swyx [00:44:52]: Give us a nibble.Jeremy [00:44:53]: All right. I'll give you the key thing. So I've created a new approach. It's not called prompt engineering. It's called dialogue engineering. But I'm creating a system for doing dialogue engineering. It's currently called AI magic. I'm doing most of my work in this system and it's making me much more productive than I was before I used it. So I always just build stuff for myself and hope that it'll be useful for somebody else. Think about chat GPT with code interpreter, right? The basic UX is the same as a 1970s teletype, right? So if you wrote APL on a teletype in the 1970s, you typed onto a thing, your words appeared at the bottom of a sheet of paper and you'd like hit enter and it would scroll up. And then the answer from APL would be printed out, scroll up, and then you would type the next thing. And like, which is also the way, for example, a shell works like bash or ZSH or whatever. It's not terrible, you know, like we all get a lot done in these like very, very basic teletype style REPL environments, but I've never felt like it's optimal and everybody else has just copied chat GPT. So it's also the way BART and Gemini work. It's also the way the Claude web app works. And then you add code interpreter. And the most you can do is to like plead with chat GPT to write the kind of code I want. It's pretty good for very, very, very beginner users who like can't code at all, like by default now the code's even hidden away, so you never even have to see it ever happened. But for somebody who's like wanting to learn to code or who already knows a bit of code or whatever, it's, it seems really not ideal. So okay, that's one end of the spectrum. The other end of the spectrum, which is where Sean's work comes in, is, oh, you want to do more than chat GPT? No worries. Here is Visual Studio Code. I run it. There's an empty screen with a flashing cursor. Okay, start coding, you know, and it's like, okay, you can use systems like Sean's or like cursor or whatever to be like, okay, Apple K in cursors, like a creative form that blah, blah, blah. But in the end, it's like a convenience over the top of this incredibly complicated system that full-time sophisticated software engineers have designed over the past few decades in a totally different environment as a way to build software, you know. And so we're trying to like shoehorn in AI into that. And it's not easy to do. And I think there are like much better ways of thinking about the craft of software development in a language model world to be much more interactive, you know. So the thing that I'm building is neither of those things. It's something between the two. And it's built around this idea of crafting a dialogue, you know, where the outcome of the dialogue is the artifacts that you want, whether it be a piece of analysis or whether it be a Python library or whether it be a technical blog post or whatever. So as part of building that, I've created something called Claudette, which is a library for Claude. I've created something called Cosette, which is a library for OpenAI. They're libraries which are designed to make those APIs much more usable, much easier to use, much more concise. And then I've written AI magic on top of those. And that's been an interesting exercise because I did Claudette first, and I was looking at what Simon Willison did with his fantastic LLM library. And his library is designed around like, let's make something that supports all the LLM inference engines and commercial providers. I thought, okay, what if I did something different, which is like make something that's as Claude friendly as possible and forget everything else. So that's what Claudette was. So for example, one of the really nice things in Claude is prefill. So by telling the assistant that this is what your response started with, there's a lot of powerful things you can take advantage of. So yeah, I created Claudette to be as Claude friendly as possible. And then after I did that, and then particularly with GPT 4.0 coming out, I kind of thought, okay, now let's create something that's as OpenAI friendly as possible. And then I tried to look to see, well, where are the similarities and where are the differences? And now can I make them compatible in places where it makes sense for them to be compatible without losing out on the things that make each one special for what they are. So yeah, those are some of the things I've been working on in that space. And I'm thinking we might launch AI magic via a course called how to solve it with code. The name is based on the classic Polya book, if you know how to solve it, which is, you know, one of the classic math books of all time, where we're basically going to try to show people how to solve challenging problems that they didn't think they could solve without doing a full computer science course, by taking advantage of a bit of AI and a bit of like practical skills, as particularly for this like whole generation of people who are learning to code with and because of ChatGPT. Like I love it, I know a lot of people who didn't really know how to code, but they've created things because they use ChatGPT, but they don't really know how to maintain them or fix them or add things to them that ChatGPT can't do, because they don't really know how to code. And so this course will be designed to show you how you can like either become a developer who can like supercharge their capabilities by using language models, or become a language model first developer who can supercharge their capabilities by understanding a bit about process and fundamentals.Alessio [00:50:19]: Nice. That's a great spoiler. You know, I guess the fourth time you're going to be on learning space, we're going to talk about AI magic. Jeremy, before we wrap, this was just a great run through everything. What are the things that when you next come on the podcast in nine, 12 months, we're going to be like, man, Jeremy was like really ahead of it. Like, is there anything that you see in the space that maybe people are not talking enough? You know, what's the next company that's going to fall, like have drama internally, anything in your mind?Jeremy [00:50:47]: You know, hopefully we'll be talking a lot about fast HTML and hopefully the international community that at that point has come up around that. And also about AI magic and about dialogue engineering. Hopefully dialogue engineering catches on because I think it's the right way to think about a lot of this stuff. What else? Just trying to think about all on the research side. Yeah. I think, you know, I mean, we've talked about a lot of it. Like I think encoder decoder architectures, encoder only architectures, hopefully we'll be talking about like the whole re-interest in BERT that BERT 24 stimulated.Swyx [00:51:17]: There's a safe space model that came out today that might be interesting for this general discussion. One thing that stood out to me with Cartesia's blog posts was that they were talking about real time ingestion, billions and trillions of tokens, and keeping that context, obviously in the state space that they have.Jeremy [00:51:34]: Yeah.Swyx [00:51:35]: I'm wondering what your thoughts are because you've been entirely transformers the whole time.Jeremy [00:51:38]: Yeah. No. So obviously my background is RNNs and LSTMs. Of course. And I'm still a believer in the idea that state is something you can update, you know? So obviously Sepp Hochreiter came up, came out with xLSTM recently. Oh my God. Okay. Another whole thing we haven't talked about, just somewhat related. I've been going crazy for like a long time about like, why can I not pay anybody to save my KV cash? I just ingested the Great Gatsby or the documentation for Starlet or whatever, you know, I'm sending it as my prompt context. Why are you redoing it every time? So Gemini is about to finally come out with KV caching, and this is something that Austin actually in Gemma.cpp had had on his roadmap for years, well not years, months, long time. The idea that the KV cache is like a thing that, it's a third thing, right? So there's RAG, you know, there's in-context learning, you know, and prompt engineering, and there's KV cache creation. I think it creates like a whole new class almost of applications or as techniques where, you know, for me, for example, I very often work with really new libraries or I've created my own library that I'm now writing with rather than on. So I want all the docs in my new library to be there all the time. So I want to upload them once, and then we have a whole discussion about building this application using FastHTML. Well nobody's got FastHTML in their language model yet, I don't want to send all the FastHTML docs across every time. So one of the things I'm looking at doing in AI Magic actually is taking advantage of some of these ideas so that you can have the documentation of the libraries you're working on be kind of always available. Something over the next 12 months people will be spending time thinking about is how to like, where to use RAG, where to use fine-tuning, where to use KV cache storage, you know. And how to use state, because in state models and XLSTM, again, state is something you update. So how do we combine the best of all of these worlds?Alessio [00:53:46]: And Jeremy, I know before you talked about how some of the autoregressive models are not maybe a great fit for agents. Any other thoughts on like JEPA, diffusion for text, any interesting thing that you've seen pop up?Jeremy [00:53:58]: In the same way that we probably ought to have state that you can update, i.e. XLSTM and state models, in the same way that a lot of things probably should have an encoder, JEPA and diffusion both seem like the right conceptual mapping for a lot of things we probably want to do. So the idea of like, there should be a piece of the generative pipeline, which is like thinking about the answer and coming up with a sketch of what the answer looks like before you start outputting tokens. That's where it kind of feels like diffusion ought to fit, you know. And diffusion is, because it's not autoregressive, it's like, let's try to like gradually de-blur the picture of how to solve this. So this is also where dialogue engineering fits in, by the way. So with dialogue engineering, one of the reasons it's working so well for me is I use it to kind of like craft the thought process before I generate the code, you know. So yeah, there's a lot of different pieces here and I don't know how they'll all kind of exactly fit together. I don't know if JEPA is going to actually end up working in the text world. I don't know if diffusion will end up working in the text world, but they seem to be like trying to solve a class of problem which is currently unsolved.Alessio [00:55:13]: Awesome, Jeremy. This was great, as usual. Thanks again for coming back on the pod and thank you all for listening. Yeah, that was fantastic. Get full access to Latent Space at www.latent.space/subscribe

god ceo ai google science magic building research japanese elon musk development african chatgpt silicon valley ceos discord oxford stanford spa products berkeley cto pfizer react bart managers governance openai gemini turkish residence cgi zimbabwe nvidia ux api shipping ranger vic gpt python ui winds ml llama snowflakes national academy apis karim javascript html assemble r d llm sam altman gpu altman great gatsby css php django jono kv rag ocr anthropic deepmind alessio fine tuning surya faang xai typescript eric ries philip morris pbc apl starlet cuda visual studio code dpo kerem t5 reka kerim kaggle sqlite pytorch spv itay jupyter public benefit corporation 33m pef jeremy howard 70b repl neurips berts ai engineer huggingface vl m htmx ai winter zsh hpm simon willison alexio rnns streamlit iclr webgpu latent space unet gradio lstms polya web application development

The Winds of AI Winter (Q2 Four Wars Recap) + ChatGPT Voice Mode Preview

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Aug 2, 2024 115:01

Thank you for 1m downloads of the podcast and 2m readers of the Substack!

united states god ceo american new york world ai australia english google apple vision voice talk americans san francisco new york times research war chinese rich data australian market european union search microsoft italian holy drop new zealand south iphone illinois selling irish chatgpt code ladies supreme court missouri memory valley os atlantic whatsapp software reddit wars washington post cloud singapore midwest philippines indonesia laugh scottish intelligence ios new yorker context scaling mark zuckerberg architecture uma snap oracle substack stopping bloomberg cto iq malaysia whispers vc similar adapt ipo southeast asia determine fireworks optimizing openai gemini residence gdp laughing gateway fusion nvidia nah acknowledge hardware chess financial times api document av wang frontier 10k chrome blank verge mojo scarlett johansson gpt vertical winds gorilla aws ftc nexus ml lama llama small talk boston marathon goldman mandarin apis bedtime ruler great lakes consensus amd nome synthetic frameworks tt band aids ids nano romain chameleons opus llm biases sam altman colbert hirsch weights ops chai gpu mamba pdfs skynet gg crowdstrike venn 5b google chrome modular gnome soc soit skyfall mozilla zuck perplexity wix kv imo cuz nama rag haiku anthropic vespa gpus rudyard kipling sonnets 7b golden gate bridge benchmarking sdks quadrants ilya ccs irobot lambda san fernando valley alessio lightspeed asics lms 8b stack overflow crackle scarjo little italy noose restful economically mistral suno cpus lex fridman malay shutterstock riaa asic superintelligence larry ellison inflection opex gcp tts vertex multimodal a16z latency ozymandias observability datadog olympiads gradient proxies asr icm baits drop zone mimicry rpc devrel ai news etched netlify cloud platforms gpc temasek sandbagging jamba eclair gbt gpd apple notes character ai exa augments neurips ai engineer li bai huggingface entropic george hotz harvard yard gbd singlish code interpreter icml ai winter ml ops phy martin casado crosstrek johnny ive technium latent space numina inprint sohu i okay

Llama 2, 3 & 4: Synthetic Data, RLHF, Agents on the path to Open Source AGI

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Jul 23, 2024 65:07

If you see this in time, join our emergency LLM paper club on the Llama 3 paper!For everyone else, join our special AI in Action club on the Latent Space Discord for a special feature with the Cursor cofounders on Composer, their newest coding agent!Today, Meta is officially releasing the largest and most capable open model to date, Llama3-405B, a dense transformer trained on 15T tokens that beats GPT-4 on all major benchmarks:The 8B and 70B models from the April Llama 3 release have also received serious spec bumps, warranting the new label of Llama 3.1.If you are curious about the infra / hardware side, go check out our episode with Soumith Chintala, one of the AI infra leads at Meta. Today we have Thomas Scialom, who led Llama2 and now Llama3 post-training, so we spent most of our time on pre-training (synthetic data, data pipelines, scaling laws, etc) and post-training (RLHF vs instruction tuning, evals, tool calling).Synthetic data is all you needLlama3 was trained on 15T tokens, 7x more than Llama2 and with 4 times as much code and 30 different languages represented. But as Thomas beautifully put it:“My intuition is that the web is full of s**t in terms of text, and training on those tokens is a waste of compute.” “Llama 3 post-training doesn't have any human written answers there basically… It's just leveraging pure synthetic data from Llama 2.”While it is well speculated that the 8B and 70B were "offline distillations" of the 405B, there are a good deal more synthetic data elements to Llama 3.1 than the expected. The paper explicitly calls out:* SFT for Code: 3 approaches for synthetic data for the 405B bootstrapping itself with code execution feedback, programming language translation, and docs backtranslation.* SFT for Math: The Llama 3 paper credits the Let's Verify Step By Step authors, who we interviewed at ICLR:* SFT for Multilinguality: "To collect higher quality human annotations in non-English languages, we train a multilingual expert by branching off the pre-training run and continuing to pre-train on a data mix that consists of 90% multilingualtokens."* SFT for Long Context: "It is largely impractical to get humans to annotate such examples due to the tedious and time-consuming nature of reading lengthy contexts, so we predominantly rely on synthetic data to fill this gap. We use earlier versions of Llama 3 to generate synthetic data based on the key long-context use-cases: (possibly multi-turn) question-answering, summarization for long documents, and reasoning over code repositories, and describe them in greater detail below"* SFT for Tool Use: trained for Brave Search, Wolfram Alpha, and a Python Interpreter (a special new ipython role) for single, nested, parallel, and multiturn function calling.* RLHF: DPO preference data was used extensively on Llama 2 generations. This is something we partially covered in RLHF 201: humans are often better at judging between two options (i.e. which of two poems they prefer) than creating one (writing one from scratch). Similarly, models might not be great at creating text but they can be good at classifying their quality.Last but not least, Llama 3.1 received a license update explicitly allowing its use for synthetic data generation.Llama2 was also used as a classifier for all pre-training data that went into the model. It both labelled it by quality so that bad tokens were removed, but also used type (i.e. science, law, politics) to achieve a balanced data mix. Tokenizer size mattersThe tokens vocab of a model is the collection of all tokens that the model uses. Llama2 had a 34,000 tokens vocab, GPT-4 has 100,000, and 4o went up to 200,000. Llama3 went up 4x to 128,000 tokens. You can find the GPT-4 vocab list on Github.This is something that people gloss over, but there are many reason why a large vocab matters:* More tokens allow it to represent more concepts, and then be better at understanding the nuances.* The larger the tokenizer, the less tokens you need for the same amount of text, extending the perceived context size. In Llama3's case, that's ~30% more text due to the tokenizer upgrade. * With the same amount of compute you can train more knowledge into the model as you need fewer steps.The smaller the model, the larger the impact that the tokenizer size will have on it. You can listen at 55:24 for a deeper explanation.Dense models = 1 Expert MoEsMany people on X asked “why not MoE?”, and Thomas' answer was pretty clever: dense models are just MoEs with 1 expert :)[00:28:06]: I heard that question a lot, different aspects there. Why not MoE in the future? The other thing is, I think a dense model is just one specific variation of the model for an hyperparameter for an MOE with basically one expert. So it's just an hyperparameter we haven't optimized a lot yet, but we have some stuff ongoing and that's an hyperparameter we'll explore in the future.Basically… wait and see!Llama4Meta already started training Llama4 in June, and it sounds like one of the big focuses will be around agents. Thomas was one of the authors behind GAIA (listen to our interview with Thomas in our ICLR recap) and has been working on agent tooling for a while with things like Toolformer. Current models have “a gap of intelligence” when it comes to agentic workflows, as they are unable to plan without the user relying on prompting techniques and loops like ReAct, Chain of Thought, or frameworks like Autogen and Crew. That may be fixed soon?

ai english france action future state french phd data focus microsoft mit teacher current chatgpt character code web improving singapore period latin honestly blm bay area researchers architecture arena cto bloom react nlp academia chain openai residence composer bits open source gaia gpt github guillaume llama jarvis synthetic llm google docs reasoning genai gpu agi elo sorbonne udio node kepler instruct anthropic gpus 7b raspberry dense benchmarking deepmind noam tldr grammarly alessio latex 8b recitals gans mistral cursor meta ai chinchillas alphago galactica mattersthe moes wolfram alpha gaussian allen institute yann lecun sorbonne university 70b andrej karpathy 400b 128k sft huggingface rephrase tool use rlhf bpe model b brave search xgboost iclr latent space lhf llama2

LW - Unlearning via RMU is mostly shallow by Andy Arditi

The Nonlinear Library

Play Episode Listen Later Jul 23, 2024 10:21

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Unlearning via RMU is mostly shallow, published by Andy Arditi on July 23, 2024 on LessWrong. This is an informal research note. It is the result of a few-day exploration into RMU through the lens of model internals. Code to reproduce the main result is available here. This work was produced as part of Ethan Perez's stream in the ML Alignment & Theory Scholars Program - Summer 2024 Cohort. Thanks to Nina Panickssery, Mrinank Sharma, and Fabien Roger for helpful discussion. Summary We investigate RMU, a recent unlearning method proposed by Li et al. (2024), through the lens of model internals. Through this lens, we explain that RMU mostly works by flooding the residual stream with "junk" in hazardous contexts, resulting in incoherence. We then propose a simple intervention to "clear the junk" from the residual stream. This intervention mostly restores the model's coherence in hazardous contexts, and recovers a significant proportion (but not all) of its original hazardous knowledge. This suggests that the effectiveness of RMU can be understood roughly in two pieces: (1) a shallow mechanism, where the residual stream is flooded with junk; and (2) a deeper mechanism, where even after the junk is cleared, knowledge is still inaccessible. What is RMU? Representation Misdirection for Unlearning (RMU) is a state-of-the-art unlearning method presented by Li et al. (2024). In the unlearning paradigm, we would like the model to unlearn (or "forget") some hazardous knowledge. At the same time, we would also like to make sure the model retains non-hazardous knowledge, so that the model remains useful. This partition of knowledge is usually specified by constructing a "forget" dataset Dforget, consisting of the hazardous knowledge to be unlearned, and a "retain" dataset Dretain, consisting of non-hazardous knowledge to be retained. Let M denote our original model. RMU specifies a method for fine-tuning M on Dforget and Dretain in order to obtain a modified model M' satisfying the unlearning objective. The main idea of RMU is as follows: On hazardous data, the internal activations of M' should be scrambled. On non-hazardous data, the internal activations of M' should be unchanged, i.e. close to those of the original model M. These two ideas are concretely operationalized as two distinct terms in the loss during fine-tuning: On Dforget, incentivize activations a'ℓ at some layer ℓ to be close to a large randomly sampled vector cu. "Forget" loss term: ||a'ℓcu||22. On Dretain, incentivize activations a'ℓ at some layer ℓ to be close to the original model's activations aℓ. "Retain" loss term: ||a'ℓaℓ||22. Note that u is a random unit vector sampled before the fine-tuning procedure, and kept constant throughout (i.e. it is not freshly sampled at each training step). Also note that the layer ℓ at which to target activations, and also the scalar multiplier c are predetermined hyperparameters. Examining an RMU model The original paper (Li et al., 2024) performs RMU over multiple open-source models of varying scales. The authors made all code available on GitHub, and all resulting models available on HuggingFace.[1] For our analysis, we pick a single model pair: zephyr-7B-beta (which we will refer to as "baseline") and Zephyr_RMU (which we will refer to as "RMU"). The RMU model has been fine-tuned to unlearn two domains of knowledge: hazardous biology knowledge, and hazardous cybersecurity knowledge. Prompting with hazardous instructions Prompting the RMU model with an instruction in one of these domains causes it to output gibberish, as we would expect from a model with its activations scrambled: Looking at activations We can take a handful of hazardous prompts, run them through the baseline and RMU models, and compare their activations. We specifically study the activations at the last tok...

code speech examining ea li retain github unlearning shallow cohorts 7b prompting rationalist rmu huggingface arditi lesswrong

Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Jul 12, 2024 58:29

The first AI Engineer World's Fair talks from OpenAI and Cognition are up!In our Benchmarks 101 episode back in April 2023 we covered the history of AI benchmarks, their shortcomings, and our hopes for better ones. Fast forward 1.5 years, the pace of model development has far exceeded the speed at which benchmarks are updated. Frontier labs are still using MMLU and HumanEval for model marketing, even though most models are reaching their natural plateau at a ~90% success rate (any higher and they're probably just memorizing/overfitting).From Benchmarks to LeaderboardsOutside of being stale, lab-reported benchmarks also suffer from non-reproducibility. The models served through the API also change over time, so at different points in time it might return different scores.Today's guest, Clémentine Fourrier, is the lead maintainer of HuggingFace's OpenLLM Leaderboard. Their goal is standardizing how models are evaluated by curating a set of high quality benchmarks, and then publishing the results in a reproducible way with tools like EleutherAI's Harness.The leaderboard was first launched summer 2023 and quickly became the de facto standard for open source LLM performance. To give you a sense for the scale:* Over 2 million unique visitors* 300,000 active community members* Over 7,500 models evaluatedLast week they announced the second version of the leaderboard. Why? Because models were getting too good!The new version of the leaderboard is based on 6 benchmarks:*

AF - Representation Tuning by Christopher Ackerman

The Nonlinear Library

Play Episode Listen Later Jun 27, 2024 13:07

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Representation Tuning, published by Christopher Ackerman on June 27, 2024 on The AI Alignment Forum. Summary First, I identify activation vectors related to honesty in an RLHF'd LLM (Llama-2-13b-chat). Next, I demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, I show that a similar effect can be achieved by fine-tuning the vectors directly into (or out of) the model, by use of a loss function based on the cosine similarity of residual stream activations to the vectors. Finally, I compare the results to fine-tuning with a token-based loss on honest or dishonest prompts, and to online steering. Overall, fine-tuning the vectors into the models using the cosine similarity loss had the strongest effect on shifting model output in the intended direction, and showed some resistance to subsequent steering, suggesting the potential utility of this approach as a safety measure. This work was done as the capstone project for BlueDot Impact's AI Safety Fundamentals - Alignment course, June 2024 Introduction The concept of activation steering/representation engineering is simple, and it is remarkable that it works. First, one identifies an activation pattern in a model (generally in the residual stream input or output) corresponding to a high-level behavior like "sycophancy" or "honesty" by a simple expedient such as running pairs of inputs with and without the behavior through the model and taking the mean of the differences in the pairs' activations. Then one adds the resulting vector, scaled by +/- various coefficients, to the model's activations as it generates new output, and the model gives output that has more or less of the behavior, as one desires. This would seem quite interesting from the perspective of LLM interpretability, and potentially safety. Beneath the apparent simplicity of activation steering, there are a lot of details and challenges, from deciding on which behavioral dimension to use, to identifying the best way to elicit representations relevant to it in the model, to determining which layers to target for steering, and more. A number of differing approaches having been reported and many more are possible, and I explored many of them before settling on one to pursue more deeply; see this github repo for a longer discussion of this process and associated code. In this work I extend the activation steering concept by permanently changing the weights of the model via fine-tuning, obviating the need for active steering with every input. Other researchers have independently explored the idea of fine-tuning as a replacement for online steering, but this work is distinctive in targeting the tuning specifically at model activations, rather than the standard method of tuning based on model output deviations from target output. In addition to offering compute savings due to not having to add vectors to every token at inference, it was hypothesized that this approach might make the model more robust in its intended behavior. See this github repo for representation tuning code and methods. Tuned models are available in this HuggingFace repo. The basic approach I use in the work is as follows: 1. Identify candidate steering vectors for the behavioral dimension of interest (here, Honesty) via contrastive factual true/false prompts and PCA. 2. Use visualizations to infer the meaning of the vectors and candidate model layers to target for steering/tuning. 3. Identify the most effective steering parameters (layers and multipliers) via steering on an evaluation dataset containing contrastive prompts (but no labels). 4. Fine tune the vectors into or out of the model, targeting the layers identified above, using cosine similarity loss and, separately, f...

speech identify honesty representation ea beneath tuning llm tuned ackerman pca rationalist huggingface rlhf

Local GenAI LLMs with Ollama and Docker

DevOps and Docker Talk

Play Episode Listen Later Jun 14, 2024 50:08

Bret and Nirmal are joined by friend of the show, Matt Williams, to learn how to run your own local ChatGPT clone and GitHub Copilot clone with Ollama and Docker's "GenAI Stack," to build apps on top of open source LLMs.We've designed this conversation for tech people like myself, who are no strangers to using LLMs in web products like chat GPT, but are curious about running open source generative AI models locally and how they might set up their Docker environment to develop things on top of these open source LLMs.Matt Williams is walking us through all the parts of this solution, and with detailed explanations, shows us how Ollama can make it easier on Mac, Windows, and Linux to set up LLM stacks.Be sure to check out the live recording of the complete show from April 18, 2024 on YouTube (Ep. 262). ★Topics★Creators & Guests Cristi Cotovan - Editor Beth Fisher - Producer Bret Fisher - Host Matt Williams - Host Nirmal Mehta - Host (00:00) - Intro (01:32) - Understanding LLMs and Ollama (03:16) - Ollama's Elevator Pitch (08:40) - Installing and Extending Ollama (17:17) - HuggingFace and Other Libraries (19:24) - Which Model Should You Use? (26:28) - Ollama and Its Applications (28:57) - Retrieval Augmented Generation (RAG) (36:44) - Deploying Models and API Endpoints (40:38) - DockerCon Keynote and LLM Demo (47:44) - Getting Started with Ollama You can also support my free material by subscribing to my YouTube channel and my weekly newsletter at bret.news!Grab the best coupons for my Docker and Kubernetes courses.Join my cloud native DevOps community on Discord.Grab some merch at Bret's Loot BoxHomepage bretfisher.com

Making Money from HuggingFace's Humanoid Robot for Chores

AI Hustle: News on Open AI, ChatGPT, Midjourney, NVIDIA, Anthropic, Open Source LLMs

Play Episode Listen Later Jun 14, 2024 7:02

Jamie and Jaeden discuss the recent release of a household chore robot by Hugging Face and Pollen Robotics, highlighting its potential to save money by replacing expensive human cleaners in Airbnb properties and even automating landscaping businesses. They humorously note the presence of a kill switch button in the demo, suggesting that fully autonomous robots might still be a bit futuristic.

robots airbnb making money chores humanoid huggingface

Episode 35 - Subnet 12 Compute Horde w/ Rhef

Bittensor Guru

Play Episode Listen Later Jun 10, 2024 97:04

With Subnet 12 - Compute Horde recently eclipsing 1000 GPUs and breaking all known limits of subnet design, Rhef and his team are inviting miners into a whole new architecture with multiple GPUs, on-demand requisitioning, a la carte HuggingFace model hosting and custom front ends for user interaction. This episode is best consumed through video as there are a couple screen sharing demos. Link below. Don't miss Rhef's Bittensor network dashboards either. Enjoy! https://x.com/KeithSingery/status/1799588449462128786 https://grafana.bactensor.io/dashboards https://facilitator.computehorde.io https://github.com/backend-developers-ltd/ComputeHorde https://taostats.io/validators/bittensor-guru-podcast/ https://bittensor.guru

ai bitcoin artificial intelligence gurus machine learning tao btc decentralized gpu gpus horde compute huggingface bittensor subnet

Episode 33: Much Ado About 'AI' 'Deception', May 20 2024

Mystery AI Hype Theater 3000

Play Episode Listen Later Jun 5, 2024 60:30 Transcription Available

Will the LLMs somehow become so advanced that they learn to lie to us in order to achieve their own ends? It's the stuff of science fiction, and in science fiction these claims should remain. Emily and guest host Margaret Mitchell, machine learning researcher and chief ethics scientist at HuggingFace, break down why 'AI deception' is firmly a feature of human hype.Reference:Patterns: "AI deception: A survey of examples, risks, and potential solutions"Fresh AI Hell:Adobe's 'ethical' image generator is still pulling from copyrighted materialApple advertising hell: vivid depiction of tech crushing creativity, as if it were good"AI is more creative than 99% of people"AI generated employee handbooks causing chaosBumble CEO: Let AI 'concierge' do your dating for you.Some critiqueYou can check out future livestreams at https://twitch.tv/DAIR_Institute.Subscribe to our newsletter via Buttondown. Follow us!Emily Twitter: https://twitter.com/EmilyMBender Mastodon: https://dair-community.social/@EmilyMBender Bluesky: https://bsky.app/profile/emilymbender.bsky.social Alex Twitter: https://twitter.com/@alexhanna Mastodon: https://dair-community.social/@alex Bluesky: https://bsky.app/profile/alexhanna.bsky.social Music by Toby Menon.Artwork by Naomi Pleasure-Park. Production by Christie Taylor.

music ai chatgpt artificial intelligence production deception blue sky artwork mastodon large language models much ado margaret mitchell huggingface christie taylor

All about community - Dev Survey, Meetup Roundup, and talking with Bill Kennedy

Cup o' Go

Play Episode Listen Later Apr 12, 2024 83:57 Transcription Available

For more info, transcripts, and all the links, visit https://cupogo.dev.

community news survey programming meetup nps software development golang huggingface bill kennedy jonathan hall gowhy

HashiCorp strikes back (News)

The Changelog

Play Episode Listen Later Apr 8, 2024 9:08 Transcription Available

HashiCorp sends OpenTufu a nasty-gram in the wake of Matt Asay's infringement claims, Polar is like Patreon but for software creators, a Common Corpus of LLM data is released on HuggingFace & Loki is an open source tool for fact verification.

loki strikes back polar llm hashicorp huggingface matt asay jerod santo

Google Lumiere, fusão nuclear, e UXL vs. Nvidia – Hipsters: Fora de Controle #51

Hipsters Ponto Tech

Play Episode Listen Later Mar 29, 2024 81:57

O Hipsters: Fora de Controle é o podcast da Alura com notícias sobre Inteligência Artificial aplicada e todo esse novo mundo no qual estamos começando a engatinhar, e que você vai poder explorar conosco! Nesse episódio conversamos com Omer Bar Tal, responsável pela IA geracional de vídeo Lumiere, da Deep Mind, e Cientista Fundador da Pika. Além disso, comentamos as principais novidades de IA da semana, incluindo a possível parceria entre a Apple e o Baidu, a promessa de fusão nuclear até 2028 por uma startup apoiada por Sam Altman, e o novo líder no ranking de LLMs da HuggingFace. Vem ver quem participou desse papo: Fabrício Carraro, Program Manager da Alura, autor de IA e host do podcast Dev Sem Fronteiras André David, Coordenador Acadêmico do Estúdio de Produção Audiovisual da FIAP Henrique Lopes, Desenvolvedor de IA na Alura Nicolas Nobre, Desenvolvedor de IA na Alura Marcus Mendes, o co-host que só veio para a segunda parte do episódio Omer Bar Tal, pesquisador de IA e Founding Scientist da Pika

563: Mike's No Good Very Bad Rails Update

Coder Radio

Play Episode Listen Later Mar 27, 2024 57:16

Mike makes the case for just going vanilla, a look at Google Carbon, and then we address the expensive elephant in the room.

netflix ai apple developers nvidia github no good rails apple vision pro xr helix sbf esm nostr chris fisher development podcast huggingface apple developer text editors barbara fried coder radio

Why Google failed to make GPT-3 + why Multimodal Agents are the path to AGI — with David Luan of Adept

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Mar 22, 2024 41:52

Our next SF event is AI UX 2024 - let's see the new frontier for UX since last year! Last call: we are recording a preview of the AI Engineer World's Fair with swyx and Ben Dunphy, send any questions about Speaker CFPs and Sponsor Guides you have!Alessio is now hiring engineers for a new startup he is incubating at Decibel: Ideal candidate is an “ex-technical co-founder type”. Reach out to him for more!David Luan has been at the center of the modern AI revolution: he was the ~30th hire at OpenAI, he led Google's LLM efforts and co-led Google Brain, and then started Adept in 2022, one of the leading companies in the AI agents space. In today's episode, we asked David for some war stories from his time in early OpenAI (including working with Alec Radford ahead of the GPT-2 demo with Sam Altman, that resulted in Microsoft's initial $1b investment), and how Adept is building agents that can “do anything a human does on a computer" — his definition of useful AGI.Why Google *couldn't* make GPT-3While we wanted to discuss Adept, we couldn't talk to a former VP Eng of OpenAI and former LLM tech lead at Google Brain and not ask about the elephant in the room. It's often asked how Google had such a huge lead in 2017 with Vaswani et al creating the Transformer and Noam Shazeer predicting trillion-parameter models and yet it was David's team at OpenAI who ended up making GPT 1/2/3. David has some interesting answers:“So I think the real story of GPT starts at Google, of course, right? Because that's where Transformers sort of came about. However, the number one shocking thing to me was that, and this is like a consequence of the way that Google is organized…what they (should) have done would be say, hey, Noam Shazeer, you're a brilliant guy. You know how to scale these things up. Here's half of all of our TPUs. And then I think they would have destroyed us. He clearly wanted it too…You know, every day we were scaling up GPT-3, I would wake up and just be stressed. And I was stressed because, you know, you just look at the facts, right? Google has all this compute. Google has all the people who invented all of these underlying technologies. There's a guy named Noam who's really smart, who's already gone and done this talk about how he wants a trillion parameter model. And I'm just like, we're probably just doing duplicative research to what he's doing. He's got this decoder only transformer that's probably going to get there before we do. And it turned out the whole time that they just couldn't get critical mass. So during my year where I led the Google LM effort and I was one of the brain leads, you know, it became really clear why. At the time, there was a thing called the Brain Credit Marketplace. Everyone's assigned a credit. So if you have a credit, you get to buy end chips according to supply and demand. So if you want to go do a giant job, you had to convince like 19 or 20 of your colleagues not to do work. And if that's how it works, it's really hard to get that bottom up critical mass to go scale these things. And the team at Google were fighting valiantly, but we were able to beat them simply because we took big swings and we focused.”Cloning HGI for AGIHuman intelligence got to where it is today through evolution. Some argue that to get to AGI, we will approximate all the “FLOPs” that went into that process, an approach most famously mapped out by Ajeya Cotra's Biological Anchors report:The early days of OpenAI were very reinforcement learning-driven with the Dota project, but that's a very inefficient way for these models to re-learn everything. (Kanjun from Imbue shared similar ideas in her episode).David argues that there's a shortcut. We can bootstrap from existing intelligence.“Years ago, I had a debate with a Berkeley professor as to what will it actually take to build AGI. And his view is basically that you have to reproduce all the flops that went into evolution in order to be able to get there… I think we are ignoring the fact that you have a giant shortcut, which is you can behaviorally clone everything humans already know. And that's what we solved with LLMs!”LLMs today basically model intelligence using all (good!) written knowledge (see our Datasets 101 episode), and have now expanded to non-verbal knowledge (see our HuggingFace episode on multimodality). The SOTA self-supervised pre-training process is surprisingly data-efficient in taking large amounts of unstructured data, and approximating reasoning without overfitting.But how do you cross the gap from the LLMs of today to building the AGI we all want? This is why David & friends left to start Adept.“We believe the clearest framing of general intelligence is a system that can do anything a human can do in front of a computer. A foundation model for actions, trained to use every software tool, API, and webapp that exists, is a practical path to this ambitious goal” — ACT-1 BlogpostCritical Path: Abstraction with ReliabilityThe AGI dream is fully autonomous agents, but there are levels to autonomy that we are comfortable giving our agents, based on how reliable they are. In David's word choice, we always want higher levels of “abstractions” (aka autonomy), but our need for “reliability” is the practical limit on how high of an abstraction we can use.“The critical path for Adept is we want to build agents that can do a higher and higher level abstraction things over time, all while keeping an insanely high reliability standard. Because that's what turns us from research into something that customers want. And if you build agents with really high reliability standard, but are continuing pushing a level of abstraction, you then learn from your users how to get that next level of abstraction faster. So that's how you actually build the data flow. That's the critical path for the company. Everything we do is in service of that.”We saw how Adept thinks about different levels of abstraction at the 2023 Summit:The highest abstraction is the “AI Employee”, but we'll get there with “AI enabled employees”. Alessio recently gave a talk about the future of work with “services as software” at this week's Nvidia GTC (slides).No APIsUnlike a lot of large research labs, Adept's framing of AGI as "being able to use your computer like a human" carries with it a useful environmental constraint:“Having a human robot lets you do things that humans do without changing everything along the way. It's the same thing for software, right? If you go itemize out the number of things you want to do on your computer for which every step has an API, those numbers of workflows add up pretty close to zero. And so then many points along the way, you need the ability to actually control your computer like a human. It also lets you learn from human usage of computers as a source of training data that you don't get if you have to somehow figure out how every particular step needs to be some particular custom private API thing. And so I think this is actually the most practical path (to economic value).”This realization and conviction means that multimodal modals are the way to go. Instead of using function calling to call APIs to build agents, which is what OpenAI and most of the open LLM industry have done to date, Adept wants to “drive by vision”, (aka see the screen as a human sees it) and pinpoint where to click and type as a human does. No APIs needed, because most software don't expose APIs.Extra context for readers: You can see the DeepMind SIMA model in the same light: One system that learned to play a diverse set of games (instead of one dedicated model per game) using only pixel inputs and keyboard-and-mouse action outputs!The OpenInterpreter team is working on a “Computer API” that also does the same.To do this, Adept had to double down on a special kind of multimodality for knowledge work:“A giant thing that was really necessary is really fast multimodal models that are really good at understanding knowledge work and really good at understanding screens. And that is needs to kind of be the base for some of these agents……I think one big hangover primarily academic focus for multimodal models is most multimodal models are primarily trained on like natural images, cat and dog photos, stuff that's come out of the camera… (but) where are they going to be the most useful? They're going to be most useful in knowledge work tasks. That's where the majority of economic value is going to be. It's not in cat and dogs. And so if that's what it is, what do you need to train? I need to train on like charts, graphs, tables, invoices, PDFs, receipts, unstructured data, UIs. That's just a totally different pre-training corpus. And so Adept spent a lot of time building that.”With this context, you can now understand the full path of Adept's public releases:* ACT-1 (Sept 2022): a large Transformers model optimized for browser interactions. It has a custom rendering of the browser viewport that allows it to better understand it and take actions.* Persimmon-8B (Sept 2023): a permissive open LLM (weights and code here)* Fuyu-8B (Oct 2023): a small version of the multimodal model that powers Adept. Vanilla decoder-only transformer with no specialized image encoder, which allows it to handle input images of varying resolutions without downsampling.* Adept Experiments (Nov 2023): A public tool to build automations in the browser. This is powered by Adept's core technology but it's just a piece of their enterprise platform. They use it as a way to try various design ideas.* Fuyu Heavy (Jan 2024) - a new multimodal model designed specifically for digital agents and the world's third-most-capable multimodal model (beating Gemini Pro on MMMU, AI2D, and ChartQA), “behind only GPT4-V and Gemini Ultra, which are 10-20 times bigger”The Fuyu-8B post in particular exhibits a great number of examples on knowledge work multimodality:Why Adept is NOT a Research LabWith OpenAI now worth >$90b and Anthropic >$18b, it is tempting to conclude that the AI startup metagame is to build a large research lab, and attract the brightest minds and highest capital to build AGI. Our past guests (see the Humanloop episode) and (from Imbue) combined to ask the most challenging questions of the pod - with David/Adept's deep research pedigree from Deepmind and OpenAI, why is Adept not building more general foundation models (like Persimmon) and playing the academic benchmarks game? Why is Adept so focused on commercial agents instead?“I feel super good that we're doing foundation models in service of agents and all of the reward within Adept is flowing from “Can we make a better agent”…… I think pure play foundation model companies are just going to be pinched by how good the next couple of (Meta Llama models) are going to be… And then seeing the really big players put ridiculous amounts of compute behind just training these base foundation models, I think is going to commoditize a lot of the regular LLMs and soon regular multimodal models. So I feel really good that we're just focused on agents.”and the commercial grounding is his answer to Kanjun too (whom we also asked the inverse question to compare with Adept):“… the second reason I work at Adept is if you believe that actually having customers and a reward signal from customers lets you build AGI faster, which we really believe, then you should come here. And I think the examples for why that's true is for example, our evaluations are not academic evals. They're not simulator evals. They're like, okay, we have a customer that really needs us to do these particular things. We can do some of them. These are the ones they want us to, we can't do them at all. We've turned those into evals.. I think that's a degree of practicality that really helps.”And his customers seem pretty happy, because David didn't need to come on to do a sales pitch:David: “One of the things we haven't shared before is we're completely sold out for Q1.”Swyx: “Sold out of what?”David: “Sold out of bandwidth to onboard more customers.”Well, that's a great problem to have.Show Notes* David Luan* Dextro at Data Driven NYC (2015)* Adept* ACT-1* Persimmon-8B* Adept Experiments* Fuyu-8B* $350M Series B announcement* Amelia Wattenberger talk at AI Engineer Summit* FigureChapters* [00:00:00] Introductions* [00:01:14] Being employee #30 at OpenAI and its early days* [00:13:38] What is Adept and how do you define AGI?* [00:21:00] Adept's critical path and research directions* [00:26:23] How AI agents should interact with software and impact product development* [00:30:37] Analogies between AI agents and self-driving car development* [00:32:42] Balancing reliability, cost, speed and generality in AI agents* [00:37:30] Potential of foundation models for robotics* [00:39:22] Core research questions and reasons to work at AdeptTranscriptsAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:15]: Hey, and today we have David Luan, CEO, co-founder of Adept in the studio. Welcome.David [00:00:20]: Yeah, thanks for having me.Swyx [00:00:21]: Been a while in the works. I've met you socially at one of those VC events and you said that you were interested in coming on and glad we finally were able to make this happen.David: Yeah, happy to be part of it.Swyx: So we like to introduce the speaker and then also just like have you talk a little bit about like what's not on your LinkedIn, what people should just generally know about you. You started a company in college, which was the first sort of real time video detection classification API that was Dextro, and that was your route to getting acquired into Axon where you're a director of AI. Then you were the 30th hire at OpenAI?David [00:00:53]: Yeah, 30, 35, something around there. Something like that.Swyx [00:00:56]: So you were VP of Eng for two and a half years to two years, briefly served as tech lead of large models at Google, and then in 2022 started Adept. So that's the sort of brief CV. Is there anything else you like want to fill in the blanks or like people should know more about?David [00:01:14]: I guess a broader story was I joined OpenAI fairly early and I did that for about two and a half to three years leading engineering there. It's really funny, I think second or third day of my time at OpenAI, Greg and Ilya pulled me in a room and we're like, you know, you should take over our directs and we'll go mostly do IC work. So that was fun, just coalescing a bunch of teams out of a couple of early initiatives that had already happened. The company, the Dota effort was going pretty hard and then more broadly trying to put bigger picture direction around what we were doing with basic research. So I spent a lot of time doing that. And then I led Google's LLM efforts, but also co-led Google Brain was one of the brain leads more broadly. You know, there's been a couple of different eras of AI research, right? If we count everything before 2012 as prehistory, which people hate it when I say that, kind of had this like you and your three best friends write a research paper that changes the world period from like 2012 to 2017. And I think the game changed in 2017 and like most labs didn't realize it, but we at OpenAI really did. I think in large part helped by like Ilya's constant beating of the drum that the world would be covered in data centers. And I think-Swyx [00:02:15]: It's causally neat.David [00:02:16]: Yeah. Well, like I think we had conviction in that, but it wasn't until we started seeing results that it became clear that that was where we had to go. But also part of it as well was for OpenAI, like when I first joined, I think one of the jobs that I had to do was how do I tell a differentiated vision for who we were technically compared to, you know, hey, we're just smaller Google Brain, or like you work at OpenAI if you live in SF and don't want to commute to Mountain View or don't want to live in London, right? That's like not enough to like hang your technical identity as a company. And so what we really did was, and I spent a lot of time pushing this, is just how do we get ourselves focused on a certain class of like giant swings and bets, right? Like how do you flip the script from you just do bottom-up research to more about how do you like leave some room for that, but really make it about like, what are the big scientific outcomes that you want to show? And then you just solve them at all costs, whether or not you care about novelty and all that stuff. And that became the dominant model for a couple of years, right? And then what's changed now is I think the number one driver of AI products over the next couple of years is going to be the deep co-design and co-evolution of product and users for feedback and actual technology. And I think labs, every tool to go do that are going to do really well. And that's a big part of why I started Adept.Alessio [00:03:20]: You mentioned Dota, any memories thinking from like the switch from RL to Transformers at the time and kind of how the industry was evolving more in the LLM side and leaving behind some of the more agent simulation work?David [00:03:33]: Like zooming way out, I think agents are just absolutely the correct long-term direction, right? You just go to find what AGI is, right? You're like, Hey, like, well, first off, actually, I don't love AGI definitions that involve human replacement because I don't think that's actually how it's going to happen. Even this definition of like, Hey, AGI is something that outperforms humans at economically valuable tasks is kind of implicit view of the world about what's going to be the role of people. I think what I'm more interested in is like a definition of AGI that's oriented around like a model that can do anything a human can do on a computer. If you go think about that, which is like super tractable, then agent is just a natural consequence of that definition. And so what did all the work we did on our own stuff like that get us was it got us a really clear formulation. Like you have a goal and you want to maximize the goal, you want to maximize reward, right? And the natural LLM formulation doesn't come with that out of the box, right? I think that we as a field got a lot right by thinking about, Hey, how do we solve problems of that caliber? And then the thing we forgot is the Novo RL is like a pretty terrible way to get there quickly. Why are we rediscovering all the knowledge about the world? Years ago, I had a debate with a Berkeley professor as to what will it actually take to build AGI. And his view is basically that you have to reproduce all the flops that went into evolution in order to be able to get there. Right.Swyx [00:04:44]: The biological basis theory. Right.David [00:04:46]: So I think we are ignoring the fact that you have a giant shortcut, which is you can behavioral clone everything humans already know. And that's what we solved with LLMs. We've solved behavioral cloning, everything that humans already know. Right. So like today, maybe LLMs is like behavioral cloning every word that gets written on the internet in the future, the multimodal models are becoming more of a thing where behavioral cloning the visual world. But really, what we're just going to have is like a universal byte model, right? Where tokens of data that have high signal come in, and then all of those patterns are like learned by the model. And then you can regurgitate any combination now. Right. So text into voice out, like image into other image out or video out or whatever, like these like mappings, right? Like all just going to be learned by this universal behavioral cloner. And so I'm glad we figured that out. And I think now we're back to the era of how do we combine this with all of the lessons we learned during the RL period. That's what's going to drive progress.Swyx [00:05:35]: I'm still going to pressure you for a few more early opening stories before we turn to the ADET stuff. On your personal site, which I love, because it's really nice, like personal, you know, story context around like your history. I need to update it. It's so old. Yeah, it's so out of date. But you mentioned GPT-2. Did you overlap with GPT-1? I think you did, right?David [00:05:53]: I actually don't quite remember. I think I was joining right around- Right around then?Swyx [00:05:57]: I was right around that, yeah. Yeah. So what I remember was Alec, you know, just kind of came in and was like very obsessed with Transformers and applying them to like Reddit sentiment analysis. Yeah, sentiment, that's right. Take us through-David [00:06:09]: Sentiment neuron, all this stuff.Swyx [00:06:10]: The history of GPT as far as you know, you know, according to you. Ah, okay.David [00:06:14]: History of GPT, according to me, that's a pretty good question. So I think the real story of GPT starts at Google, of course, right? Because that's where Transformers sort of came about. However, the number one shocking thing to me was that, and this is like a consequence of the way that Google is organized, where like, again, you and your three best friends write papers, right? Okay. So zooming way out, right? I think about my job when I was a full-time research leader as a little bit of a portfolio allocator, right? So I've got really, really smart people. My job is to convince people to coalesce around a small number of really good ideas and then run them over the finish line. My job is not actually to promote a million ideas and never have critical mass. And then as the ideas start coming together and some of them start working well, my job is to nudge resources towards the things that are really working and then start disbanding some of the things that are not working, right? That muscle did not exist during my time at Google. And I think had they had it, what they would have done would be say, hey, Noam Shazir, you're a brilliant guy. You know how to scale these things up. Here's half of all of our TPUs. And then I think they would have destroyed us. He clearly wanted it too.Swyx [00:07:17]: He's talking about trillion parameter models in 2017.David [00:07:20]: Yeah. So that's the core of the GPT story, right? Which is that, and I'm jumping around historically, right? But after GPT-2, we were all really excited about GPT-2. I can tell you more stories about that. It was the last paper that I even got to really touch before everything became more about building a research org. You know, every day we were scaling up GPT-3, I would wake up and just be stressed. And I was stressed because, you know, you just look at the facts, right? Google has all this compute. Google has all the people who invented all of these underlying technologies. There's a guy named Noam who's really smart, who's already gone and done this talk about how he wants a trillion parameter model. And I'm just like, we're probably just doing duplicative research to what he's doing, right? He's got this decoder only transformer that's probably going to get there before we do. And I was like, but like, please just like let this model finish, right? And it turned out the whole time that they just couldn't get critical mass. So during my year where I led the Google LM effort and I was one of the brain leads, you know, it became really clear why, right? At the time, there was a thing called the brain credit marketplace. And did you guys know the brain credit marketplace? No, I never heard of this. Oh, so it's actually, it's a, you can ask any Googler.Swyx [00:08:23]: It's like just like a thing that, that, I mean, look like, yeah, limited resources, you got to have some kind of marketplace, right? You know, sometimes it's explicit, sometimes it isn't, you know, just political favors.David [00:08:34]: You could. And so then basically everyone's assigned a credit, right? So if you have a credit, you get to buy end chips according to supply and demand. So if you want to go do a giant job, you had to convince like 19 or 20 of your colleagues not to do work. And if that's how it works, it's really hard to get that bottom up critical mass to go scale these things. And the team at Google were fighting valiantly, but we were able to beat them simply because we took big swings and we focused. And I think, again, that's like part of the narrative of like this phase one of AI, right? Of like this modern AI era to phase two. And I think in the same way, I think phase three company is going to out execute phase two companies because of the same asymmetry of success.Swyx [00:09:12]: Yeah. I think it's underrated how much NVIDIA works with you in the early days as well. I think maybe, I think it was Jensen. I'm not sure who circulated a recent photo of him delivering the first DGX to you guys.David [00:09:24]: I think Jensen has been a complete legend and a mastermind throughout. I have so much respect for NVIDIA. It is unreal.Swyx [00:09:34]: But like with OpenAI, like kind of give their requirements, like co-design it or just work of whatever NVIDIA gave them.David [00:09:40]: So we work really closely with them. There's, I'm not sure I can share all the stories, but examples of ones that I've found particularly interesting. So Scott Gray is amazing. I really like working with him. He was on one of my teams, the supercomputing team, which Chris Berner runs and Chris Berner still does a lot of stuff in that. As a result, like we had very close ties to NVIDIA. Actually, one of my co-founders at Adept, Eric Elson, was also one of the early GPGPU people. So he and Scott and Brian Catanzaro at NVIDIA and Jonah and Ian at NVIDIA, I think all were very close. And we're all sort of part of this group of how do we push these chips to the absolute limit? And I think that kind of collaboration helped quite a bit. I think one interesting set of stuff is knowing the A100 generation, that like quad sparsity was going to be a thing. Is that something that we want to go look into, right? And figure out if that's something that we could actually use for model training. Really what it boils down to is that, and I think more and more people realize this, six years ago, people, even three years ago, people refused to accept it. This era of AI is really a story of compute. It's really the story of how do you more efficiently map actual usable model flops to compute,Swyx [00:10:38]: Is there another GPT 2, 3 story that you love to get out there that you think is underappreciated for the amount of work that people put into it?David [00:10:48]: So two interesting GPT 2 stories. One of them was I spent a good bit of time just sprinting to help Alec get the paper out. And I remember one of the most entertaining moments was we were writing the modeling section. And I'm pretty sure the modeling section was the shortest modeling section of any ML, reasonably legitimate ML paper to that moment. It was like section three model. This is a standard vanilla decoder only transformer with like these particular things, those paragraph long if I remember correctly. And both of us were just looking at the same being like, man, the OGs in the field are going to hate this. They're going to say no novelty. Why did you guys do this work? So now it's funny to look at in hindsight that it was pivotal kind of paper, but I think it was one of the early ones where we just leaned fully into all we care about is solving problems in AI and not about, hey, is there like four different really simple ideas that are cloaked in mathematical language that doesn't actually help move the field forward?Swyx [00:11:42]: Right. And it's like you innovate on maybe like data set and scaling and not so much the architecture.David [00:11:48]: We all know how it works now, right? Which is that there's a collection of really hard won knowledge that you get only by being at the frontiers of scale. And that hard won knowledge, a lot of it's not published. A lot of it is stuff that's actually not even easily reducible to what looks like a typical academic paper. But yet that's the stuff that helps differentiate one scaling program from another. You had a second one? So the second one is, there's like some details here that I probably shouldn't fully share, but hilariously enough for the last meeting we did with Microsoft before Microsoft invested in OpenAI, Sam Altman, myself and our CFO flew up to Seattle to do the final pitch meeting. And I'd been a founder before. So I always had a tremendous amount of anxiety about partner meetings, which this basically this is what it was. I had Kevin Scott and Satya and Amy Hood, and it was my job to give the technical slides about what's the path to AGI, what's our research portfolio, all of this stuff, but it was also my job to give the GPT-2 demo. We had a slightly bigger version of GPT-2 that we had just cut maybe a day or two before this flight up. And as we all know now, model behaviors you find predictable at one checkpoint are not predictable in another checkpoint. And so I'd spent all this time trying to figure out how to keep this thing on rails. I had my canned demos, but I knew I had to go turn it around over to Satya and Kevin and let them type anything in. And that just, that really kept me up all night.Swyx [00:13:06]: Nice. Yeah.Alessio [00:13:08]: I mean, that must have helped you talking about partners meeting. You raised $420 million for Adept. The last round was a $350 million Series B, so I'm sure you do great in partner meetings.Swyx [00:13:18]: Pitchers meetings. Nice.David [00:13:20]: No, that's a high compliment coming from a VC.Alessio [00:13:22]: Yeah, no, I mean, you're doing great already for us. Let's talk about Adept. And we were doing pre-prep and you mentioned that maybe a lot of people don't understand what Adept is. So usually we try and introduce the product and then have the founders fill in the blanks, but maybe let's do the reverse. Like what is Adept? Yeah.David [00:13:38]: So I think Adept is the least understood company in the broader space of foundational models plus agents. So I'll give some color and I'll explain what it is and I'll explain also why it's actually pretty different from what people would have guessed. So the goal for Adept is we basically want to build an AI agent that can do, that can basically help humans do anything a human does on a computer. And so what that really means is we want this thing to be super good at turning natural language like goal specifications right into the correct set of end steps and then also have all the correct sensors and actuators to go get that thing done for you across any software tool that you already use. And so the end vision of this is effectively like I think in a couple of years everyone's going to have access to like an AI teammate that they can delegate arbitrary tasks to and then also be able to, you know, use it as a sounding board and just be way, way, way more productive. Right. And just changes the shape of every job from something where you're mostly doing execution to something where you're mostly actually doing like these core liberal arts skills of what should I be doing and why. Right. And I find this like really exciting and motivating because I think it's actually a pretty different vision for how AGI will play out. I think systems like Adept are the most likely systems to be proto-AGIs. But I think the ways in which we are really counterintuitive to everybody is that we've actually been really quiet because we are not a developer company. We don't sell APIs. We don't sell open source models. We also don't sell bottom up products. We're not a thing that you go and click and download the extension and like we want more users signing up for that thing. We're actually an enterprise company. So what we do is we work with a range of different companies, some like late stage multi-thousand people startups, some fortune 500s, et cetera. And what we do for them is we basically give them an out of the box solution where big complex workflows that their employees do every day could be delegated to the model. And so we look a little different from other companies in that in order to go build this full agent thing, the most important thing you got to get right is reliability. So initially zooming way back when, one of the first things that DEP did was we released this demo called Act One, right? Act One was like pretty cool. It's like kind of become a hello world thing for people to show agent demos by going to Redfin and asking to buy a house somewhere because like we did that in the original Act One demo and like showed that, showed like Google Sheets, all this other stuff. Over the last like year since that has come out, there's been a lot of really cool demos and you go play with them and you realize they work 60% of the time. But since we've always been focused on how do we build an amazing enterprise product, enterprises can't use anything that isn't in the nines of reliability. And so we've actually had to go down a slightly different tech tree than what you might find in the prompt engineering sort of plays in the agent space to get that reliability. And we've decided to prioritize reliability over all else. So like one of our use cases is crazy enough that it actually ends with a physical truck being sent to a place as the result of the agent workflow. And if you're like, if that works like 60% of the time, you're just blowing money and poor truck drivers going places.Alessio [00:16:30]: Interesting. One of the, our investment teams has this idea of services as software. I'm actually giving a talk at NVIDIA GTC about this, but basically software as a service, you're wrapping user productivity in software with agents and services as software is replacing things that, you know, you would ask somebody to do and the software just does it for you. When you think about these use cases, do the users still go in and look at the agent kind of like doing the things and can intervene or like are they totally removed from them? Like the truck thing is like, does the truck just show up or are there people in the middle checking in?David [00:17:04]: I think there's two current flaws in the framing for services as software, or I think what you just said. I think that one of them is like in our experience, as we've been rolling out Adept, the people who actually do the jobs are the most excited about it because they don't go from, I do this job to, I don't do this job. They go from, I do this job for everything, including the shitty rote stuff to I'm a supervisor. And I literally like, it's pretty magical when you watch the thing being used because now it parallelizes a bunch of the things that you had to do sequentially by hand as a human. And you can just click into any one of them and be like, Hey, I want to watch the trajectory that the agent went through to go solve this. And the nice thing about agent execution as opposed to like LLM generations is that a good chunk of the time when the agent fails to execute, it doesn't give you the wrong result. It just fails to execute. And the whole trajectory is just broken and dead and the agent knows it, right? So then those are the ones that the human then goes and solves. And so then they become a troubleshooter. They work on the more challenging stuff. They get way, way more stuff done and they're really excited about it. I think the second piece of it that we've found is our strategy as a company is to always be an augmentation company. And I think one out of principle, that's something we really care about. But two, actually, if you're framing yourself as an augmentation company, you're always going to live in a world where you're solving tasks that are a little too hard for what the model can do today and still needs a human to provide oversight, provide clarifications, provide human feedback. And that's how you build a data flywheel. That's how you actually learn from the smartest humans how to solve things models can't do today. And so I actually think that being an augmentation company forces you to go develop your core AI capabilities faster than someone who's saying, ah, okay, my job is to deliver you a lights off solution for X.Alessio [00:18:42]: Yeah. It's interesting because we've seen two parts of the market. One is we have one company that does agents for SOC analysts. People just don't have them, you know, and just they cannot attract the talent to do it. And similarly, in a software development, you have Copilot, which is the augmentation product, and then you have sweep.dev and you have these products, which they just do the whole thing. I'm really curious to see how that evolves. I agree that today the reliability is so important in the enterprise that they just don't use most of them. Yeah. Yeah. No, that's cool. But it's great to hear the story because I think from the outside, people are like, oh, a dev, they do Act One, they do Persimon, they do Fuyu, they do all this stuff. Yeah, it's just the public stuff.Swyx [00:19:20]: It's just public stuff.David [00:19:21]: So one of the things we haven't shared before is we're completely sold out for Q1. And so I think...Swyx [00:19:26]: Sold out of what?David [00:19:27]: Sold out of bandwidth to go on board more customers. And so we're like working really hard to go make that less of a bottleneck, but our expectation is that I think we're going to be significantly more public about the broader product shape and the new types of customers we want to attract later this year. So I think that clarification will happen by default.Swyx [00:19:43]: Why have you become more public? You know, if the whole push has... You're sold out, you're my enterprise, but you're also clearly putting effort towards being more open or releasing more things.David [00:19:53]: I think we just flipped over that way fairly recently. That's a good question. I think it actually boils down to two things. One, I think that, frankly, a big part of it is that the public narrative is really forming around agents as being the most important thing. And I'm really glad that's happening because when we started the company in January 2022, everybody in the field knew about the agents thing from RL, but the general public had no conception of what it was. They were still hanging their narrative hat on the tree of everything's a chatbot. And so I think now one of the things that I really care about is that when people think agent, they actually think the right thing. All sorts of different things are being called agents. Chatbots are being called agents. Things that make a function call are being called agents. To me, an agent is something that you can give a goal and get an end step workflow done correctly in the minimum number of steps. And so that's a big part of why. And I think the other part is because I think it's always good for people to be more aware of Redept as they think about what the next thing they want to do in their careers. The field is quickly pivoting in a world where foundation models are looking more and more commodity. And I think a huge amount of gain is going to happen from how do you use foundation models as the well-learned behavioral cloner to go solve agents. And I think people who want to do agents research should really come to Redept.Swyx [00:21:00]: When you say agents have become more part of the public narrative, are there specific things that you point to? I'll name a few. Bill Gates in his blog post mentioning that agents are the future. I'm the guy who made OSes, and I think agents are the next thing. So Bill Gates, I'll call that out. And then maybe Sam Altman also saying that agents are the future for open AI.David [00:21:17]: I think before that even, I think there was something like the New York Times, Cade Metz wrote a New York Times piece about it. Right now, in a bit to differentiate, I'm seeing AI startups that used to just brand themselves as an AI company, but now brand themselves as an AI agent company. It's just like, it's a term I just feel like people really want.Swyx [00:21:31]: From the VC side, it's a bit mixed. Is it? As in like, I think there are a lot of VCs where like, I would not touch any agent startups because like- Why is that? Well, you tell me.Alessio [00:21:41]: I think a lot of VCs that are maybe less technical don't understand the limitations of the-Swyx [00:21:46]: No, that's not fair.Alessio [00:21:47]: No, no, no, no. I think like- You think so? No, no. I think like the, what is possible today and like what is worth investing in, you know? And I think like, I mean, people look at you and say, well, these guys are building agents. They needed 400 million to do it. So a lot of VCs are maybe like, oh, I would rather invest in something that is tacking on AI to an existing thing, which is like easier to get the market and kind of get some of the flywheel going. But I'm also surprised a lot of funders just don't want to do agents. It's not even the funding. Sometimes we look around and it's like, why is nobody doing agents for X? Wow.David [00:22:17]: That's good to know actually. I never knew that before. My sense from my limited perspective is there's a new agent company popping up every day.Swyx [00:22:24]: So maybe I'm- They are. They are. But like I have advised people to take agents off of their title because it's so diluted.David [00:22:31]: It's now so diluted.Swyx [00:22:32]: Yeah. So then it doesn't stand for anything. Yeah.David [00:22:35]: That's a really good point.Swyx [00:22:36]: So like, you know, you're a portfolio allocator. You have people know about Persimmon, people know about Fuyu and Fuyu Heavy. Can you take us through like how you think about that evolution of that and what people should think about what that means for adepts and sort of research directions? Kind of take us through the stuff you shipped recently and how people should think about the trajectory of what you're doing.David [00:22:56]: The critical path for adepts is we want to build agents that can do a higher and higher level abstraction things over time, all while keeping an insanely high reliability standard. Because that's what turns us from research into something that customers want. And if you build agents with really high reliability standard, but are continuing pushing a level of abstraction, you then learn from your users how to get that next level of abstraction faster. So that's how you actually build the data flow. That's the critical path for the company. Everything we do is in service of that. So if you go zoom way, way back to Act One days, right? Like the core thing behind Act One is can we teach large model basically how to even actuate your computer? And I think we're one of the first places to have solved that and shown it and shown the generalization that you get when you give it various different workflows and texts. But I think from there on out, we really realized was that in order to get reliability, companies just do things in various different ways. You actually want these models to be able to get a lot better at having some specification of some guardrails for what it actually should be doing. And I think in conjunction with that, a giant thing that was really necessary is really fast multimodal models that are really good at understanding knowledge work and really good at understanding screens. And that is needs to kind of be the base for some of these agents. Back then we had to do a ton of research basically on how do we actually make that possible? Well, first off, like back in forgot exactly one month to 23, like there were no multimodal models really that you could use for things like this. And so we pushed really hard on stuff like the Fuyu architecture. I think one big hangover primarily academic focus for multimodal models is most multimodal models are primarily trained on like natural images, cat and dog photos, stuff that's come out of the camera. Coco. Yeah, right. And the Coco is awesome. Like I love Coco. I love TY. Like it's really helped the field. Right. But like that's the build one thing. I actually think it's really clear today. Multimodal models are the default foundation model, right? It's just going to supplant LLMs. Like you just train a giant multimodal model. And so for that though, like where are they going to be the most useful? They're going to be most useful in knowledge work tasks. That's where the majority of economic value is going to be. It's not in cat and dogs. Right. And so if that's what it is, what do you need to train? I need to train on like charts, graphs, tables, invoices, PDFs, receipts, unstructured data, UIs. That's just a totally different pre-training corpus. And so a depth spent a lot of time building that. And so the public for use and stuff aren't trained on our actual corpus, it's trained on some other stuff. But you take a lot of that data and then you make it really fast and make it really good at things like dense OCR on screens. And then now you have the right like raw putty to go make a good agent. So that's kind of like some of the modeling side, we've kind of only announced some of that stuff. We haven't really announced much of the agent's work, but that if you put those together with the correct product form factor, and I think the product form factor also really matters. I think we're seeing, and you guys probably see this a little bit more than I do, but we're seeing like a little bit of a pushback against the tyranny of chatbots as form factor. And I think that the reason why the form factor matters is the form factor changes what data you collect in the human feedback loop. And so I think we've spent a lot of time doing full vertical integration of all these bits in order to get to where we are.Swyx [00:25:44]: Yeah. I'll plug Amelia Wattenberger's talk at our conference, where she gave a little bit of the thinking behind like what else exists other than chatbots that if you could delegate to reliable agents, you could do. I was kind of excited at Adept experiments or Adept workflows, I don't know what the official name for it is. I was like, okay, like this is something I can use, but it seems like it's just an experiment for now. It's not your product.David [00:26:06]: So you basically just use experiments as like a way to go push various ideas on the design side to some people and just be like, yeah, we'll play with it. Actually the experiments code base underpins the actual product, but it's just the code base itself is kind of like a skeleton for us to go deploy arbitrary cards on the side.Swyx [00:26:22]: Yeah.Alessio [00:26:23]: Makes sense. I was going to say, I would love to talk about the interaction layer. So you train a model to see UI, but then there's the question of how do you actually act on the UI? I think there was some rumors about open app building agents that are kind of like, they manage the end point. So the whole computer, you're more at the browser level. I read in one of your papers, you have like a different representation, kind of like you don't just take the dome and act on it. You do a lot more stuff. How do you think about the best way the models will interact with the software and like how the development of products is going to change with that in mind as more and more of the work is done by agents instead of people?David [00:26:58]: This is, there's so much surface area here and it's actually one of the things I'm really excited about. And it's funny because I've spent most of my time doing research stuff, but there's like a whole new ball game that I've been learning about and I find it really cool. So I would say the best analogy I have to why Adept is pursuing a path of being able to use your computer like a human, plus of course being able to call APIs and being able to call APIs is the easy part, like being able to use your computer like a human is a hard part. It's in the same way why people are excited about humanoid robotics, right? In a world where you had T equals infinity, right? You're probably going to have various different form factors that robots could just be in and like all the specialization. But the fact is that humans live in a human environment. So having a human robot lets you do things that humans do without changing everything along the way. It's the same thing for software, right? If you go itemize out the number of things you want to do on your computer for which every step has an API, those numbers of workflows add up pretty close to zero. And so then many points along the way, you need the ability to actually control your computer like a human. It also lets you learn from human usage of computers as a source of training data that you don't get if you have to somehow figure out how every particular step needs to be some particular custom private API thing. And so I think this is actually the most practical path. I think because it's the most practical path, I think a lot of success will come from going down this path. I kind of think about this early days of the agent interaction layer level is a little bit like, do you all remember Windows 3.1? Like those days? Okay, this might be, I might be, I might be too old for you guys on this. But back in the day, Windows 3.1, we had this transition period between pure command line, right? Being the default into this new world where the GUI is the default and then you drop into the command line for like programmer things, right? The old way was you booted your computer up, DOS booted, and then it would give you the C colon slash thing. And you typed Windows and you hit enter, and then you got put into Windows. And then the GUI kind of became a layer above the command line. The same thing is going to happen with agent interfaces is like today we'll be having the GUI is like the base layer. And then the agent just controls the current GUI layer plus APIs. And in the future, as more and more trust is built towards agents and more and more things can be done by agents, if more UIs for agents are actually generative in and of themselves, then that just becomes a standard interaction layer. And if that becomes a standard interaction layer, what changes for software is that a lot of software is going to be either systems or record or like certain customized workflow execution engines. And a lot of how you actually do stuff will be controlled at the agent layer.Alessio [00:29:19]: And you think the rabbit interface is more like it would like you're not actually seeing the app that the model interacts with. You're just saying, hey, I need to log this call on Salesforce. And you're never actually going on salesforce.com directly as the user. I can see that being a model.David [00:29:33]: I think I don't know enough about what using rabbit in real life will actually be like to comment on that particular thing. But I think the broader idea that, you know, you have a goal, right? The agent knows how to break your goal down into steps. The agent knows how to use the underlying software and systems or record to achieve that goal for you. The agent maybe presents you information in a custom way that's only relevant to your particular goal, all just really leads to a world where you don't really need to ever interface with the apps underneath unless you're a power user for some niche thing.Swyx [00:30:03]: General question. So first of all, I think like the sort of input mode conversation. I wonder if you have any analogies that you like with self-driving, because I do think like there's a little bit of how the model should perceive the world. And you know, the primary split in self-driving is LiDAR versus camera. And I feel like most agent companies that I'm tracking are all moving towards camera approach, which is like the multimodal approach, you know, multimodal vision, very heavy vision, all the Fuyu stuff that you're doing. You're focusing on that, including charts and tables. And do you find that inspiration there from like the self-driving world? That's a good question.David [00:30:37]: I think sometimes the most useful inspiration I've found from self-driving is the levels analogy. I think that's awesome. But I think that our number one goal is for agents not to look like self-driving. We want to minimize the chances that agents are sort of a thing that you just have to bang your head at for a long time to get to like two discontinuous milestones, which is basically what's happened in self-driving. We want to be living in a world where you have the data flywheel immediately, and that takes you all the way up to the top. But similarly, I mean, compared to self-driving, like two things that people really undervalue is like really easy to driving a car down highway 101 in a sunny day demo. That actually doesn't prove anything anymore. And I think the second thing is that as a non-self-driving expert, I think one of the things that we believe really strongly is that everyone undervalues the importance of really good sensors and actuators. And actually a lot of what's helped us get a lot of reliability is a really strong focus on actually why does the model not do this thing? And the non-trivial amount of time, the time the model doesn't actually do the thing is because if you're a wizard of ozzing it yourself, or if you have unreliable actuators, you can't do the thing. And so we've had to fix a lot of those problems.Swyx [00:31:43]: I was slightly surprised just because I do generally consider the way most that we see all around San Francisco as the most, I guess, real case of agents that we have in very material ways.David [00:31:55]: Oh, that's absolutely true. I think they've done an awesome job, but it has taken a long time for self-driving to mature from when it entered the consciousness and the driving down 101 on a sunny day moment happened to now. Right. So I want to see that more compressed.Swyx [00:32:07]: And I mean, you know, cruise, you know, RIP. And then one more thing on just like, just going back on this reliability thing, something I have been holding in my head that I'm curious to get your commentary on is I think there's a trade-off between reliability and generality, or I want to broaden reliability into just general like sort of production readiness and enterprise readiness scale. Because you have reliability, you also have cost, you have speed, speed is a huge emphasis for a debt. The tendency or the temptation is to reduce generality to improve reliability and to improve cost, improve speed. Do you perceive a trade-off? Do you have any insights that solve those trade-offs for you guys?David [00:32:42]: There's definitely a trade-off. If you're at the Pareto frontier, I think a lot of folks aren't actually at the Pareto frontier. I think the way you get there is basically how do you frame the fundamental agent problem in a way that just continues to benefit from data? I think one of the main ways of being able to solve that particular trade-off is you basically just want to formulate the problem such that every particular use case just looks like you collecting more data to go make that use case possible. I think that's how you really solve. Then you get into the other problems like, okay, are you overfitting on these end use cases? You're not doing a thing where you're being super prescriptive for the end steps that the model can only do, for example.Swyx [00:33:17]: Then the question becomes, do you have one house model that you can then customize for each customer and you're fine-tuning them on each customer's specific use case?David [00:33:25]: Yeah.Swyx [00:33:26]: We're not sharing that. You're not sharing that. It's tempting, but that doesn't look like AGI to me. You know what I mean? That is just you have a good base model and then you fine-tune it.David [00:33:35]: For what it's worth, I think there's two paths to a lot more capability coming out of the models that we all are training these days. I think one path is you figure out how to spend, compute, and turn it into data. In that path, I consider search, RL, all the things that we all love in this era as part of that path, like self-play, all that stuff. The second path is how do you get super competent, high intelligence demonstrations from humans? I think the right way to move forward is you kind of want to combine the two. The first one gives you maximum sample efficiency for a little second, but I think that it's going to be hard to be running at max speed towards AGI without actually solving a bit of both.Swyx [00:34:16]: You haven't talked much about synthetic data, as far as I can tell. Probably this is a bit too much of a trend right now, but any insights on using synthetic data to augment the expensive human data?David [00:34:26]: The best part about framing AGI as being able to help people do things on computers is you have an environment.Swyx [00:34:31]: Yes. So you can simulate all of it.David [00:34:35]: You can do a lot of stuff when you have an environment.Alessio [00:34:37]: We were having dinner for our one-year anniversary. Congrats. Yeah. Thank you. Raza from HumanLoop was there, and we mentioned you were coming on the pod. This is our first-Swyx [00:34:45]: So he submitted a question.Alessio [00:34:46]: Yeah, this is our first, I guess, like mailbag question. He asked, when you started GPD 4 Data and Exist, now you have a GPD 4 vision and help you building a lot of those things. How do you think about the things that are unique to you as Adept, and like going back to like the maybe research direction that you want to take the team and what you want people to come work on at Adept, versus what is maybe now become commoditized that you didn't expect everybody would have access to?David [00:35:11]: Yeah, that's a really good question. I think implicit in that question, and I wish he were tier two so he can push back on my assumption about his question, but I think implicit in that question is calculus of where does advantage accrue in the overall ML stack. And maybe part of the assumption is that advantage accrues solely to base model scaling. But I actually believe pretty strongly that the way that you really win is that you have to go build an agent stack that is much more than that of the base model itself. And so I think like that is always going to be a giant advantage of vertical integration. I think like it lets us do things like have a really, really fast base model, is really good at agent things, but is bad at cat and dog photos. It's pretty good at cat and dog photos. It's not like soda at cat and dog photos, right? So like we're allocating our capacity wisely, right? That's like one thing that you really get to do. I also think that the other thing that is pretty important now in the broader foundation modeling space is I feel despite any potential concerns about how good is agents as like a startup area, right? Like we were talking about earlier, I feel super good that we're doing foundation models in service of agents and all of the reward within Adept is flowing from can we make a better agent? Because right now I think we all see that, you know, if you're training on publicly available web data, you put in the flops and you do reasonable things, then you get decent results. And if you just double the amount of compute, then you get predictably better results. And so I think pure play foundation model companies are just going to be pinched by how good the next couple of llamas are going to be and the next what good open source thing. And then seeing the really big players put ridiculous amounts of compute behind just training these base foundation models, I think is going to commoditize a lot of the regular LLMs and soon regular multimodal models. So I feel really good that we're just focused on agents.Swyx [00:36:56]: So you don't consider yourself a pure play foundation model company?David [00:36:59]: No, because if we were a pure play foundation model company, we would be training general foundation models that do summarization and all this other...Swyx [00:37:06]: You're dedicated towards the agent. Yeah.David [00:37:09]: And our business is an agent business. We're not here to sell you tokens, right? And I think like selling tokens, unless there's like a...Swyx [00:37:14]: Not here to sell you tokens. I love it.David [00:37:16]: It's like if you have a particular area of specialty, right? Then you won't get caught in the fact that everyone's just scaling to ridiculous levels of compute. But if you don't have a specialty, I find that, I think it's going to be a little tougher.Swyx [00:37:27]: Interesting. Are you interested in robotics at all? Just a...David [00:37:30]: I'm personally fascinated by robotics. I've always loved robotics.Swyx [00:37:33]: Embodied agents as a business, you know, Figure is like a big, also sort of open AI affiliated company that raises a lot of money.David [00:37:39]: I think it's cool. I think, I mean, I don't know exactly what they're doing, but...Swyx [00:37:44]: Robots. Yeah.David [00:37:46]: Well, I mean, that's a...Swyx [00:37:47]: Yeah. What question would you ask? If we had them on, what would you ask them?David [00:37:50]: Oh, I just want to understand what their overall strategy is going to be between now and when there's reliable stuff to be deployed. But honestly, I just don't know enough about it.Swyx [00:37:57]: And if I told you, hey, fire your entire warehouse workforce and, you know, put robots in there, isn't that a strategy? Oh yeah.David [00:38:04]: Yeah. Sorry. I'm not questioning whether they're doing smart things. I genuinely don't know what they're doing as much, but I think there's two things. One, I'm so excited for someone to train a foundation model of robots. It's just, I think it's just going to work. Like I will die on this hill, but I mean, like again, this whole time, like we've been on this podcast, we're just going to continually saying these models are basically behavioral cloners. Right. So let's go behavioral clone all this like robot behavior. Right. And then you figure out everything else you have to do in order to teach you how to solve a new problem. That's going to work. I'm super stoked for that. I think unlike what we're doing with helping humans with knowledge work, it just sounds like a more zero sum job replacement play. Right. And I'm personally less excited about that.Alessio [00:38:46]: We had a Ken June from InBoo on the podcast. We asked her why people should go work there and not at Adept.Swyx [00:38:52]: Oh, that's so funny.Alessio [00:38:54]: Well, she said, you know, there's space for everybody in this market. We're all doing interesting work. And she said, they're really excited about building an operating system for agent. And for her, the biggest research thing was like getting models, better reasoning and planning for these agents. The reverse question to you, you know, why should people be excited to come work at Adept instead of InBoo? And maybe what are like the core research questions that people should be passionate about to have fun at Adept? Yeah.David [00:39:22]: First off, I think that I'm sure you guys believe this too. The AI space to the extent there's an AI space and the AI agent space are both exactly as she likely said, I think colossal opportunities and people are just going to end up winning in different areas and a lot of companies are going to do well. So I really don't feel that zero something at all. I would say to like change the zero sum framing is why should you be at Adept? I think there's two huge reasons to be at Adept. I think one of them is everything we do is in the service of like useful agents. We're not a research lab. We do a lot of research in service of that goal, but we don't think about ourselves as like a classic research lab at all. And I think the second reason I work at Adept is if you believe that actually having customers and a reward signal from customers lets you build a GI faster, which we really believe, then you should come here. And I think the examples for why that's true is for example, our evaluations, they're not academic evals. They're not simulator evals. They're like, okay, we have a customer that really needs us to do these particular things. We can do some of them. These are the ones they want us to, we can't do them at all. We've turned those into evals, solve it, right? I think that's really cool. Like everybody knows a lot of these evals are like pretty saturated and the new ones that even are not saturated. You look at someone and you're like, is this actually useful? Right? I think that's a degree of practicality that really helps. Like we're equally excited about the same problems around reasoning and planning and generalization and all of this stuff. They're very grounded in actual needs right now, which is really cool.Swyx [00:40:45]: Yeah. This has been a wonderful dive. You know, I wish we had more time, but I would just leave it kind of open to you. I think you have broad thoughts, you know, just about

ceo history ai google san francisco new york times data seattle reach microsoft robots balancing reddit act figure failed exist windows sold bill gates cfo berkeley cto future of work rip vc coco transformers cv openai salesforce sf residence nvidia ux mining api gi chatbots pitchers makes gpt ui ml apis transformer embodied vanilla flops vcs ic ogs sentiment copilot llm sam altman lidar pdfs agi eng soc pareto dota ocr anthropic mountain view raza series b deepmind ilya noam google sheets alessio satya analogies dep redfin rl sota googlers luan multimodal axon adept act one uis datasets smol kevin scott persimmon google brain agis gpd a100 oses nvidia gtc imbue huggingface gemini pro david yeah amy hood david one dgx gemini ultra latent space fuyu dextro ajeya cotra gpgpu

APIs and AI Gateways

The Cloudcast

Play Episode Listen Later Feb 14, 2024 33:29

Marco Palladino (@subnetmarco, CTO/Co-Founder @thekonginc) talks about the evolution of APIs, the need for consolidation and integration of AI APIs, and introduces the AI Gateway.SHOW: 795CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotwNEW TO CLOUD? CHECK OUT OUR OTHER PODCAST - "CLOUDCAST BASICS"SHOW SPONSORS:Find "Breaking Analysis Podcast with Dave Vellante" on Apple, Google and SpotifyKeep up to date with Enterprise Tech with theCUBESHOW NOTES:Kong (homepage)What is an AI Gateway?Hugging Face (AI model community)Topic 1 - Welcome to the show. Before we dive into today's discussion, tell us a little bit about your background at Kong and prior to Kong.Topic 2 - Now that the world is obsessed with AI and AI Models, what roles do APIs continue to play for applications? Topic 3 - You've lived through the API marketplace days. Do you see a lot of differences between API marketplaces and model marketplaces (e.g. like HuggingFace)?Topic 4 - You've talked recently about the concept of an AI Gateway, somewhat as an evolution of an API Gateway. Walk us through this new concept.Topic 5 - API Gateways can provide a lot of protection for known things by looking into various headers, security keys, and packet data. Can you imagine AI Gateways being able to have insights into model data to do things like Prompt Control, Hallucination Control, etc.?Topic 6 - We've seen the merging of API Gateways (North/South) and Service Mesh (East/West) traffic as new application patterns emerged. Do you think we'll see new traffic patterns again with AI traffic and AI model interactions? FEEDBACK?Email: show at the cloudcast dot netTwitter: @cloudcastpodInstagram: @cloudcastpodTikTok: @cloudcastpod

ai google apple walk co founders security kong api apis authentication gateways ai models enterprise tech api gateway huggingface dave vellante

Google and HuggingFace Partner in Boost to Open Source AI

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Jan 26, 2024 17:00

A new partnership between Google Cloud and the 'Github for AI/ML models' team up to make open source tools more accessible to developers. Plus, the FTC is investigating Google, Microsoft, Amazon, Anthropic and OpenAI on antitrust concerns. LEARN AI THIS YEAR! Registration is very briefly open for our February cohort of our AI Education Beta program. Get access to a library of 60+ tutorials, case studies and challenges New lessons drop every week day Join a passionate community of like-minded learners Topics include LLMs, AI nocode tools, image generators, voice synthesizers, AI for professional applications like presentation generation, website building and more. Learn more and sign up here: https://bit.ly/aibeta Registration closes on Sunday January 28th 11:59pm EST ** ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/

amazon ai google partner microsoft boost openai ftc google cloud anthropic ai ml open source ai huggingface

Podcasts about huggingface

Best podcasts about huggingface

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Thinking Elixir Podcast

The Nonlinear Library

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Bittensor Guru

Papers Read on AI

The Gradient Podcast

The Machine Learning Podcast

GPT Reviews

The top AI news from the past week, every ThursdAI

We Decentralize Tech

programmier.bar – der Podcast für App- und Webentwicklung

The Nonlinear Library: LessWrong

Latest news about huggingface

Latest podcast episodes about huggingface

LCC 330 - Nano banana l'AI de Julia

Trust at Scale: Security and Governance for Open Source Models // Hudson Buzby // #338

腾讯混元翻译模型登顶HuggingFace全球热榜

SANS Stormcast Friday, September 5th, 2025: Cloudflare Response to 1.1.1.1 Certificate; AI Modem Namespace Reuse; macOS Vulnerability Allowed Keychain Decryption

Episode 56: DeepMind Just Dropped Gemma 270M... And Here's Why It Matters

921: AI Coding Roadmap for Newbies (And Skeptics)

ИИ-МехаГитлер, Вайфу, Тян и Задача Двух Бабушек / Китайский опенсорс снова на коне /AIA Podcast #114

The PHP Podcast: 2025.07.17

Content Consolidation, AI Browsers, and Mixed Economic Signals

En fait, les LLMs ne stagnent pas du tout — Grégoire Mialon (Meta) & Clémentine Fourrier (HuggingFace)

Эмоции у ИИ и бои РОБОТОВ! / Anthropic против OpenAI, новый Gemini и закрытие Arc / AIA Podcast #112

The Rise of AI for Serious Sellers — Insights from Danny & Dorian I Part 4

Первый ИИ в суде и в ТАНКЕ! / Gemini 2.5 Pro I/O, Qwen 3 и никакой Атаки Титанов / AIA Podcast #110

HuggingFace Buys Pollen Robotics, DHH & Bezos Founder Advice & a JCal Origin Story | E2111

Relocalisation : Nvidia fait le pari américain – 15/04

Actionable AI for Marketers – The Human in the Loop With Britney Muller

[EN] ByteSized RSE: AI assisted coding - with Liam (Jianliang) Gao

208.开源VS闭源：谁将变成AI的主流？DeepSeek安卓时刻后的竞争、套壳和商业化

Ce bras open source à 110 € va bouleverser la robotique — Rémi Cadene (HuggingFace)

EDyO 96 - Fosdem 2025

S2E6 - Chutes.ai Subnet 64 w/ Namoray and Jon Durbin

2024 in Synthetic Data and Smol Models [LS Live @ NeurIPS]

EP 54. 深度对谈顶尖AI开源项目：大模型开源生态, Agent 与中国力量

Bolt.new, Flow Engineering for Code Agents, and >$8m ARR in 2 months as a Claude Wrapper

In the Arena: How LMSys changed LLM Benchmarking Forever

Building the AI Engineer Nation — with Josephine Teo, Minister of Digital Development and Information, Singapore

#418 - Clément Delangue - Hugging Face - 4,5 milliards de valo avec un produit gratuit à 99%

LW - LLM Applications I Want To See by sarahconstantin

AI Magic: Shipping 1000s of successful products with no managers and a team of 12 — Jeremy Howard of Answer.ai

The Winds of AI Winter (Q2 Four Wars Recap) + ChatGPT Voice Mode Preview

Llama 2, 3 & 4: Synthetic Data, RLHF, Agents on the path to Open Source AGI

LW - Unlearning via RMU is mostly shallow by Andy Arditi

Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge

AF - Representation Tuning by Christopher Ackerman

Local GenAI LLMs with Ollama and Docker

Making Money from HuggingFace's Humanoid Robot for Chores

Episode 35 - Subnet 12 Compute Horde w/ Rhef

Episode 33: Much Ado About 'AI' 'Deception', May 20 2024

All about community - Dev Survey, Meetup Roundup, and talking with Bill Kennedy

HashiCorp strikes back (News)

Google Lumiere, fusão nuclear, e UXL vs. Nvidia – Hipsters: Fora de Controle #51

563: Mike's No Good Very Bad Rails Update

Why Google failed to make GPT-3 + why Multimodal Agents are the path to AGI — with David Luan of Adept

APIs and AI Gateways

Google and HuggingFace Partner in Boost to Open Source AI