Podcasts about langchain

  • 110 PODCASTS
  • 185 EPISODES
  • 50m AVG DURATION
  • 5 WEEKLY NEW EPISODES
  • LATEST: Jun 2, 2025

POPULARITY

[Popularity chart by year, 2017–2024]


Best podcasts about langchain

Latest podcast episodes about langchain

Talk Python To Me - Python conversations for passionate developers
#507: Agentic AI Workflows with LangGraph

Talk Python To Me - Python conversations for passionate developers

Jun 2, 2025 · 63:59 · Transcription Available


If you want to leverage the power of LLMs in your Python apps, you would be wise to consider an agentic framework. An agentic approach empowers the LLM to use tools and take further action based on what it has learned so far, and a framework provides the building blocks to weave these capabilities into your apps, with features like long-term memory and durable resumability. I'm excited to have Sydney Runkle back on the podcast to dive into building Python apps with LangChain and LangGraph.

Episode sponsors: Posit, Auth0, Talk Python Courses

Links from the show:
  • Sydney Runkle: linkedin.com
  • LangGraph: github.com
  • LangChain: langchain.com
  • LangGraph Studio: github.com
  • LangGraph (Web): langchain.com
  • LangGraph Tutorials Introduction: langchain-ai.github.io
  • How to Think About Agent Frameworks: blog.langchain.dev
  • Human in the Loop Concept: langchain-ai.github.io
  • GPT-4 Prompting Guide: cookbook.openai.com
  • Watch this episode on YouTube: youtube.com
  • Episode transcripts: talkpython.fm

--- Stay in touch with us ---
  • Subscribe to Talk Python on YouTube: youtube.com
  • Talk Python on Bluesky: @talkpython.fm at bsky.app
  • Talk Python on Mastodon: talkpython
  • Michael on Bluesky: @mkennedy.codes at bsky.app
  • Michael on Mastodon: mkennedy
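For readers who want a feel for what a LangGraph workflow looks like in code, here is a minimal sketch, assuming the langgraph package is installed; the state fields, node name, and stub logic are illustrative stand-ins, not taken from the episode.

```python
# Minimal LangGraph-style graph: one node that produces an answer.
# Assumes `pip install langgraph`; swap the node body for a real LLM call
# (e.g. via langchain) in an actual app.
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class AgentState(TypedDict):
    question: str
    answer: str


def answer_node(state: AgentState) -> dict:
    # Placeholder logic so the sketch runs without an API key;
    # a real agent node would call an LLM and possibly tools here.
    return {"answer": f"(stub) You asked: {state['question']}"}


builder = StateGraph(AgentState)
builder.add_node("answer", answer_node)
builder.add_edge(START, "answer")
builder.add_edge("answer", END)
graph = builder.compile()

print(graph.invoke({"question": "Why use an agentic framework?"}))
```

Real workflows add more nodes, conditional edges, and a checkpointer for the long-term memory and resumability discussed in the episode.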

5 Minutes Podcast with Ricardo Vargas
The Rise of AI Agents in Project Work

5 Minutes Podcast with Ricardo Vargas

May 25, 2025 · 3:30


In this episode, Ricardo discusses how AI Agents are transforming project management. Unlike traditional tools, these agents are autonomous, understand context, make decisions, and interact with people and systems to deliver value. With the advancement of models like ChatGPT and platforms such as LangChain, Crew AI, and Google NotebookLM, building smart agents has become much easier. They can update schedules, write meeting notes, draft emails, generate reports, and monitor risks—all integrated with tools like Notion, Slack, Trello, and Google Docs. This shift changes the project manager's role to that of an “AI orchestrator.” However, caution is needed due to potential errors, hallucinations, and data security concerns. AI isn't here to replace project managers but to empower them to focus on what truly matters. Listen to the podcast to learn more!

5 Minutes Podcast com Ricardo Vargas
A Ascensão dos Agentes de IA na Gestão de Projetos

5 Minutes Podcast com Ricardo Vargas

May 25, 2025 · 4:13


In this episode, Ricardo shows how artificial intelligence agents (AI Agents) are revolutionizing project management. Unlike traditional automations, these agents are autonomous: they interpret context, make decisions, and interact with tools such as Notion, Slack, Trello, and Google Docs. With the advance of models like ChatGPT and platforms like LangChain, Crew AI, and NotebookLM, it has become much easier to create agents that understand natural language and act autonomously. They can update schedules, generate meeting minutes, write emails, and suggest actions. The manager's role shifts from executor to AI orchestrator. There are risks, however, such as errors and hallucinations, which require human oversight. AI does not replace the project manager, but it frees up time for more strategic decisions. Listen to the podcast to learn more!

The New Stack Podcast
Google Cloud Next Wrap-Up

The New Stack Podcast

May 22, 2025 · 18:22


At the close of this year's Google Cloud Next, The New Stack's Alex Williams, AI editor Frederic Lardinois, and analyst Janakiram MSV discussed the event's dominant theme: AI agents. The conversation focused heavily on agent frameworks, noting a shift from last year's third-party tools like Langchain, CrewAI, and Microsoft's Autogen, to first-party offerings from model providers themselves. Google's newly announced Agent Development Kit (ADK) highlights this trend, following closely on the heels of OpenAI's agent SDK. MSV emphasized the significance of this shift, calling it a major milestone as Google joins the race alongside Microsoft and OpenAI. Despite the buzz, Lardinois pointed out that many companies are still exploring how AI agents can fit into real-world workflows. The panel also highlighted how Google now delivers a full-stack AI development experience — from models to deployment platforms like Vertex AI. New enterprise tools like Agent Space and Agent Garden further signal Google's commitment to making agents a core part of modern software development.

Learn more from The New Stack about the latest in AI agents:
  • How AI Agents Will Change the Web for Users and Developers
  • AI Agents: A Comprehensive Introduction for Developers
  • AI Agents Are Coming for Your SaaS Stack

Join our community of newsletter subscribers to stay on top of the news and at the top of your game.

MLOps.community
A Candid Conversation Around MCP and A2A // Rahul Parundekar and Sam Partee // #316 SF Live

MLOps.community

May 21, 2025 · 64:42


Demetrios, Sam Partee, and Rahul Parundekar unpack the chaos of AI agent tools and the evolving world of MCP (Model Context Protocol). With sharp insights and plenty of laughs, they dig into tool permissions, security quirks, agent memory, and the messy path to making agents actually useful.

// Bio

Sam Partee
Sam Partee is the CTO and Co-Founder of Arcade AI. Previously a Principal Engineer leading the Applied AI team at Redis, Sam led the effort in creating the ecosystem around Redis as a vector database. He is a contributor to multiple OSS projects including Langchain, DeterminedAI, LlamaIndex and Chapel amongst others. While at Cray/HPE he created the SmartSim AI framework which is now used at national labs around the country to integrate HPC simulations like climate models with AI.

Rahul Parundekar
Rahul Parundekar is the founder of AI Hero. He graduated with a Master's in Computer Science from USC Los Angeles in 2010, and embarked on a career focused on Artificial Intelligence. From 2010-2017, he worked as a Senior Researcher at Toyota ITC working on agent autonomy within vehicles. His journey continued as the Director of Data Science at FigureEight (later acquired by Appen), where he and his team developed an architecture supporting over 36 ML models and managing over a million predictions daily. Since 2021, he has been working on AI Hero, aiming to democratize AI access, while also consulting on LLMOps (Large Language Model Operations) and AI system scalability. Other than his full-time role as a founder, he is also passionate about community engagement, and actively organizes MLOps events in SF, and contributes educational content on RAG and LLMOps at learn.mlops.community.

// Related Links
Websites: arcade.dev | aihero.studio

~~~~~~~~ ✌️ Connect With Us ✌️ ~~~~~~~
Catch all episodes, blogs, newsletters, and more: https://go.mlops.community/TY
Explore MLOps Swag/Merch: https://shop.mlops.community/
Connect with Demetrios on LinkedIn: /dpbrinkm
Connect with Rahul on LinkedIn: /rparundekar
Connect with Sam on LinkedIn: /sampartee

Timestamps:
[00:00] Agents & Tools, Explained (Without Melting Your Brain)
[09:51] MVP Servers: Why Everything's on Fire (and How to Fix It)
[13:18] Can We Actually Trust the Protocol?
[18:13] KYC, But Make It AI (and Less Painful)
[25:25] Web Automation Tests: The Bugs Strike Back
[28:18] MCP Dev: What Went Wrong (and What Saved Us)
[33:53] Social Login: One Button to Rule Them All
[39:33] What Even Is an AI-Native Developer?
[42:21] Betting Big on Smarter Models (High Risk, High Reward)
[51:40] Harrison's Bold New Tactic (With Real-Life Magic Tricks)
[55:31] Async Task Handoffs: Herding Cats, But Digitally
[1:00:37] Getting AI to Actually Help Your Workflow
[1:03:53] The Infamous Varma System Error (And How We Dodge It)

Generation AI
Model Context Protocol (MCP): Making AI Agents Talk to Your Data

Generation AI

May 13, 2025 · 31:07


In this insightful episode of Generation AI, hosts Ardis Kadiu and JC Bonilla tackle Model Context Protocol (MCP), a new standardization that's gaining rapid adoption across the AI industry. They explain how MCP functions as a universal adapter between AI models and data sources, solving the "Frankenstein middleware" problem that makes building AI agents so complex today. The hosts break down why this matters for higher education professionals, how it reduces hallucinations by improving data access, and why major players like OpenAI, Google, and HubSpot are already implementing it. This episode offers critical insight into how standardization will make AI tools more useful and less complex for everyone.

What is Model Context Protocol (MCP)? (00:01:00)
  • Introduction to MCP as a standardization protocol for AI agents
  • Hosts explain MCP as a way to help AI access context and data
  • The "three-legged stool" of AI agents: intelligence, context, and action
  • MCP provides the standard for how agents communicate with data sources

MCP as the Universal AI Adapter (00:04:42)
  • JC compares MCP to standardized protocols like TCP/IP and USB-C
  • MCP sits between models like Claude or Gemini and various data sources
  • It eliminates the need for custom connectors between each tool and AI model
  • The protocol's simplicity as a minimal viable product (MVP) is key to its success

How MCP Works (00:07:03)
  • MCP is a protocol, not an API, that describes format and flow
  • "Discovery first" approach where AI asks "what can you do?"
  • Uses JSON format for tools and data exchange
  • Works both locally and remotely over HTTP

The Technical Benefits of MCP (00:13:14)
  • Solves the "m by n headache" of needing separate connectors for each model-tool pair
  • Reduces hallucinations by providing AI with reliable data sources
  • Gives AI models access to specialized tools for tasks they struggle with
  • Enables "grounding" in real data rather than making things up

Industry Adoption and Momentum (00:17:14)
  • OpenAI, Google, HubSpot, LangChain and others already implementing MCP
  • HubSpot's beta MCP server allows for direct CRM data access in Claude
  • Growing availability for tools like Slack, Teams, and Zapier
  • Discussion of how MCP layers on top of existing APIs

Practical Applications (00:20:36)
  • Higher education examples: connecting LMS, advisor notes, financial aid systems
  • Sales use case: AI agents accessing CRM data through MCP for follow-ups
  • DevOps: AI monitoring logs, creating tickets, and managing communication
  • Analytics: Connecting data sources, models, and reporting tools seamlessly

Challenges and Considerations (00:23:17)
  • MCP requires widespread adoption to be truly effective
  • Product teams must be convinced to implement it alongside existing APIs
  • Possibility that another protocol might eventually win out
  • Current technical hurdles for implementation that are being addressed

Call to Action for Listeners (00:26:03)
  • Experiment with MCP servers that connect to Claude desktop
  • For AI product builders: write MCP servers for your applications now
  • Ask AI vendors: "Do you speak MCP?" as a signal of cutting-edge capability
  • MCP as the new standard, comparable to asking "Do you have an API?"

Conclusion: The Future of AI Integration (00:29:14)
  • MCP's architectural implications for more open, modular AI systems
  • The need for agents to speak a common language across platforms
  • Invitation for listeners to share which workflows they'll connect once MCP goes mainstream

- - - -

Connect With Our Co-Hosts:
Ardis Kadiu: https://www.linkedin.com/in/ardis/ | https://twitter.com/ardis
Dr. JC Bonilla: https://www.linkedin.com/in/jcbonilla/ | https://twitter.com/jbonillx

About The Enrollify Podcast Network: Generation AI is a part of the Enrollify Podcast Network. If you like this podcast, chances are you'll like other Enrollify shows too! Enrollify is made possible by Element451 — the next-generation AI student engagement platform helping institutions create meaningful and personalized interactions with students. Learn more at element451.com.

Attend the 2025 Engage Summit! The Engage Summit is the premier conference for forward-thinking leaders and practitioners dedicated to exploring the transformative power of AI in education. Explore the strategies and tools to step into the next generation of student engagement, supercharged by AI. You'll leave ready to deliver the most personalized digital engagement experience every step of the way. Register now to secure your spot in Charlotte, NC, on June 24-25, 2025! Early bird registration ends February 1st -- https://engage.element451.com/register
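To make the "discovery first" idea above concrete, here is a hedged sketch of the request and response shapes involved. The messages follow the JSON-RPC style of the published MCP spec as best understood here, and the tool name and schema are hypothetical; check field names against the current specification before relying on them.

```python
# A sketch of MCP-style discovery and tool invocation: the client asks a
# server what it can do, then calls one of the advertised tools.
# All names below are illustrative placeholders.
import json

discover_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",  # "what can you do?"
}

# A server might answer with a catalogue of the tools it exposes.
discover_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "lookup_student_record",  # hypothetical tool
                "description": "Fetch a student record from the SIS",
                "inputSchema": {
                    "type": "object",
                    "properties": {"student_id": {"type": "string"}},
                    "required": ["student_id"],
                },
            }
        ]
    },
}

# The client then invokes a tool it discovered.
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "lookup_student_record",
        "arguments": {"student_id": "12345"},
    },
}

print(json.dumps(call_request, indent=2))
```

The same exchange works over a local connection or remotely over HTTP, which is what lets one connector replace the custom "m by n" integrations the hosts describe.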

Convergence
AI-Driven Development: Driving adoption on your product teams, Team Culture, and AI-Native Engineering Practices

Convergence

May 7, 2025 · 44:22


How do you move from dabbling with AI and vibe coding to building real, production-grade software with it? In this episode, Austin Vance, CEO of Focused, returns, and we transition the conversation from building AI-enabled applications to fostering AI-native engineering teams. Austin shares how generative AI isn't just a shortcut—it's reshaping how we architect, code, and lead. We also get to hear Austin's thoughts on the leaked 'AI Mandate' memo from Shopify's CEO, Tobi Lutke. We cover what Austin refers to as 'AI-driven development', how to win over the skeptics on your teams, and why traditional patterns of software engineering might not be the best fit for LLM-driven workflows. Whether you're an engineer, product leader, or startup founder, this episode will give you a practical lens on what AI-native software development actually requires—and how to foster adoption on your teams quickly and safely to get the benefits of using AI in product delivery.

Unlock the full potential of your product team with Integral's player coaches, experts in lean, human-centered design. Visit integral.io/convergence for a free Product Success Lab workshop to gain clarity and confidence in tackling any product design or engineering challenge.

Inside the episode...
  • Why Shopify's leaked AI memo was a "permission slip" for your own team
  • The three personas in AI adoption: advocates, skeptics, and holdouts
  • How AI-driven development (AIDD) differs from "AI-assisted" workflows
  • Tools and practices Focused uses to ship faster and cheaper with AI
  • Pair programming vs. pairing with an LLM: similarities and mindset shifts
  • How teams are learning to prompt effectively—without prompt engineering training
  • Vibe coding vs. integrating with entrenched systems: what's actually feasible
  • Scaling engineering culture around non-determinism and experimentation
  • Practical tips for onboarding dev teams to tools like Cursor, Windsurf, and Vercel AI SDK
  • Using LLMs for deep codebase exploration, not just code generation

Mentioned in this episode: Cursor, Windsurf, LangChain, Claude, GPT-4 / ChatGPT, V0.dev, GitHub Copilot, Focused (focused.io), Shopify internal AI memo

Unlock the full potential of your product team with Integral's player coaches, experts in lean, human-centered design. Visit integral.io/convergence for a free Product Success Lab workshop to gain clarity and confidence in tackling any product design or engineering challenge.

Subscribe to the Convergence podcast wherever you get podcasts including video episodes to get updated on the other crucial conversations that we'll post on YouTube at youtube.com/@convergencefmpodcast. Learn something? Give us a 5 star review and like the podcast on YouTube. It's how we grow.

Follow the Pod
  • LinkedIn: https://www.linkedin.com/company/convergence-podcast/
  • X: https://twitter.com/podconvergence
  • Instagram: @podconvergence

Angular Master Podcast
AMP 72: The Future Is AI + Angular – Here's Why with Nir Kaufman

Angular Master Podcast

May 7, 2025 · 25:46


Just dropped a fresh episode of the Angular Master Podcast – and it's a must-listen for every frontend developer thinking about the future. This time I'm joined by the one and only Nir Kaufman — Google Developer Expert, international speaker, tech lead at Tikal, and the brilliant mind behind our newest initiative:

Convergence
Building Agentic Apps With Craft: Field Stories from Austin Vance, CEO, Co-Founder of Focused

Convergence

May 1, 2025 · 56:49


What does it actually take to build agentic AI applications that hold up in the real world? In this episode, Ashok sits down with Austin, founder of Focused, to share field stories and hard-won lessons from building AI systems that go beyond flashy demos. From legal assistants to government transparency tools, Austin breaks down the concrete criteria for identifying where AI makes sense — and where it doesn't. They unpack how to find the right starting point for your first agentic app, why integration with legacy systems is the real hurdle, and the engineering must-haves that keep AI behavior safe and reliable. You'll hear practical guidance on designing eval frameworks, using abstraction layers like LangChain, and how observability can shape your development roadmap just like in traditional software. Whether you're a product leader or a CTO, this conversation will help you distinguish hype from real opportunity in AI.

Unlock the full potential of your product team with Integral's player coaches, experts in lean, human-centered design. Visit integral.io/convergence for a free Product Success Lab workshop to gain clarity and confidence in tackling any product design or engineering challenge.

Inside the episode...
  • A practical checklist for identifying your first AI-powered app
  • The hidden cost of "AI for AI's sake" and where traditional software is better
  • Why repetitive knowledge work is prime territory for automation
  • How Focused helped Hamlet build an AI for parsing government meeting data
  • Where read-only data access gives you a safe starting point
  • Why integration is often more complex than the AI itself
  • The importance of eval frameworks and test-driven LLM development
  • How to use observability to continuously improve AI agent behavior
  • Speed vs. believability: surprising lessons from Groq-powered inference
  • Using multiple models in one system and LLMs to QA each other

Mentioned in this episode:
  • Hamlet (government transparency startup) - https://www.myhamlet.com/?convergence
  • LangChain - https://www.langchain.com/?convergence
  • Groq - https://groq.com/?convergence
  • Claude (Anthropic) - https://claud.ai/?convergence
  • Dspy Prompting framework - https://dspy.ai/?convergence
  • Shopify AI memo (referenced) - https://convergence.fm/episode/shopifys-leaked-ai-mandate-explained-6-takeaways-for-your-product-team?convergence
  • Amazon Bedrock / SageMaker - https://aws.amazon.com/bedrock/?convergence

Unlock the full potential of your product team with Integral's player coaches, experts in lean, human-centered design. Visit integral.io/convergence for a free Product Success Lab workshop to gain clarity and confidence in tackling any product design or engineering challenge.

Subscribe to the Convergence podcast wherever you get podcasts including video episodes to get updated on the other crucial conversations that we'll post on YouTube at youtube.com/@convergencefmpodcast. Learn something? Give us a 5 star review and like the podcast on YouTube. It's how we grow.

Follow the Pod
  • LinkedIn: https://www.linkedin.com/company/convergence-podcast/
  • X: https://twitter.com/podconvergence
  • Instagram: @podconvergence
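As a companion to the eval-framework and test-driven LLM development points above, here is a minimal, framework-free sketch of the idea; run_agent is a hypothetical stand-in for whatever agent or chain you are evaluating, and the cases and checks are deliberately simple placeholders.

```python
# A tiny eval harness: run the agent over fixed cases and report pass/fail.
# `run_agent` is a placeholder for your own LangChain chain, LangGraph app,
# or raw API call; the cases and checks here are illustrative only.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable
    label: str


def run_agent(prompt: str) -> str:
    # Stub so the harness runs end to end; replace with a real agent call.
    return f"Echo: {prompt}"


CASES = [
    EvalCase("Summarize the May 5 council meeting",
             lambda out: "council" in out.lower(),
             "mentions the meeting"),
    EvalCase("List agenda items as bullets",
             lambda out: len(out) > 0,
             "returns something"),
]


def run_evals() -> None:
    failures = 0
    for case in CASES:
        output = run_agent(case.prompt)
        ok = case.check(output)
        failures += 0 if ok else 1
        print(f"[{'PASS' if ok else 'FAIL'}] {case.label}")
    print(f"{len(CASES) - failures}/{len(CASES)} checks passed")


if __name__ == "__main__":
    run_evals()
```

Running a harness like this on every change, and feeding failures back from production observability, is the workflow the episode argues for.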

DigitalFeeling
Episode 114- Agents IA : ce que l'IA peut (vraiment) faire pour vous

DigitalFeeling

Apr 26, 2025 · 12:14


In this 114th episode, I explore a timely question: how can we use AI concretely today, beyond chatbots and buzzwords? Whether you are a trainer, marketer, entrepreneur, or simply curious about technology, this episode offers a clear, practical dive into how autonomous, intelligent AI systems work.

The MongoDB Podcast
EP. 263 Building Agents with Natural Language with guest

The MongoDB Podcast

Apr 25, 2025 · 71:04


**(Note: Spotify listeners can also watch the screen sharing video accompanying the audio. Other podcast platforms offer the audio-only version.)**

In this episode of MongoDB Podcast Live, host Shane McAllister is joined by Sachin Hejip from Dataworkz. Sachin will showcase "Dataworkz Agent Builder", which is built with MongoDB Atlas Vector Search, and demonstrate how it can use Natural Language to create Agents and, in turn, automate and simplify the creation of Agentic RAG applications. Sachin will demo the MongoDB Leafy Portal Chatbot Agent, which combines operational data with unstructured data for personalised customer experience and support, built using Dataworkz and MongoDB.

Struggling with millions of unstructured documents, legacy records, or scattered data formats? Discover how AI, Large Language Models (LLMs), and MongoDB are revolutionizing data management in this episode of the MongoDB Podcast.

Join host Shane McAllister and the team as they delve into tackling complex data challenges using cutting-edge technology. Learn how MongoDB Atlas Vector Search enables powerful semantic search and Retrieval Augmented Generation (RAG) applications, transforming chaotic information into valuable insights. Explore integrations with popular frameworks like Langchain and Llama Index.

Find out how to efficiently process and make sense of your unstructured data, potentially saving significant costs and unlocking new possibilities. Ready to dive deeper?

#MongoDB #AI #LLM #LargeLanguageModels #VectorSearch #AtlasVectorSearch #UnstructuredData #Podcast #DataManagement #Dataworkz #Observability #Developer #BigData #RAG
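For a sense of what the Atlas Vector Search pattern mentioned above looks like in code, here is a hedged Python sketch using pymongo's $vectorSearch aggregation stage; the connection string, database, collection, index name, and embedding function are all placeholders rather than details from the episode.

```python
# Semantic retrieval with MongoDB Atlas Vector Search via an aggregation
# pipeline. Assumes a vector index named "vector_index" over an "embedding"
# field; all names and the connection string below are placeholders.
from pymongo import MongoClient


def embed(text: str) -> list[float]:
    # Placeholder embedding; in practice call an embedding model
    # (OpenAI, Cohere, Voyage, ...) and return its vector.
    return [0.0] * 1536


client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")
collection = client["support"]["articles"]

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",       # name of the Atlas vector index
            "path": "embedding",           # field holding document vectors
            "queryVector": embed("How do I reset my password?"),
            "numCandidates": 100,
            "limit": 5,
        }
    },
    {"$project": {"_id": 0, "title": 1, "body": 1}},
]

for doc in collection.aggregate(pipeline):
    print(doc.get("title"))
```

In a RAG setup, the returned documents would be appended to the LLM prompt, whether you wire that up by hand or through an integration such as LangChain or LlamaIndex.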

Oracle University Podcast
Integrating APEX with OCI AI Services

Oracle University Podcast

Apr 22, 2025 · 20:01


Discover how Oracle APEX leverages OCI AI services to build smarter, more efficient applications. Hosts Lois Houston and Nikita Abraham interview APEX experts Chaitanya Koratamaddi, Apoorva Srinivas, and Toufiq Mohammed about how key services like OCI Vision, Oracle Digital Assistant, and Document Understanding integrate with Oracle APEX.   Packed with real-world examples, this episode highlights all the ways you can enhance your APEX apps.   Oracle APEX: Empowering Low Code Apps with AI: https://mylearn.oracle.com/ou/course/oracle-apex-empowering-low-code-apps-with-ai/146047/ Oracle University Learning Community: https://education.oracle.com/ou-community LinkedIn: https://www.linkedin.com/showcase/oracle-university/ X: https://x.com/Oracle_Edu   Special thanks to Arijit Ghosh, David Wright, Kris-Ann Nansen, Radhika Banka, and the OU Studio Team for helping us create this episode.   ---------------------------------------------------------------   Episode Transcript: 00:00 Welcome to the Oracle University Podcast, the first stop on your cloud journey. During this series of informative podcasts, we'll bring you foundational training on the most popular Oracle technologies. Let's get started! 00:25 Lois: Hello and welcome to the Oracle University Podcast. I'm Lois Houston, Director of Innovation Programs with Oracle University, and with me is Nikita Abraham, Team Lead: Editorial Services. Nikita: Hi everyone! Last week, we looked at how generative AI powers Oracle APEX and in today's episode, we're going to focus on integrating APEX with OCI AI Services. Lois: That's right, Niki. We're going to look at how you can use Oracle AI services like OCI Vision, Oracle Digital Assistant, Document Understanding, OCI Generative AI, and more to enhance your APEX apps. 01:03 Nikita: And to help us with it all, we've got three amazing experts with us, Chaitanya Koratamaddi, Director of Product Management at Oracle, and senior product managers, Apoorva Srinivas and Toufiq Mohammed. In today's episode, we'll go through each Oracle AI service and look at how it interacts with APEX. Apoorva, let's start with you. Can you explain what the OCI Vision service is? Apoorva: Oracle Cloud Infrastructure Vision is a serverless multi-tenant service accessible using the console or REST APIs. You can upload images to detect and classify objects in them. With prebuilt models available, developers can quickly build image recognition into their applications without machine learning expertise. OCI Vision service provides a fully managed model infrastructure. With complete integration with OCI Data Labeling, you can build custom models easily. OCI Vision service provides pretrained models-- Image Classification, Object Detection, Face Detection, and Text Recognition. You can build custom models for Image Classification and Object Detection. 02:24 Lois: Ok. What about its use cases? How can OCI Vision make APEX apps more powerful? Apoorva: Using OCI Vision, you can make images and videos discoverable and searchable in your APEX app.  You can use OCI Vision to detect and classify objects in the images. OCI Vision also highlights the objects using a red rectangular box. This comes in handy in use cases such as detecting vehicles that have violated the rules in traffic images. You can use OCI Vision to identify visual anomalies in your data. This is a very popular use case where you can detect anomalies in cancer X-ray images to detect cancer. These are some of the most popular use cases of using OCI Vision with your APEX app. 
But the possibilities are endless and you can use OCI Vision for any of your image analysis. 03:29 Nikita: Let's shift gears to Oracle Digital Assistant. Chaitanya, can you tell us what it's all about? Chaitanya: Oracle Digital Assistant is a low-code conversational AI platform that allows businesses to build and deploy AI assistants. It provides natural language understanding, automatic speech recognition, and text-to-speech capabilities to enable human-like interactions with customers and employees. Oracle Digital Assistant comes with prebuilt templates for you to get started.  04:00 Lois: What are its key features and benefits, Chaitanya? How does it enhance the user experience? Chaitanya: Oracle Digital Assistant provides conversational AI capabilities that include generative AI features, natural language understanding and ML, AI-powered voice, and analytics and insights. Integration with enterprise applications become easier with unified conversational experience, prebuilt chatbots for Oracle Cloud applications, and chatbot architecture frameworks. Oracle Digital Assistant provides advanced conversational design tools, conversational designer, dialogue and domain trainer, and native multilingual support. Oracle Digital Assistant is open, scalable, and secure. It provides multi-channel support, automated bot-to-agent transfer, and integrated authentication profile. 04:56 Nikita: And what about the architecture? What happens at the back end? Chaitanya: Developers assemble digital assistants from one or more skills. Skills can be based on prebuilt skills provided by Oracle or third parties, custom developed, or based on one of the many skill templates available. 05:16 Lois: Chaitanya, what exactly are “skills” within the Oracle Digital Assistant framework?  Chaitanya: Skills are individual chatbots that are designed to interact with users and fulfill specific type of tasks. Each skill helps a user complete a task through a combination of text messages and simple UI elements like select list. When a user request is submitted through a channel, the Digital Assistant routes the user's request to the most appropriate skill to satisfy the user's request. Skills can combine multilingual NLP deep learning engine, a powerful dialogflow engine, and integration components to connect to back-end systems.  Skills provide a modular way to build your chatbot functionality. Now users connect with a chatbot through channels such as Facebook, Microsoft Teams, or in our case, Oracle APEX chatbot, which is embedded into an APEX application. 06:21 Nikita: That's fascinating. So, what are some use cases of Oracle Digital Assistant in APEX apps? Chaitanya: Digital assistants streamline approval processes by collecting information, routing requests, and providing status updates. Digital assistants offer instant access to information and documentation, answering common questions and guiding users. Digital assistants assist sales teams by automating tasks, responding to inquiries, and guiding prospects through the sales funnel. Digital assistants facilitate procurement by managing orders, tracking deliveries, and handling supplier communication. Digital assistants simplify expense approvals by collecting reports, validating receipts, and routing them for managerial approval. Digital assistants manage inventory by tracking stock levels, reordering supplies, and providing real-time inventory updates. Digital assistants have become a common UX feature in any enterprise application. 
07:28 Want to learn how to design stunning, responsive enterprise applications directly from your browser with minimal coding? The new Oracle APEX Developer Professional learning path and certification enables you to leverage AI-assisted development, including generative AI and Database 23ai, to build secure, scalable web and mobile applications with advanced AI-powered features. From now through May 15, 2025, we're waiving the certification exam fee (valued at $245). So, what are you waiting for? Visit mylearn.oracle.com to get started today. 08:09 Nikita: Welcome back! Thanks for that, Chaitanya. Toufiq, let's talk about the OCI Document Understanding service. What is it? Toufiq: Using this service, you can upload documents to extract text, tables, and other key data. This means the service can automatically identify and extract relevant information from various types of documents, such as invoices, receipts, contracts, etc. The service is serverless and multitenant, which means you don't need to manage any servers or infrastructure. You can access this service using the console, REST APIs, SDK, or CLI, giving you multiple ways to integrate. 08:55 Nikita: What do we use for APEX apps?  Toufiq: For APEX applications, we will be using REST APIs to integrate the service. Additionally, you can process individual files or batches of documents using the ProcessorJob API endpoint. This flexibility allows you to handle different volumes of documents efficiently, whether you need to process a single document or thousands at once. With these capabilities, the OCI Document Understanding service can significantly streamline your document processing tasks, saving time and reducing the potential for manual errors. 09:36 Lois: Ok.  What are the different types of models available? How do they cater to various business needs? Toufiq: Let us start with pre-trained models. These are ready-to-use models that come right out of the box, offering a range of functionalities. The available models are Optical Character Recognition (OCR) enables the service to extract text from documents, allowing you to digitize, scan the documents effortlessly. You can precisely extract text content from documents. Key-value extraction, useful in streamlining tasks like invoice processing. Table extraction can intelligently extract tabular data from documents. Document classification automatically categorizes documents based on their content. OCR PDF enables seamless extraction of text from PDF files. Now, what if your business needs go beyond these pre-trained models. That's where custom models come into play. You have the flexibility to train and build your own models on top of these foundational pre-trained models. Models available for training are key value extraction and document classification. 10:50 Nikita: What does the architecture look like for OCI Document Understanding? Toufiq: You can ingest or supply the input file in two different ways. You can upload the file to an OCI Object Storage location. And in your request, you can point the Document Understanding service to pick the file from this Object Storage location.  Alternatively, you can upload a file directly from your computer. Once the file is uploaded, the Document Understanding service can process the file and extract key information using the pre-trained models. You can also customize models to tailor the extraction to your data or use case. 
After processing the file, the Document Understanding service stores the results in JSON format in the Object Storage output bucket. Your Oracle APEX application can then read the JSON file from the Object Storage output location, parse the JSON, and store useful information at local table or display it on the screen to the end user. 11:52 Lois: And what about use cases? How are various industries using this service? Toufiq: In financial services, you can utilize Document Understanding to extract data from financial statements, classify and categorize transactions, identify and extract payment details, streamline tax document management. Under manufacturing, you can perform text extraction from shipping labels and bill of lading documents, extract data from production reports, identify and extract vendor details. In the healthcare industry, you can automatically process medical claims, extract patient information from forms, classify and categorize medical records, identify and extract diagnostic codes. This is not an exhaustive list, but provides insights into some industry-specific use cases for Document Understanding. 12:50 Nikita: Toufiq, let's switch to the big topic everyone's excited about—the OCI Generative AI Service. What exactly is it? Toufiq: OCI Generative AI is a fully managed service that provides a set of state of the art, customizable large language models that cover a wide range of use cases. It provides enterprise grade generative AI with data governance and security, which means only you have access to your data and custom-trained models. OCI Generative AI provides pre-trained out-of-the-box LLMs for text generation, summarization, and text embedding. OCI Generative AI also provides necessary tools and infrastructure to define models with your own business knowledge. 13:37 Lois: Generally speaking, how is OCI Generative AI useful?  Toufiq: It supports various large language models. New models available from Meta and Cohere include Llama2 developed by Meta, and Cohere's Command model, their flagship text generation model. Additionally, Cohere offers the Summarize model, which provides high-quality summaries, accurately capturing essential information from documents, and the Embed model, converting text to vector embeddings representation. OCI Generative AI also offers dedicated AI clusters, enabling you to host foundational models on private GPUs. It integrates LangChain and open-source framework for developing new interfaces for generative AI applications powered by language models. Moreover, OCI Generative AI facilitates generative AI operations, providing content moderation controls, zero downtime endpoint model swaps, and endpoint deactivation and activation capabilities. For each model endpoint, OCI Generative AI captures a series of analytics, including call statistics, tokens processed, and error counts. 14:58 Nikita: What about the architecture? How does it handle user input? Toufiq: Users can input natural language, input/output examples, and instructions. The LLM analyzes the text and can generate, summarize, transform, extract information, or classify text according to the user's request. The response is sent back to the user in the specified format, which can include raw text or formatting like bullets and numbering, etc. 15:30 Lois: Can you share some practical use cases for generative AI in APEX apps?  Toufiq: Some of the OCI generative AI use cases for your Oracle APEX apps include text summarization. 
Generative AI can quickly summarize lengthy documents such as articles, transcripts, doctor's notes, and internal documents. Businesses can utilize generative AI to draft marketing copy, emails, blog posts, and product descriptions efficiently. Generative AI-powered chatbots are capable of brainstorming, problem solving, and answering questions. With generative AI, content can be rewritten in different styles or languages. This is particularly useful for localization efforts and catering to diverse audience. Generative AI can classify intent in customer chat logs, support tickets, and more. This helps businesses understand customer needs better and provide tailored responses and solutions. By searching call transcripts, internal knowledge sources, Generative AI enables businesses to efficiently answer user queries. This enhances information retrieval and decision-making processes. 16:47 Lois: Before we let you go, can you explain what Select AI is? How is it different from the other AI services? Toufiq: Select AI is a feature of Autonomous Database. This is where Select AI differs from the other AI services. Be it OCI Vision, Document Understanding, or OCI Generative AI, these are all freely managed standalone services on Oracle Cloud, accessible via REST APIs. Whereas Select AI is a feature available in Autonomous Database. That means to use Select AI, you need Autonomous Database.  17:26 Nikita: And what can developers do with Select AI? Toufiq: Traditionally, SQL is the language used to query the data in the database. With Select AI, you can talk to the database and get insights from the data in the database using human language. At the very basic, what Select AI does is it generates SQL queries using natural language, like an NL2SQL capability.  17:52 Nikita: How does it actually do that? Toufiq: When a user asks a question, the first step Select AI does is look into the AI profile, which you, as a developer, define. The AI profile holds crucial information, such as table names, the LLM provider, and the credentials needed to authenticate with the LLM service. Next, Select AI constructs a prompt. This prompt includes information from the AI profile and the user's question.  Essentially, it's a packet of information containing everything the LLM service needs to generate SQL. The next step is generating SQL using LLM. The prompt prepared by Select AI is sent to the available LLM services via REST. Which LLM to use is configured in the AI profile. The supported providers are OpenAI, Cohere, Azure OpenAI, and OCI Generative AI. Once the SQL is generated by the LLM service, it is returned to the application. The app can then handle the SQL query in various ways, such as displaying the SQL results in a report format or as charts, etc.  19:05 Lois: This has been an incredible discussion! Thank you, Chaitanya, Apoorva, and Toufiq, for walking us through all of these amazing AI tools. If you're ready to dive deeper, visit mylearn.oracle.com and search for the Oracle APEX: Empowering Low Code Apps with AI course. You'll find step-by-step guides and demos for everything we covered today.  Nikita: Until next week, this is Nikita Abraham… Lois: And Lois Houston signing off! 19:31 That's all for this episode of the Oracle University Podcast. If you enjoyed listening, please click Subscribe to get all the latest episodes. We'd also love it if you would take a moment to rate and review us on your podcast app. See you again on the next episode of the Oracle University Podcast.  
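As a small illustration of the Document Understanding flow Toufiq describes (results written as JSON to an Object Storage output bucket, then read by the application), here is a hedged Python sketch using the OCI SDK; the namespace, bucket, and object names are hypothetical, and an APEX app would typically do the equivalent step with REST calls rather than Python.

```python
# Read a Document Understanding result file from an OCI Object Storage
# output bucket and inspect it. Assumes `pip install oci` and a configured
# ~/.oci/config; all names below are placeholders.
import json

import oci

config = oci.config.from_file()  # default profile
object_storage = oci.object_storage.ObjectStorageClient(config)

namespace = "my-namespace"                # placeholder
bucket = "doc-understanding-output"       # placeholder
object_name = "results/invoice-123.json"  # placeholder

response = object_storage.get_object(namespace, bucket, object_name)
result = json.loads(response.data.content)

# The exact JSON layout depends on which models you ran (OCR, key-value,
# table extraction, ...); this just shows the generic parse-and-use step
# a client performs before storing fields in a local table or displaying them.
print(json.dumps(result, indent=2)[:500])
```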

The MongoDB Podcast
EP. 262 Solving Unstructured Data Challenges with AI & Vector Search

The MongoDB Podcast

Apr 16, 2025 · 56:30


**(Note: Spotify listeners can also watch the screen sharing video accompanying the audio. Other podcast platforms offer the audio-only version.)**

Struggling with millions of unstructured documents, legacy records, or scattered data formats? Discover how AI, Large Language Models (LLMs), and MongoDB are revolutionizing data management in this episode of the MongoDB Podcast.

Join host Shane McAllister and the team as they delve into tackling complex data challenges using cutting-edge technology. Learn how MongoDB Atlas Vector Search enables powerful semantic search and Retrieval Augmented Generation (RAG) applications, transforming chaotic information into valuable insights. Explore integrations with popular frameworks like Langchain and Llama Index.

Find out how to efficiently process and make sense of your unstructured data, potentially saving significant costs and unlocking new possibilities. Ready to dive deeper?

#MongoDB #AI #LLM #LargeLanguageModels #VectorSearch #AtlasVectorSearch #UnstructuredData #Podcast #DataManagement #RAG #SemanticSearch #Langchain #LlamaIndex #Developer #BigData

Data Gen
#196 - Pigment : Monter l'équipe GenAI appliquée au Produit (Licorne, +230 millions levés)

Data Gen

Apr 14, 2025 · 29:46


Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

If you're in SF: Join us for the Claude Plays Pokemon hackathon this Sunday! If you're not: Fill out the 2025 State of AI Eng survey for $250 in Amazon cards!

We are SO excited to share our conversation with Dharmesh Shah, co-founder of HubSpot and creator of Agent.ai.

A particularly compelling concept we discussed is the idea of "hybrid teams" - the next evolution in workplace organization where human workers collaborate with AI agents as team members. Just as we previously saw hybrid teams emerge in terms of full-time vs. contract workers, or in-office vs. remote workers, Dharmesh predicts that the next frontier will be teams composed of both human and AI members. This raises interesting questions about team dynamics, trust, and how to effectively delegate tasks between human and AI team members.

The discussion of business models in AI reveals an important distinction between Work as a Service (WaaS) and Results as a Service (RaaS), something Dharmesh has written extensively about. While RaaS has gained popularity, particularly in customer support applications where outcomes are easily measurable, Dharmesh argues that this model may be over-indexed. Not all AI applications have clearly definable outcomes or consistent economic value per transaction, making WaaS more appropriate in many cases. This insight is particularly relevant for businesses considering how to monetize AI capabilities.

The technical challenges of implementing effective agent systems are also explored, particularly around memory and authentication. Shah emphasizes the importance of cross-agent memory sharing and the need for more granular control over data access. He envisions a future where users can selectively share parts of their data with different agents, similar to how OAuth works but with much finer control.
This points to significant opportunities in developing infrastructure for secure and efficient agent-to-agent communication and data sharing.

Other highlights from our conversation:
  • The Evolution of AI-Powered Agents – Exploring how AI agents have evolved from simple chatbots to sophisticated multi-agent systems, and the role of MCPs in enabling that.
  • Hybrid Digital Teams and the Future of Work – How AI agents are becoming teammates rather than just tools, and what this means for business operations and knowledge work.
  • Memory in AI Agents – The importance of persistent memory in AI systems and how shared memory across agents could enhance collaboration and efficiency.
  • Business Models for AI Agents – Exploring the shift from software as a service (SaaS) to work as a service (WaaS) and results as a service (RaaS), and what this means for monetization.
  • The Role of Standards Like MCP – Why MCP has been widely adopted and how it enables agent collaboration, tool use, and discovery.
  • The Future of AI Code Generation and Software Engineering – How AI-assisted coding is changing the role of software engineers and what skills will matter most in the future.
  • Domain Investing and Efficient Markets – Dharmesh's approach to domain investing and how inefficiencies in digital asset markets create business opportunities.
  • The Philosophy of Saying No – Lessons from "Sorry, You Must Pass" and how prioritization leads to greater productivity and focus.

Timestamps:
  • 00:00 Introduction and Guest Welcome
  • 02:29 Dharmesh Shah's Journey into AI
  • 05:22 Defining AI Agents
  • 06:45 The Evolution and Future of AI Agents
  • 13:53 Graph Theory and Knowledge Representation
  • 20:02 Engineering Practices and Overengineering
  • 25:57 The Role of Junior Engineers in the AI Era
  • 28:20 Multi-Agent Systems and MCP Standards
  • 35:55 LinkedIn's Legal Battles and Data Scraping
  • 37:32 The Future of AI and Hybrid Teams
  • 39:19 Building Agent AI: A Professional Network for Agents
  • 40:43 Challenges and Innovations in Agent AI
  • 45:02 The Evolution of UI in AI Systems
  • 01:00:25 Business Models: Work as a Service vs. Results as a Service
  • 01:09:17 The Future Value of Engineers
  • 01:09:51 Exploring the Role of Agents
  • 01:10:28 The Importance of Memory in AI
  • 01:11:02 Challenges and Opportunities in AI Memory
  • 01:12:41 Selective Memory and Privacy Concerns
  • 01:13:27 The Evolution of AI Tools and Platforms
  • 01:18:23 Domain Names and AI Projects
  • 01:32:08 Balancing Work and Personal Life
  • 01:35:52 Final Thoughts and Reflections

Transcript

Alessio [00:00:04]: Hey everyone, welcome back to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Small AI.swyx [00:00:12]: Hello, and today we're super excited to have Dharmesh Shah to join us. I guess your relevant title here is founder of Agent AI.Dharmesh [00:00:20]: Yeah, that's true for this. Yeah, creator of Agent.ai and co-founder of HubSpot.swyx [00:00:25]: Co-founder of HubSpot, which I followed for many years, I think 18 years now, gonna be 19 soon. And you caught, you know, people can catch up on your HubSpot story elsewhere. I should also thank Sean Puri, who I've chatted with back and forth, who's been, I guess, getting me in touch with your people. But also, I think like, just giving us a lot of context, because obviously, My First Million joined you guys, and they've been chatting with you guys a lot. So for the business side, we can talk about that, but I kind of wanted to engage your CTO, agent, engineer side of things.
So how did you get agent religion?Dharmesh [00:01:00]: Let's see. So I've been working, I'll take like a half step back, a decade or so ago, even though actually more than that. So even before HubSpot, the company I was contemplating that I had named for was called Ingenisoft. And the idea behind Ingenisoft was a natural language interface to business software. Now realize this is 20 years ago, so that was a hard thing to do. But the actual use case that I had in mind was, you know, we had data sitting in business systems like a CRM or something like that. And my kind of what I thought clever at the time. Oh, what if we used email as the kind of interface to get to business software? And the motivation for using email is that it automatically works when you're offline. So imagine I'm getting on a plane or I'm on a plane. There was no internet on planes back then. It's like, oh, I'm going through business cards from an event I went to. I can just type things into an email just to have them all in the backlog. When it reconnects, it sends those emails to a processor that basically kind of parses effectively the commands and updates the software, sends you the file, whatever it is. And there was a handful of commands. I was a little bit ahead of the times in terms of what was actually possible. And I reattempted this natural language thing with a product called ChatSpot that I did back 20...swyx [00:02:12]: Yeah, this is your first post-ChatGPT project.Dharmesh [00:02:14]: I saw it come out. Yeah. And so I've always been kind of fascinated by this natural language interface to software. Because, you know, as software developers, myself included, we've always said, oh, we build intuitive, easy-to-use applications. And it's not intuitive at all, right? Because what we're doing is... We're taking the mental model that's in our head of what we're trying to accomplish with said piece of software and translating that into a series of touches and swipes and clicks and things like that. And there's nothing natural or intuitive about it. And so natural language interfaces, for the first time, you know, whatever the thought is you have in your head and expressed in whatever language that you normally use to talk to yourself in your head, you can just sort of emit that and have software do something. And I thought that was kind of a breakthrough, which it has been. And it's gone. So that's where I first started getting into the journey. I started because now it actually works, right? So once we got ChatGPT and you can take, even with a few-shot example, convert something into structured, even back in the ChatGP 3.5 days, it did a decent job in a few-shot example, convert something to structured text if you knew what kinds of intents you were going to have. And so that happened. And that ultimately became a HubSpot project. But then agents intrigued me because I'm like, okay, well, that's the next step here. So chat's great. Love Chat UX. But if we want to do something even more meaningful, it felt like the next kind of advancement is not this kind of, I'm chatting with some software in a kind of a synchronous back and forth model, is that software is going to do things for me in kind of a multi-step way to try and accomplish some goals. So, yeah, that's when I first got started. It's like, okay, what would that look like? Yeah. And I've been obsessed ever since, by the way.Alessio [00:03:55]: Which goes back to your first experience with it, which is like you're offline. Yeah. And you want to do a task. 
You don't need to do it right now. You just want to queue it up for somebody to do it for you. Yes. As you think about agents, like, let's start at the easy question, which is like, how do you define an agent? Maybe. You mean the hardest question in the universe? Is that what you mean?Dharmesh [00:04:12]: You said you have an irritating take. I do have an irritating take. I think, well, some number of people have been irritated, including within my own team. So I have a very broad definition for agents, which is it's AI-powered software that accomplishes a goal. Period. That's it. And what irritates people about it is like, well, that's so broad as to be completely non-useful. And I understand that. I understand the criticism. But in my mind, if you kind of fast forward months, I guess, in AI years, the implementation of it, and we're already starting to see this, and we'll talk about this, different kinds of agents, right? So I think in addition to having a usable definition, and I like yours, by the way, and we should talk more about that, that you just came out with, the classification of agents actually is also useful, which is, is it autonomous or non-autonomous? Does it have a deterministic workflow? Does it have a non-deterministic workflow? Is it working synchronously? Is it working asynchronously? Then you have the different kind of interaction modes. Is it a chat agent, kind of like a customer support agent would be? You're having this kind of back and forth. Is it a workflow agent that just does a discrete number of steps? So there's all these different flavors of agents. So if I were to draw it in a Venn diagram, I would draw a big circle that says, this is agents, and then I have a bunch of circles, some overlapping, because they're not mutually exclusive. And so I think that's what's interesting, and we're seeing development along a bunch of different paths, right? So if you look at the first implementation of agent frameworks, you look at Baby AGI and AutoGBT, I think it was, not Autogen, that's the Microsoft one. They were way ahead of their time because they assumed this level of reasoning and execution and planning capability that just did not exist, right? So it was an interesting thought experiment, which is what it was. Even the guy that, I'm an investor in Yohei's fund that did Baby AGI. It wasn't ready, but it was a sign of what was to come. And so the question then is, when is it ready? And so lots of people talk about the state of the art when it comes to agents. I'm a pragmatist, so I think of the state of the practical. It's like, okay, well, what can I actually build that has commercial value or solves actually some discrete problem with some baseline of repeatability or verifiability?swyx [00:06:22]: There was a lot, and very, very interesting. I'm not irritated by it at all. Okay. As you know, I take a... There's a lot of anthropological view or linguistics view. And in linguistics, you don't want to be prescriptive. You want to be descriptive. Yeah. So you're a goals guy. That's the key word in your thing. And other people have other definitions that might involve like delegated trust or non-deterministic work, LLM in the loop, all that stuff. The other thing I was thinking about, just the comment on Baby AGI, LGBT. Yeah. In that piece that you just read, I was able to go through our backlog and just kind of track the winter of agents and then the summer now. Yeah. And it's... We can tell the whole story as an oral history, just following that thread. 
And it's really just like, I think, I tried to explain the why now, right? Like I had, there's better models, of course. There's better tool use with like, they're just more reliable. Yep. Better tools with MCP and all that stuff. And I'm sure you have opinions on that too. Business model shift, which you like a lot. I just heard you talk about RAS with MFM guys. Yep. Cost is dropping a lot. Yep. Inference is getting faster. There's more model diversity. Yep. Yep. I think it's a subtle point. It means that like, you have different models with different perspectives. You don't get stuck in the basin of performance of a single model. Sure. You can just get out of it by just switching models. Yep. Multi-agent research and RL fine tuning. So I just wanted to let you respond to like any of that.Dharmesh [00:07:44]: Yeah. A couple of things. Connecting the dots on the kind of the definition side of it. So we'll get the irritation out of the way completely. I have one more, even more irritating leap on the agent definition thing. So here's the way I think about it. By the way, the kind of word agent, I looked it up, like the English dictionary definition. The old school agent, yeah. Is when you have someone or something that does something on your behalf, like a travel agent or a real estate agent acts on your behalf. It's like proxy, which is a nice kind of general definition. So the other direction I'm sort of headed, and it's going to tie back to tool calling and MCP and things like that, is if you, and I'm not a biologist by any stretch of the imagination, but we have these single-celled organisms, right? Like the simplest possible form of what one would call life. But it's still life. It just happens to be single-celled. And then you can combine cells and then cells become specialized over time. And you have much more sophisticated organisms, you know, kind of further down the spectrum. In my mind, at the most fundamental level, you can almost think of having atomic agents. What is the simplest possible thing that's an agent that can still be called an agent? What is the equivalent of a kind of single-celled organism? And the reason I think that's useful is right now we're headed down the road, which I think is very exciting around tool use, right? That says, okay, the LLMs now can be provided a set of tools that it calls to accomplish whatever it needs to accomplish in the kind of furtherance of whatever goal it's trying to get done. And I'm not overly bothered by it, but if you think about it, if you just squint a little bit and say, well, what if everything was an agent? And what if tools were actually just atomic agents? Because then it's turtles all the way down, right? Then it's like, oh, well, all that's really happening with tool use is that we have a network of agents that know about each other through something like an MMCP and can kind of decompose a particular problem and say, oh, I'm going to delegate this to this set of agents. And why do we need to draw this distinction between tools, which are functions most of the time? And an actual agent. And so I'm going to write this irritating LinkedIn post, you know, proposing this. It's like, okay. And I'm not suggesting we should call even functions, you know, call them agents. But there is a certain amount of elegance that happens when you say, oh, we can just reduce it down to one primitive, which is an agent that you can combine in complicated ways to kind of raise the level of abstraction and accomplish higher order goals. 
Anyway, that's my answer. I'd say that's a success. Thank you for coming to my TED Talk on agent definitions.Alessio [00:09:54]: How do you define the minimum viable agent? Do you already have a definition for, like, where you draw the line between a cell and an atom? Yeah.Dharmesh [00:10:02]: So in my mind, it has to, at some level, use AI in order for it to—otherwise, it's just software. It's like, you know, we don't need another word for that. And so that's probably where I draw the line. So then the question, you know, the counterargument would be, well, if that's true, then lots of tools themselves are actually not agents because they're just doing a database call or a REST API call or whatever it is they're doing. And that does not necessarily qualify them, which is a fair counterargument. And I accept that. It's like a good argument. I still like to think about—because we'll talk about multi-agent systems, because I think—so we've accepted, which I think is true, lots of people have said it, and you've hopefully combined some of those clips of really smart people saying this is the year of agents, and I completely agree, it is the year of agents. But then shortly after that, it's going to be the year of multi-agent systems or multi-agent networks. I think that's where it's going to be headed next year. Yeah.swyx [00:10:54]: OpenAI's already on that. Yeah. My quick philosophical engagement with you on this. I often think about kind of the other spectrum, the other end of the cell spectrum. So single cell is life, multi-cell is life, and you clump a bunch of cells together in a more complex organism, they become organs, like an eye and a liver or whatever. And then obviously we consider ourselves one life form. There's not like a lot of lives within me. I'm just one life. And now, obviously, I don't think people don't really like to anthropomorphize agents and AI. Yeah. But we are extending our consciousness and our brain and our functionality out into machines. I just saw you wear a Bee. Yeah. Which is, you know, it's nice. I have a Limitless pendant in my pocket.Dharmesh [00:11:37]: I got one of these boys. Yeah.swyx [00:11:39]: I'm testing it all out. You know, got to be early adopters. But like, we want to extend our personal memory into these things so that we can be good at the things that we're good at. And, you know, machines are good at it. Machines are there. So like, my definition of life is kind of like going outside of my own body now. I don't know if you've ever had like reflections on that. Like how yours. How our self is like actually being distributed outside of you. Yeah.Dharmesh [00:12:01]: I don't fancy myself a philosopher. But you went there. So yeah, I did go there. I'm fascinated by kind of graphs and graph theory and networks and have been for a long, long time. And to me, we're sort of all nodes in this kind of larger thing. It just so happens that we're looking at individual kind of life forms as they exist right now. But so the idea is when you put a podcast out there, there's these little kind of nodes you're putting out there of like, you know, conceptual ideas. Once again, you have varying kind of forms of those little nodes that are up there and are connected in varying and sundry ways. And so I just think of myself as being a node in a massive, massive network. And I'm producing more nodes as I put content or ideas. 
And, you know, you spend some portion of your life collecting dots, experiences, people, and some portion of your life then connecting dots from the ones that you've collected over time. And I found that really interesting things happen and you really can't know in advance how those dots are necessarily going to connect in the future. And that's, yeah. So that's my philosophical take. That's the, yes, exactly. Coming back.Alessio [00:13:04]: Yep. Do you like graph as an agent abstraction? That's been one of the hot topics with LangGraph and Pydantic and all that.Dharmesh [00:13:11]: I do. The thing I'm more interested in terms of use of graphs, and there's lots of work happening on that now, is graph data stores as an alternative in terms of knowledge stores and knowledge graphs. Yeah. Because, you know, so I've been in software now 30 plus years, right? So it's not 10,000 hours. It's like 100,000 hours that I've spent doing this stuff. And so I grew up with, so back in the day, you know, I started on mainframes. There was a product called IMS from IBM, which is basically an index database, what we'd call like a key value store today. Then we've had relational databases, right? We have tables and columns and foreign key relationships. We all know that. We have document databases like MongoDB, which is sort of a nested structure keyed by a specific index. We have vector stores, vector embedding databases. And graphs are interesting for a couple of reasons. One is, so it's not classically structured in a relational way. When you say structured database, to most people, they're thinking tables and columns and in relational database and set theory and all that. Graphs still have structure, but it's not the tables and columns structure. And you could wonder, and people have made this case, that they are a better representation of knowledge for LLMs and for AI generally than other things. So that's kind of thing number one conceptually, and that might be true, I think is possibly true. And the other thing that I really like about that in the context of, you know, I've been in the context of data stores for RAG is, you know, RAG, you say, oh, I have a million documents, I'm going to build the vector embeddings, I'm going to come back with the top X based on the semantic match, and that's fine. All that's very, very useful. But the reality is something gets lost in the chunking process and the, okay, well, those tend, you know, like, you don't really get the whole picture, so to speak, and maybe not even the right set of dimensions on the kind of broader picture. And it makes intuitive sense to me that if we did capture it properly in a graph form, that maybe that feeding into a RAG pipeline will actually yield better results for some use cases, I don't know, but yeah.Alessio [00:15:03]: And do you feel like at the core of it, there's this difference between imperative and declarative programs? Because if you think about HubSpot, it's like, you know, people and graph kind of goes hand in hand, you know, but I think maybe the software before was more like primary and foreign key based relationships, versus now the models can traverse through the graph more easily.Dharmesh [00:15:22]: Yes. So I like that representation. There's something. It's just conceptually elegant about graphs and just from the representation of it, they're much more discoverable, you can kind of see it, there's observability to it, versus kind of embeddings, which you can't really do much with as a human. 
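A minimal sketch of the graph-plus-RAG idea above: retrieve chunks semantically as usual, then walk a knowledge graph around the entities those chunks mention to recover relationships that chunking loses. The graph contents and the `vector_search` / `entities_in` helpers are hypothetical, and networkx stands in for a real graph store.

```python
import networkx as nx

# Toy knowledge graph: nodes are entities, edges carry a relation label.
kg = nx.DiGraph()
kg.add_edge("HubSpot", "Dharmesh Shah", relation="founded_by")
kg.add_edge("Dharmesh Shah", "agent.ai", relation="created")
kg.add_edge("agent.ai", "MCP server", relation="exposes_tools_via")

def expand_with_graph(chunks, entities_in):
    """Add one-hop graph facts for every entity mentioned in the retrieved chunks."""
    facts = []
    for chunk in chunks:
        for entity in entities_in(chunk):
            if entity not in kg:
                continue
            for _, neighbor, data in kg.out_edges(entity, data=True):
                facts.append(f"{entity} -[{data['relation']}]-> {neighbor}")
    return chunks + facts

# Hypothetical usage: vector_search and entities_in would come from your RAG stack.
# context = expand_with_graph(vector_search("who built agent.ai?", k=3), entities_in)
```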
You know, once they're in there, you can't pull stuff back out. But yeah, I like that kind of idea of it. And the other thing that's kind of, because I love graphs, I've been long obsessed with PageRank from back in the early days. And, you know, it's one of the kind of simplest algorithms, and everyone's been exposed to PageRank. And the idea is that, and so I had this other idea for a project, not a company, and I have hundreds of these, called NodeRank, is to be able to take the idea of PageRank and apply it to an arbitrary graph that says, okay, I'm going to define what authority looks like and say, okay, well, that's interesting to me, because then if you say, I'm going to take my knowledge store, and maybe this person that contributed some number of chunks to the graph data store has more authority on this particular use case or prompt that's being submitted than this other one that may, or maybe this one was more popular, or maybe this one has, whatever it is, there should be a way for us to kind of rank nodes in a graph and sort them in some, some useful way. Yeah.swyx [00:16:34]: So I think that's generally useful for, for anything. I think the, the problem, like, so even though at my conferences, GraphRAG is super popular and people are getting knowledge graph religion, and I will say like, it's getting traction in two areas, conversation memory, and then also just RAG in general, like the, the, the document data. Yeah. It's like a source. Most ML practitioners would say that knowledge graph is kind of like a dirty word. The graph database, people get graph religion, everything's a graph, and then they, they go really hard into it and then they get a, they get a graph that is too complex to navigate. Yes. And so like the, the, the simple way to put it is like you, running HubSpot, know the power of graphs, the way that Google has pitched them for many years, but I don't suspect that HubSpot itself uses a knowledge graph. No. Yeah.Dharmesh [00:17:26]: So when is it over engineering? Basically? It's a great question. I don't know. So the question now, like in AI land, right, is the, do we necessarily need to understand? So right now, LLMs for, for the most part are somewhat black boxes, right? We sort of understand how the, you know, the algorithm itself works, but we really don't know what's going on in there and, and how things come out. So if a graph data store is able to produce the outcomes we want, it's like, here's a set of queries I want to be able to submit and then it comes out with useful content. Maybe the underlying data store is as opaque as a vector embeddings or something like that, but maybe it's fine. Maybe we don't necessarily need to understand it to get utility out of it. And so maybe if it's messy, that's okay. Um, that's, it's just another form of lossy compression. Uh, it's just lossy in a way that we just don't completely understand in terms of, because it's going to grow organically. Uh, and it's not structured. It's like, ah, we're just gonna throw a bunch of stuff in there. Let the, the equivalent of the embedding algorithm, whatever they call it in graph land. Um, so the one with the best results wins. I think so. Yeah.swyx [00:18:26]: Or is this the practical side of me is like, yeah, it's, if it's useful, we don't necessarilyDharmesh [00:18:30]: need to understand it.swyx [00:18:30]: I have, I mean, I'm happy to push back as long as you want. 
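The "NodeRank" idea, PageRank applied to an arbitrary knowledge graph with a chosen notion of authority, maps fairly directly onto personalized PageRank. A sketch with networkx; the graph, node names, and weighting are made up for illustration.

```python
import networkx as nx

# Edges point from an "endorser" (an author or a topic) to the chunk it lends authority to.
g = nx.DiGraph()
g.add_edges_from([
    ("author:alice", "chunk:pricing-faq"),
    ("topic:billing", "chunk:pricing-faq"),
    ("author:alice", "chunk:refund-policy"),
    ("topic:billing", "chunk:refund-policy"),
    ("author:bob", "chunk:old-draft"),
])

# "Define what authority looks like": restart the random walk from nodes relevant to the prompt.
scores = nx.pagerank(g, personalization={"topic:billing": 1.0})

for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {node}")
```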
Uh, it's not practical to evaluate like the 10 different options out there because it takes time. It takes people, it takes, you know, resources, right? Set. That's the first thing. Second thing is your evals are typically on small things and some things only work at scale. Yup. Like graphs. Yup.Dharmesh [00:18:46]: Yup. That's, yeah, no, that's fair. And I think this is one of the challenges in terms of implementation of graph databases is that the most common approach that I've seen developers do, I've done it myself, is that, oh, I've got a Postgres database or a MySQL or whatever. I can represent a graph with a very set of tables with a parent child thing or whatever. And that sort of gives me the ability, uh, why would I need anything more than that? And the answer is, well, if you don't need anything more than that, you don't need anything more than that. But there's a high chance that you're sort of missing out on the actual value that, uh, the graph representation gives you. Which is the ability to traverse the graph, uh, efficiently in ways that kind of going through the, uh, traversal in a relational database form, even though structurally you have the data, practically you're not gonna be able to pull it out in, in useful ways. Uh, so you wouldn't like represent a social graph, uh, in, in using that kind of relational table model. It just wouldn't scale. It wouldn't work.swyx [00:19:36]: Uh, yeah. Uh, I think we want to move on to MCP. Yeah. But I just want to, like, just engineering advice. Yeah. Uh, obviously you've, you've, you've run, uh, you've, you've had to do a lot of projects and run a lot of teams. Do you have a general rule for over-engineering or, you know, engineering ahead of time? You know, like, because people, we know premature engineering is the root of all evil. Yep. But also sometimes you just have to. Yep. When do you do it? Yes.Dharmesh [00:19:59]: It's a great question. This is, uh, a question as old as time almost, which is what's the right and wrong levels of abstraction. That's effectively what, uh, we're answering when we're trying to do engineering. I tend to be a pragmatist, right? So here's the thing. Um, lots of times doing something the right way. Yeah. It's like a marginal increased cost in those cases. Just do it the right way. And this is what makes a, uh, a great engineer or a good engineer better than, uh, a not so great one. It's like, okay, all things being equal. If it's going to take you, you know, roughly close to constant time anyway, might as well do it the right way. Like, so do things well, then the question is, okay, well, am I building a framework as the reusable library? To what degree, uh, what am I anticipating in terms of what's going to need to change in this thing? Uh, you know, along what dimension? And then I think like a business person in some ways, like what's the return on calories, right? So, uh, and you look at, um, energy, the expected value of it's like, okay, here are the five possible things that could happen, uh, try to assign probabilities like, okay, well, if there's a 50% chance that we're going to go down this particular path at some day, like, or one of these five things is going to happen and it costs you 10% more to engineer for that. It's basically, it's something that yields a kind of interest compounding value. Um, as you get closer to the time of, of needing that versus having to take on debt, which is when you under engineer it, you're taking on debt. 
You're going to have to pay off when you do get to that eventuality where something happens. One thing as a pragmatist, uh, so I would rather under engineer something than over engineer it. If I were going to err on the side of something, and here's the reason is that when you under engineer it, uh, yes, you take on tech debt, uh, but the interest rate is relatively known and payoff is very, very possible, right? Which is, oh, I took a shortcut here as a result of which now this thing that should have taken me a week is now going to take me four weeks. Fine. But if that particular thing that you thought might happen, never actually, you never have that use case transpire or just doesn't, it's like, well, you just save yourself time, right? And that has value because you were able to do other things instead of, uh, kind of slightly over-engineering it away, over-engineering it. But there's no perfect answers in this art form in terms of, uh, and yeah, we'll, we'll bring kind of this layers of abstraction back on the code generation conversation, which we'll, uh, I think I have later on, butAlessio [00:22:05]: I was going to ask, we can just jump ahead quickly. Yeah. Like, as you think about vibe coding and all that, how does the. Yeah. Percentage of potential usefulness change when I feel like we're over-engineering a lot of times it's like the investment in syntax, it's less about the investment in like architecting. Yep. Yeah. How does that change your calculus?Dharmesh [00:22:22]: A couple of things, right? One is, um, so, you know, going back to that kind of ROI or a return on calories, kind of calculus or heuristic you think through, it's like, okay, well, what is it going to cost me to put this layer of abstraction above the code that I'm writing now, uh, in anticipating kind of future needs. If the cost of fixing, uh, or doing under engineering right now, uh, will trend towards zero, that says, okay, well, I don't have to get it right right now because even if I get it wrong, I'll run the thing for six hours instead of 60 minutes or whatever. It doesn't really matter, right? Like, because that's going to trend towards zero, the ability to refactor code. Um, and because we're going to not that long from now, we're going to have, you know, large code bases be able to exist, uh, you know, as, as context, uh, for a code generation or a code refactoring, uh, model. So I think it's going to make it, uh, make the case for under engineering, uh, even stronger. Which is why I take on that cost. You just pay the interest when you get there, it's not, um, just go on with your life, vibe code it and, uh, come back when you need to. Yeah.Alessio [00:23:18]: Sometimes I feel like there's no decision-making in some things like, uh, today I built an autosave for like our internal notes platform and I literally just asked Cursor. Can you add autosave? Yeah. I don't know if it's over under engineer. Yep. I just vibe coded it. Yep. And I feel like at some point we're going to get to the point where the models kindDharmesh [00:23:36]: of decide where the right line is, but this is where the, like the, in my mind, the danger is, right? So there's two sides to this. One is the cost of kind of development and coding and things like that stuff that, you know, we talk about. 
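The "return on calories" heuristic above is essentially an expected-value comparison between paying a little now and maybe paying more later; a toy version with invented numbers:

```python
# Toy expected-value check for "generalize now vs. take on the debt", all numbers invented.
p_need_it_later = 0.2        # chance the anticipated use case ever materializes
extra_cost_now = 0.10        # +10% effort today to engineer ahead
rework_cost_later = 0.40     # +40% effort later if we under-engineered and it does happen

engineer_ahead = extra_cost_now
take_the_debt = p_need_it_later * rework_cost_later  # the expected "interest" on the shortcut

print(f"engineer ahead now: {engineer_ahead:.2f}")
print(f"under-engineer    : {take_the_debt:.2f}")
print("cheaper in expectation:", "engineer ahead" if engineer_ahead < take_the_debt else "under-engineer")
```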
But then like in your example, you know, one of the risks that we have is that because adding a feature, uh, like a save or whatever the feature might be to a product as that price tends towards zero, are we going to be less discriminant about what features we add as a result of making products more complicated, which has a negative impact on the user and a negative impact on the business. Um, and so that's the thing I worry about if it starts to become too easy, are we going to be. Too promiscuous in our, uh, kind of extension, adding product extensions and things like that. It's like, ah, why not add X, Y, Z or whatever back then it was like, oh, we only have so many engineering hours or story points or however you measure things. Uh, that at least kept us in check a little bit. Yeah.Alessio [00:24:22]: And then over engineering, you're like, yeah, it's kind of like you're putting that on yourself. Yeah. Like now it's like the models don't understand that if they add too much complexity, it's going to come back to bite them later. Yep. So they just do whatever they want to do. Yeah. And I'm curious where in the workflow that's going to be, where it's like, Hey, this is like the amount of complexity and over-engineering you can do before you got to ask me if we should actually do it versus like do something else.Dharmesh [00:24:45]: So you know, we've already, let's like, we're leaving this, uh, in the code generation world, this kind of compressed, um, cycle time. Right. It's like, okay, we went from auto-complete, uh, in GitHub Copilot to like, oh, finish this particular thing and hit tab to a, oh, I sort of know your file or whatever. I can write out a full function to you to now I can like hold a bunch of the context in my head. Uh, so we can do app generation, which we have now with Lovable and Bolt and Replit Agent. Yeah. And other things. So then the question is, okay, well, where does it naturally go from here? So we're going to generate products. Make sense. We might be able to generate platforms as though I want a platform for ERP that does this, whatever. And that includes the APIs, includes the product and the UI, and all the things that make for a platform. There's nothing that says we would stop like, okay, can you generate an entire software company someday? Right. Uh, with the platform and the monetization and the go-to-market and the whatever. And you know, that that's interesting to me in terms of, uh, you know, what, when you take it to almost ludicrous levels of abstraction.swyx [00:25:39]: It's like, okay, turn it to 11. You mentioned vibe coding, so I have to, this is a blog post I haven't written, but I'm kind of exploring it. Is the junior engineer dead?Dharmesh [00:25:49]: I don't think so. I think what will happen is that the junior engineer will be able to, if all they're bringing to the table is the fact that they are a junior engineer, then yes, they're likely dead. But hopefully if they can communicate with carbon-based life forms, they can interact with product, if they're willing to talk to customers, they can take their kind of basic understanding of engineering and how kind of software works. I think that has value. So I have a 14-year-old right now who's taking a Python programming class, and some people ask me, it's like, why is he learning coding? And my answer is, is because it's not about the syntax, it's not about the coding. What he's learning is like the fundamental thing of like how things work. 
And there's value in that. I think there's going to be timeless value in systems thinking and abstractions and what that means. And whether functions manifested as math, which he's going to get exposed to regardless, or there are some core primitives to the universe, I think, that the more you understand them, those are what I would kind of think of as like really large dots in your life that will have a higher gravitational pull and value to them that you'll then be able to. So I want him to collect those dots, and he's not resisting. So it's like, okay, while he's still listening to me, I'm going to have him do things that I think will be useful.swyx [00:26:59]: You know, part of one of the pitches that I evaluated for AI engineer is a term. And the term is that maybe the traditional interview path or career path of software engineer goes away, which is because what's the point of LeetCode? Yeah. And, you know, it actually matters more that you know how to work with AI and to implement the things that you want. Yep.Dharmesh [00:27:16]: That's one of the like interesting things that's happened with generative AI. You know, you go from machine learning and the models and just that underlying form, which is like true engineering, right? Like the actual, what I call real engineering. I don't think of myself as a real engineer, actually. I'm a developer. But now with generative AI. We call it AI and it's obviously got its roots in machine learning, but it just feels like fundamentally different to me. Like you have the vibe. It's like, okay, well, this is just a whole different approach to software development to so many different things. And so I'm wondering now, it's like an AI engineer is like, if you were like to draw the Venn diagram, it's interesting because the cross between like AI things, generative AI and what the tools are capable of, what the models do, and this whole new kind of body of knowledge that we're still building out, it's still very young, intersected with kind of classic engineering, software engineering. Yeah.swyx [00:28:04]: I just described the overlap as it separates out eventually until it's its own thing, but it's starting out as software. Yeah.Alessio [00:28:11]: That makes sense. So to close the vibe coding loop, the other big hype now is MCPs. Obviously, I would say Claude Desktop and Cursor are like the two main drivers of MCP usage. I would say my favorite is the Sentry MCP. I can pull in errors and then you can just put the context in Cursor. How do you think about that abstraction layer? Does it feel... Does it feel almost too magical in a way? Do you think it's like you get enough? Because you don't really see how the server itself is then kind of like repackaging the information for you?Dharmesh [00:28:41]: I think MCP as a standard is one of the better things that's happened in the world of AI because a standard needed to exist and absent a standard, there was a set of things that just weren't possible. Now, we can argue whether it's the best possible manifestation of a standard or not. Does it do too much? Does it do too little? I get that, but it's just simple enough to both be useful and unobtrusive. It's understandable and adoptable by mere mortals, right? It's not overly complicated. You know, a reasonable engineer can stand up an MCP server relatively easily. The thing that has me excited about it is like, so I'm a big believer in multi-agent systems. And so that's going back to our kind of this idea of an atomic agent. 
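On the point that "a reasonable engineer can stand up an MCP server relatively easily", here is a minimal sketch, assuming the official Python MCP SDK's FastMCP helper; the tool itself and its valuation heuristic are invented for illustration.

```python
# Minimal MCP server sketch, assuming the Python MCP SDK (pip install "mcp").
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("domain-tools")

@mcp.tool()
def estimate_domain_value(domain: str) -> str:
    """Return a rough, made-up valuation for a domain name."""
    # Placeholder heuristic only: short .com names score higher.
    score = 1000 + (5000 if domain.endswith(".com") else 0) + max(0, 15 - len(domain)) * 500
    return f"{domain}: estimated ${score:,}"

if __name__ == "__main__":
    # Serves over stdio, so MCP clients such as Claude Desktop or Cursor can connect.
    mcp.run()
```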
So imagine the MCP server, like obviously it calls tools, but the way I think about it, so I'm working on my current passion project is agent.ai. And we'll talk more about that in a little bit. More about the, I think we should, because I think it's interesting not to promote the project at all, but there's some interesting ideas in there. One of which is around, we're going to need a mechanism for, if agents are going to collaborate and be able to delegate, there's going to need to be some form of discovery and we're going to need some standard way. It's like, okay, well, I just need to know what this thing over here is capable of. We're going to need a registry, which Anthropic's working on. I'm sure others will and have been doing directories of, and there's going to be a standard around that too. How do you build out a directory of MCP servers? I think that's going to unlock so many things just because, and we're already starting to see it. So I think MCP or something like it is going to be the next major unlock because it allows systems that don't know about each other, don't need to, it's that kind of decoupling of like Sentry and whatever tools someone else was building. And it's not just about, you know, Claude Desktop or things like, even on the client side, I think we're going to see very interesting consumers of MCP, MCP clients versus just the chatbot-y kind of things. Like, you know, Claude Desktop and Cursor and things like that. But yeah, I'm very excited about MCP in that general direction.swyx [00:30:39]: I think the typical cynical developer take, it's like, we have OpenAPI. Yeah. What's the new thing? I don't know if you have a, do you have a quick MCP versus everything else? Yeah.Dharmesh [00:30:49]: So it's, so I like OpenAPI, right? So just a descriptive thing. It's OpenAPI. OpenAPI. Yes, that's what I meant. So it's basically a self-documenting thing. We can do machine-generated, lots of things from that output. It's a structured definition of an API. I get that, love it. But MCPs sort of are kind of use case specific. They're perfect for exactly what we're trying to use them for around LLMs in terms of discovery. It's like, okay, I don't necessarily need to know kind of all this detail. And so right now we have, we'll talk more about like MCP server implementations, but We will? I think, I don't know. Maybe we won't. At least it's in my head. It's like a back processor. But I do think MCP adds value above OpenAPI. It's, yeah, just because it solves this particular thing. And if we had come to the world, which we have, like, it's like, hey, we already have OpenAPI. It's like, if that were good enough for the universe, the universe would have adopted it already. There's a reason why MCP is taking off, because it marginally adds something that was missing before and doesn't go too far. And so that's why the kind of rate of adoption, you folks have written about this and talked about it. Yeah, why MCP won. Yeah. And it won because the universe decided that this was useful and maybe it gets supplanted by something else. Yeah. And maybe we discover, oh, maybe OpenAPI was good enough the whole time. I doubt that.swyx [00:32:09]: The meta lesson, this is, I mean, he's an investor in DevTools companies. I work in developer experience at DevRel in DevTools companies. Yep. Everyone wants to own the standard. Yeah. I'm sure you guys have tried to launch your own standards. Actually, HubSpot is known for a standard, you know, obviously inbound marketing. 
But is there a standard or protocol that you ever tried to push? No.Dharmesh [00:32:30]: And there's a reason for this. Yeah. Is that? And I don't mean to speak for the people of HubSpot, but I personally. You kind of do. I'm not smart enough. That's not the, like, I think I have a. You're smart. Not enough for that. I'm much better off understanding the standards that are out there. And I'm more on the composability side. Let's, like, take the pieces of technology that exist out there, combine them in creative, unique ways. And I like to consume standards. I don't like to, and that's not that I don't like to create them. I just don't think I have the, both the raw wattage or the credibility. It's like, okay, well, who the heck is Dharmesh, and why should we adopt a standard he created?swyx [00:33:07]: Yeah, I mean, there are people who don't monetize standards, like OpenTelemetry is a big standard, and LightStep never capitalized on that.Dharmesh [00:33:15]: So, okay, so if I were to do a standard, there's two things that have been in my head in the past. I was one around, a very, very basic one around, I don't even have the domain, I have a domain for everything, for open marketing. Because the issue we had in HubSpot grew up in the marketing space. There we go. There was no standard around data formats and things like that. It doesn't go anywhere. But the other one, and I did not mean to go here, but I'm going to go here. It's called OpenGraph. I know the term was already taken, but it hasn't been used for like 15 years now for its original purpose. But what I think should exist in the world is right now, our information, all of us, nodes are in the social graph at Meta or the professional graph at LinkedIn. Both of which are actually relatively closed in actually very annoying ways. Like very, very closed, right? Especially LinkedIn. Especially LinkedIn. I personally believe that if it's my data, and if I would get utility out of it being open, I should be able to make my data open or publish it in whatever forms that I choose, as long as I have control over it as opt-in. So the idea is around OpenGraph that says, here's a standard, here's a way to publish it. I should be able to go to OpenGraph.org slash Dharmesh dot JSON and get it back. And it's like, here's your stuff, right? And I can choose along the way and people can write to it and I can approve. And there can be an entire system. And if I were to do that, I would do it as a... Like a public benefit, non-profit-y kind of thing, as this is a contribution to society. I wouldn't try to commercialize that. Have you looked at atproto? What's that? Atproto.swyx [00:34:43]: It's the protocol behind Bluesky. Okay. My good friend, Dan Abramov, who was the face of React for many, many years, now works there. And he actually did a talk that I can send you, which basically kind of tries to articulate what you just said. But he does, he loves doing these like really great analogies, which I think you'll like. Like, you know, a lot of our data is behind a handle, behind a domain. Yep. So he's like, all right, what if we flip that? What if it was like our handle and then the domain? Yep. So, and that's really like your data should belong to you. Yep. And I should not have to wait 30 days for my Twitter data to export. Yep.Dharmesh [00:35:19]: you should be able to at least be able to automate it or do like, yes, I should be able to plug it into an agentic thing. Yeah. Yes. I think we're... Because so much of our data is... 
Locked up. I think the trick here isn't that standard. It is getting the normies to care.swyx [00:35:37]: Yeah. Because normies don't care.Dharmesh [00:35:38]: That's true. But building on that, normies don't care. So, you know, privacy is a really hot topic and an easy word to use, but it's not a binary thing. Like there are use cases where, and we make these choices all the time, that I will trade, not all privacy, but I will trade some privacy for some productivity gain or some benefit to me that says, oh, I don't care about that particular data being online if it gives me this in return, or I don't mind sharing this information with this company.Alessio [00:36:02]: If I'm getting, you know, this in return, but that sort of should be my option. I think now with computer use, you can actually automate some of the exports. Yes. Like something we've been doing internally is like everybody exports their LinkedIn connections. Yep. And then internally, we kind of merge them together to see how we can connect our companies to customers or things like that.Dharmesh [00:36:21]: And not to pick on LinkedIn, but since we're talking about it, but they feel strongly enough on the, you know, do not take LinkedIn data that they will block even browser use kind of things or whatever. They go to great, great lengths, even to see patterns of usage. And it says, oh, there's no way you could have, you know, gotten that particular thing or whatever without, and it's, so it's, there's...swyx [00:36:42]: Wasn't there a Supreme Court case that they lost? Yeah.Dharmesh [00:36:45]: So the one they lost was around someone that was scraping public data that was on the public internet. And that particular company had not signed any terms of service or whatever. It's like, oh, I'm just taking data that's on, there was no, and so that's why they won. But now, you know, the question is around, can LinkedIn... I think they can. Like, when you use, as a user, you use LinkedIn, you are signing up for their terms of service. And if they say, well, this kind of use of your LinkedIn account that violates our terms of service, they can shut your account down, right? They can. And they, yeah, so, you know, we don't need to make this a discussion. By the way, I love the company, don't get me wrong. I'm an avid user of the product. You know, I've got... Yeah, I mean, you've got over a million followers on LinkedIn, I think. Yeah, I do. And I've known people there for a long, long time, right? And I have lots of respect. And I understand even where the mindset originally came from of this kind of members-first approach to, you know, a privacy-first. I sort of get that. But sometimes you sort of have to wonder, it's like, okay, well, that was 15, 20 years ago. There's likely some controlled ways to expose some data on some member's behalf and not just completely be a binary. It's like, no, thou shalt not have the data.swyx [00:37:54]: Well, just pay for sales navigator.Alessio [00:37:57]: Before we move to the next layer of instruction, anything else on MCP you mentioned? Let's move back and then I'll tie it back to MCPs.Dharmesh [00:38:05]: So I think the... Open this with agent. Okay, so I'll start with... Here's my kind of running thesis, is that as AI and agents evolve, which they're doing very, very quickly, we're going to look at them more and more. I don't like to anthropomorphize. We'll talk about why this is not that. Less as just like raw tools and more like teammates. They'll still be software. 
They should self-disclose as being software. I'm totally cool with that. But I think what's going to happen is that in the same way you might collaborate with a team member on Slack or Teams or whatever you use, you can imagine a series of agents that do specific things just like a team member might do, that you can delegate things to. You can collaborate. You can say, hey, can you take a look at this? Can you proofread that? Can you try this? You can... Whatever it happens to be. So I think it is... I will go so far as to say it's inevitable that we're going to have hybrid teams someday. And what I mean by hybrid teams... So back in the day, hybrid teams were, oh, well, you have some full-time employees and some contractors. Then it was like hybrid teams are some people that are in the office and some that are remote. That's the kind of form of hybrid. The next form of hybrid is like the carbon-based life forms and agents and AI and some form of software. So let's say we temporarily stipulate that I'm right about that over some time horizon that eventually we're going to have these kind of digitally hybrid teams. So if that's true, then the question you sort of ask yourself is that then what needs to exist in order for us to get the full value of that new model? It's like, okay, well... You sort of need to... It's like, okay, well, how do I... If I'm building a digital team, like, how do I... Just in the same way, if I'm interviewing for an engineer or a designer or a PM, whatever, it's like, well, that's why we have professional networks, right? It's like, oh, they have a presence on likely LinkedIn. I can go through that semi-structured, structured form, and I can see the experience of whatever, you know, self-disclosed. But, okay, well, agents are going to need that someday. And so I'm like, okay, well, this seems like a thread that's worth pulling on. That says, okay. So I... So agent.ai is out there. And it's LinkedIn for agents. It's LinkedIn for agents. It's a professional network for agents. And the more I pull on that thread, it's like, okay, well, if that's true, like, what happens, right? It's like, oh, well, they have a profile just like anyone else, just like a human would. It's going to be a graph underneath, just like a professional network would be. It's just that... And you can have its, you know, connections and follows, and agents should be able to post. That's maybe how they do release notes. Like, oh, I have this new version. Whatever they decide to post, it should just be able to... Behave as a node on the network of a professional network. As it turns out, the more I think about that and pull on that thread, the more and more things, like, start to make sense to me. So it may be more than just a pure professional network. So my original thought was, okay, well, it's a professional network and agents as they exist out there, which I think there's going to be more and more of, will kind of exist on this network and have the profile. But then, and this is always dangerous, I'm like, okay, I want to see a world where thousands of agents are out there in order for the... Because those digital employees, the digital workers don't exist yet in any meaningful way. And so then I'm like, oh, can I make that easier for, like... And so I have, as one does, it's like, oh, I'll build a low-code platform for building agents. How hard could that be, right? Like, very hard, as it turns out. But it's been fun. So now, agent.ai has 1.3 million users. 
3,000 people have actually, you know, built some variation of an agent, sometimes just for their own personal productivity. About 1,000 of which have been published. And the reason this comes back to MCP for me, so imagine that and other networks, since I know agent.ai. So right now, we have an MCP server for agent.ai that exposes all the internally built agents that we have that do, like, super useful things. Like, you know, I have access to a Twitter API that I can subsidize the cost. And I can say, you know, if you're looking to build something for social media, these kinds of things, with a single API key, and it's all completely free right now, I'm funding it. That's a useful way for it to work. And then we have a developer to say, oh, I have this idea. I don't have to worry about open AI. I don't have to worry about, now, you know, this particular model is better. It has access to all the models with one key. And we proxy it kind of behind the scenes. And then expose it. So then we get this kind of community effect, right? That says, oh, well, someone else may have built an agent to do X. Like, I have an agent right now that I built for myself to do domain valuation for website domains because I'm obsessed with domains, right? And, like, there's no efficient market for domains. There's no Zillow for domains right now that tells you, oh, here are what houses in your neighborhood sold for. It's like, well, why doesn't that exist? We should be able to solve that problem. And, yes, you're still guessing. Fine. There should be some simple heuristic. So I built that. It's like, okay, well, let me go look for past transactions. You say, okay, I'm going to type in agent.ai, agent.com, whatever domain. What's it actually worth? I'm looking at buying it. It can go and say, oh, which is what it does. It's like, I'm going to go look at are there any published domain transactions recently that are similar, either use the same word, same top-level domain, whatever it is. And it comes back with an approximate value, and it comes back with its kind of rationale for why it picked the value and comparable transactions. Oh, by the way, this domain sold for published. Okay. So that agent now, let's say, existed on the web, on agent.ai. Then imagine someone else says, oh, you know, I want to build a brand-building agent for startups and entrepreneurs to come up with names for their startup. Like a common problem, every startup is like, ah, I don't know what to call it. And so they type in five random words that kind of define whatever their startup is. And you can do all manner of things, one of which is like, oh, well, I need to find the domain for it. What are possible choices? Now it's like, okay, well, it would be nice to know if there's an aftermarket price for it, if it's listed for sale. Awesome. Then imagine calling this valuation agent. It's like, okay, well, I want to find where the arbitrage is, where the agent valuation tool says this thing is worth $25,000. It's listed on GoDaddy for $5,000. It's close enough. Let's go do that. Right? And that's a kind of composition use case that in my future state. Thousands of agents on the network, all discoverable through something like MCP. And then you as a developer of agents have access to all these kind of Lego building blocks based on what you're trying to solve. Then you blend in orchestration, which is getting better and better with the reasoning models now. Just describe the problem that you have. 
Now, the next layer that we're all contending with is that how many tools can you actually give an LLM before the LLM breaks? That number used to be like 15 or 20 before performance kind of started to vary dramatically. And so that's the thing I'm thinking about now. It's like, okay, if I want to... If I want to expose 1,000 of these agents to a given LLM, obviously I can't give it all 1,000. Is there some intermediate layer that says, based on your prompt, I'm going to make a best guess at which agents might be able to be helpful for this particular thing? Yeah.Alessio [00:44:37]: Yeah, like RAG for tools. Yep. I did build the Latent Space Researcher on agent.ai. Okay. Nice. Yeah, that seems like, you know, then there's going to be a Latent Space Scheduler. And then once I schedule a research, you know, and you build all of these things. By the way, my apologies for the user experience. You realize I'm an engineer. It's pretty good.swyx [00:44:56]: I think it's a normie-friendly thing. Yeah. That's your magic. HubSpot does the same thing.Alessio [00:45:01]: Yeah, just to like quickly run through it. You can basically create all these different steps. And these steps are like, you know, static versus like variable-driven things. How did you decide between this kind of like low-code-ish versus doing, you know, low-code with code backend versus like not exposing that at all? Any fun design decisions? Yeah. And this is, I think...Dharmesh [00:45:22]: I think lots of people are likely sitting in exactly my position right now, coming through the choosing between deterministic and non-deterministic. Like if you're like in a business or building, you know, some sort of agentic thing, do you decide to do a deterministic thing? Or do you go non-deterministic and just let the LLM handle it, right, with the reasoning models? The original idea and the reason I took the low-code stepwise, a very deterministic approach. A, the reasoning models did not exist at that time. That's thing number one. Thing number two is if you can get... If you know in your head... If you know in your head what the actual steps are to accomplish whatever goal, why would you leave that to chance? There's no upside. There's literally no upside. Just tell me, like, what steps do you need executed? So right now what I'm playing with... So one thing we haven't talked about yet, and people don't talk about UI and agents. Right now, the primary interaction model... Or they don't talk enough about it. I know some people have. But it's like, okay, so we're used to the chatbot back and forth. Fine. I get that. But I think we're going to move to a blend of... Some of those things are going to be synchronous as they are now. But some are going to be... Some are going to be async. It's just going to put it in a queue, just like... And this goes back to my... Man, I talk fast. But I have this... I only have one other speed. It's even faster. So imagine it's like if you're working... So back to my, oh, we're going to have these hybrid digital teams. Like, you would not go to a co-worker and say, I'm going to ask you to do this thing, and then sit there and wait for them to go do it. Like, that's not how the world works. So it's nice to be able to just, like, hand something off to someone. It's like, okay, well, maybe I expect a response in an hour or a day or something like that.Dharmesh [00:46:52]: In terms of when things need to happen. So the UI around agents. 
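The "intermediate layer" raised at the top of this exchange, shortlisting a handful of likely-relevant agents per prompt rather than exposing all 1,000, is essentially retrieval over tool descriptions ("RAG for tools"). A sketch with a stand-in `embed` function; a real system would use an embedding model and a vector index, and the UI discussion picks back up right after.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def shortlist_tools(prompt, tool_catalog, embed, k=5):
    """Return the k tool names whose descriptions sit closest to the prompt.

    tool_catalog maps tool name -> natural-language description; embed is any
    text-embedding function (hypothetical here). Only the shortlist is then
    handed to the LLM as callable tools.
    """
    q = embed(prompt)
    scored = [(cosine(q, embed(desc)), name) for name, desc in tool_catalog.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

# Hypothetical usage:
# tools_for_llm = shortlist_tools("value this domain before I buy it", catalog, embed, k=5)
```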
So if you look at the output of agent.ai agents right now, they are the simplest possible manifestation of a UI, right? That says, oh, we have inputs of, like, four different types. Like, we've got a dropdown, we've got multi-select, all the things. It's like back in HTML, the original HTML 1.0 days, right? Like, you're the smallest possible set of primitives for a UI. And it just says, okay, because we need to collect some information from the user, and then we go do steps and do things. And generate some output, HTML or Markdown are the two primary examples. So the thing I've been asking myself, if I keep going down that path. So people ask me, I get requests all the time. It's like, oh, can you make the UI sort of boring? I need to be able to do this, right? And if I keep pulling on that, it's like, okay, well, now I've built an entire UI builder thing. Where does this end? And so I think the right answer, and this is what I'm going to be back coding once I get done here, is around injecting code generation, UI generation, into the agent.ai flow, right? As a builder, you're like, okay, I'm going to describe the thing that I want, much like you would do in a vibe coding world. But instead of generating the entire app, it's going to generate the UI that exists at some point in either that deterministic flow or something like that. It says, oh, here's the thing I'm trying to do. Go generate the UI for me. And I can go through some iterations. And what I think of it as a, so it's like, I'm going to generate the code, generate the code, tweak it, go through this kind of prompt style, like we do with vibe coding now. And at some point, I'm going to be happy with it. And I'm going to hit save. And that's going to become the action in that particular step. It's like a caching of the generated code, so that I don't then, like, incur any inference-time costs. It's just the actual code at that point.Alessio [00:48:29]: Yeah, I invested in a company called E2B, which does code sandboxes. And they powered the LM Arena web arena. So it's basically the, just like you do LMs, like text to text, they do the same for like UI generation. So if you're asking a model, how do you do it? But yeah, I think that's kind of where.Dharmesh [00:48:45]: That's the thing I'm really fascinated by. So the early LLMs, you know, were understandably, but laughably, bad at simple arithmetic, right? That's the thing, like my wife, normies, would ask us, like, you call this AI, like it can't, my son would be like, it's just stupid. It can't even do like simple arithmetic. And then like we've discovered over time that, and there's a reason for this, right? It's like, it's a large, there's, you know, the word language is in there for a reason in terms of what it's been trained on. It's not meant to do math, but now it's like, okay, well, the fact that it has access to a Python interpreter that I can actually call at runtime, that solves an entire body of problems that it wasn't trained to do. And it's basically a form of delegation. And so the thought that's kind of rattling around in my head is that that's great. So it's, it's like took the arithmetic problem and took it first. Now, like anything that's solvable through a relatively concrete Python program, it's able to do a bunch of things that I couldn't do before. Can we get to the same place with UI? I don't know what the future of UI looks like in an agentic AI world, but maybe let the LLM handle it, but not in the classic sense. 
Maybe it generates it on the fly, or maybe we go through some iterations and hit cache or something like that. So it's a little bit more predictable. Uh, I don't know, but yeah.Alessio [00:49:48]: And especially when is the human supposed to intervene? So, especially if you're composing them, most of them should not have a UI because then they're just web hooking to somewhere else. I just want to touch back. I don't know if you have more comments on this.swyx [00:50:01]: I was just going to ask when you, you said you got, you're going to go back to code. What
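The "hit save and cache the generated code as the step's action" idea from a few turns back is, mechanically, content-addressed caching of generated code; a toy sketch, where `generate_ui_code` stands in for the LLM call and is hypothetical:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("ui_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_ui_for_step(step_description: str, generate_ui_code) -> str:
    """Return UI code for a step, generating it at most once.

    After the builder is happy and hits "save", later runs serve the cached
    code directly and pay no further inference cost.
    """
    key = hashlib.sha256(step_description.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.html"
    if path.exists():
        return path.read_text()
    code = generate_ui_code(step_description)  # the expensive, one-time LLM call
    path.write_text(code)
    return code
```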

ZD Tech : tout comprendre en moins de 3 minutes avec ZDNet
What does the internet of AI agents look like?

ZD Tech : tout comprendre en moins de 3 minutes avec ZDNet

Play Episode Listen Later Mar 11, 2025 3:02


Today we're talking about a second "Cambrian explosion." Not, of course, the one that happened 500 million years ago, but the comparison is an interesting one. Back then, on the sea floor, a huge number of species of worms, shellfish, crustaceans, and mollusks, along with groups of animals that have since disappeared, such as trilobites and radiodonts, suddenly developed. The second Cambrian explosion promises just as much diversity: it is the explosion of artificial intelligence agents. And this sudden evolution of AI, expected as early as this year, could well redefine how we use artificial intelligence. Here's how and why. First, we are going to see a lightning-fast multiplication of AI agents. Generative AI has certainly had a meteoric rise, but today experts are talking about a Cambrian explosion of AI agents. These autonomous entities are designed to carry out specific tasks, often crossing different systems and computing infrastructures. A consortium of experts, including Cisco, LangChain, and Galileo, is nevertheless worried about a lack of interoperability between these agents. In other words, each one works in its own corner, and that could become a problem over time. To address the risk of fragmentation, these players have just set up an initiative with an almost unpronounceable name: the AGNTCY initiative. The idea is ultimately to create an internet of AI agents. This open-source collective aims to structure collaboration between AI agents, so that different AIs can communicate and cooperate effectively. Three tools have just been launched to get the work started: a directory of agents, a framework for building open agents, and a communication protocol. Ultimately, the founders want to reproduce a model inspired by the WordPress ecosystem. That platform's success rests in part on its huge directory of plugins, which let any website be enriched with new features in a few clicks. This new consortium wants to follow the same logic by creating an ecosystem where AI agents can be easily identified, installed, and interconnected. ZD Tech is available on every podcast platform. Subscribe! Hosted by Ausha. Visit ausha.co/politique-de-confidentialite for more information.

Microsoft Mechanics Podcast
Introducing Azure AI Foundry - Everything you need for AI development

Microsoft Mechanics Podcast

Play Episode Listen Later Mar 10, 2025 12:54 Transcription Available


Create agentic solutions quickly and efficiently with Azure AI Foundry. Choose the right models, ground your agents with knowledge, and seamlessly integrate AI into your development workflow—from early experimentation to production. Test, optimize, and deploy with built-in evaluation and management tools. See how to leverage the Azure AI Foundry SDK to code and orchestrate intelligent agents, monitor performance with tracing and assessments, and streamline DevOps with production-ready management. Yina Arenas, from the Azure AI Foundry team, shares its extensive capabilities as a unified platform that supports you throughout the entire AI development lifecycle. ► QUICK LINKS:  00:00 - Create agentic solutions with Azure AI Foundry 00:20 - Model catalog in Azure AI Foundry 02:15 - Experiment in the Azure AI Foundry playground 03:10 - Create and customize agents 04:13 - Assess and improve agents 05:58 - Monitor and manage apps 06:50 - Create a multi-agentic app in code 09:26 - Create a Sender agent 10:39 - How to connect orchestration logic 11:25 - Watch agents work 12:26 - Wrap up ► Link References Get started with Azure AI Foundry at https://ai.azure.com ► Unfamiliar with Microsoft Mechanics? As Microsoft's official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft. • Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries • Talk with other IT Pros, join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog • Watch or listen from anywhere, subscribe to our podcast: https://microsoftmechanics.libsyn.com/podcast ► Keep getting this insider knowledge, join us on social: • Follow us on Twitter: https://twitter.com/MSFTMechanics • Share knowledge on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/ • Enjoy us on Instagram: https://www.instagram.com/msftmechanics/ • Loosen up with us on TikTok: https://www.tiktok.com/@msftmechanics  

PodRocket - A web development podcast from LogRocket
LLMs for web developers with Roy Derks

PodRocket - A web development podcast from LogRocket

Play Episode Listen Later Mar 6, 2025 28:45


Roy Derks, Developer Experience at IBM, talks about the integration of Large Language Models (LLMs) in web development. We explore practical applications such as building agents, automating QA testing, and the evolving role of AI frameworks in software development. Links https://www.linkedin.com/in/gethackteam https://www.youtube.com/@gethackteam https://x.com/gethackteam https://hackteam.io We want to hear from you! How did you find us? Did you see us on Twitter? In a newsletter? Or maybe we were recommended by a friend? Let us know by sending an email to our producer, Emily, at emily.kochanekketner@logrocket.com (mailto:emily.kochanekketner@logrocket.com), or tweet at us at PodRocketPod (https://twitter.com/PodRocketpod). Follow us. Get free stickers. Follow us on Apple Podcasts, fill out this form (https://podrocket.logrocket.com/get-podrocket-stickers), and we'll send you free PodRocket stickers! What does LogRocket do? LogRocket provides AI-first session replay and analytics that surface the UX and technical issues impacting user experiences. Start understanding where your users are struggling by trying it for free at LogRocket.com. Try LogRocket for free today: https://logrocket.com/signup/?pdr Special Guest: Roy Derks.

This Week in Startups
AI Agents & the Future of Work with LangChain's Harrison Chase | AI Basics with Google Cloud

This Week in Startups

Play Episode Listen Later Mar 4, 2025 19:58


In this episode: Jason sits down with Harrison Chase, CEO of LangChain, to explore how AI-powered agents are transforming the way startups operate. They discuss the shift from traditional entry-level roles to AI-driven automation, the importance of human-in-the-loop systems, and the future of AI-powered assistants in business. Harrison shares insights on how companies like Replit, Klarna, and GitLab are leveraging AI agents to streamline operations, plus a look ahead at what's next for AI-driven workflows. Brought to you in partnership with Google Cloud.*Timestamps:(0:00) Introduction to Startup Basics series & Importance of AI in startups(2:04) Partnership with Google Cloud & Introducing Harrison Chase from LangChain(4:38) Evolution of entry-level jobs & Examples of AI agents in startups(8:00) Challenges & Future of AI agents in startups(14:24) AI agents in collaborative spaces & Non-developers creating AI agents(18:40) Closing remarks and where to learn more*Uncover more valuable insights from AI leaders in Google Cloud's 'Future of AI: Perspectives for Startups' report. Discover what 23 AI industry leaders think about the future of AI—and how it impacts your business. Read their perspectives here: https://goo.gle/futureofai*Check out all of the Startup Basics episodes here: https://thisweekinstartups.com/basics Check out Google Cloud: https://cloud.google.com/ Check out LangChain: https://www.langchain.com/*Follow Harrison: LinkedIn: https://www.linkedin.com/in/harrison-chase-961287118/ X: https://x.com/hwchase17*Follow Jason: X: https://twitter.com/Jason LinkedIn: https://www.linkedin.com/in/jasoncalacanis*Follow TWiST: Twitter: https://twitter.com/TWiStartups YouTube: https://www.youtube.com/thisweekin Instagram: https://www.instagram.com/thisweekinstartups TikTok: https://www.tiktok.com/@thisweekinstartups Substack: https://twistartups.substack.com

Machine Learning Street Talk
Sakana AI - Chris Lu, Robert Tjarko Lange, Cong Lu

Machine Learning Street Talk

Play Episode Listen Later Mar 1, 2025 97:54


We speak with Sakana AI, who are building nature-inspired methods that could fundamentally transform how we develop AI systems.

The guests include Chris Lu, a researcher who recently completed his DPhil at Oxford University under Prof. Jakob Foerster's supervision, where he focused on meta-learning and multi-agent systems. Chris is the first author of the DiscoPOP paper, which demonstrates how language models can discover and design better training algorithms. Also joining is Robert Tjarko Lange, a founding member of Sakana AI who specializes in evolutionary algorithms and large language models. Robert leads research at the intersection of evolutionary computation and foundation models, and is completing his PhD at TU Berlin on evolutionary meta-learning. The discussion also features Cong Lu, currently a Research Scientist at Google DeepMind's Open-Endedness team, who previously helped develop The AI Scientist and Intelligent Go-Explore.

SPONSOR MESSAGES:
CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments. Check out their super fast DeepSeek R1 hosting! https://centml.ai/pricing/
Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier focussed on o-series style reasoning and AGI. They are hiring a Chief Engineer and ML engineers. Events in Zurich. Go to https://tufalabs.ai/

* DiscoPOP - A framework where language models discover their own optimization algorithms
* EvoLLM - Using language models as evolution strategies for optimization
* The AI Scientist - A fully automated system that conducts scientific research end-to-end
* Neural Attention Memory Models (NAMMs) - Evolved memory systems that make transformers both faster and more accurate

TRANSCRIPT + REFS:
https://www.dropbox.com/scl/fi/gflcyvnujp8cl7zlv3v9d/Sakana.pdf?rlkey=woaoo82943170jd4yyi2he71c&dl=0

Robert Tjarko Lange: https://roberttlange.com/
Chris Lu: https://chrislu.page/
Cong Lu: https://www.conglu.co.uk/
Sakana: https://sakana.ai/blog/

TOC:
1. LLMs for Algorithm Generation and Optimization
[00:00:00] 1.1 LLMs generating algorithms for training other LLMs
[00:04:00] 1.2 Evolutionary black-box optim using neural network loss parameterization
[00:11:50] 1.3 DiscoPOP: Non-convex loss function for noisy data
[00:20:45] 1.4 External entropy injection for preventing model collapse
[00:26:25] 1.5 LLMs for black-box optimization using abstract numerical sequences

2. Model Learning and Generalization
[00:31:05] 2.1 Fine-tuning on teacher algorithm trajectories
[00:31:30] 2.2 Transformers learning gradient descent
[00:33:00] 2.3 LLM tokenization biases towards specific numbers
[00:34:50] 2.4 LLMs as evolution strategies for black box optimization
[00:38:05] 2.5 DiscoPOP: LLMs discovering novel optimization algorithms

3. AI Agents and System Architectures
[00:51:30] 3.1 ARC challenge: Induction vs. transformer approaches
[00:54:35] 3.2 LangChain / modular agent components
[00:57:50] 3.3 Debate improves LLM truthfulness
[01:00:55] 3.4 Time limits controlling AI agent systems
[01:03:00] 3.5 Gemini: Million-token context enables flatter hierarchies
[01:04:05] 3.6 Agents follow own interest gradients
[01:09:50] 3.7 Go-Explore algorithm: archive-based exploration
[01:11:05] 3.8 Foundation models for interesting state discovery
[01:13:00] 3.9 LLMs leverage prior game knowledge

4. AI for Scientific Discovery and Human Alignment
[01:17:45] 4.1 Encoding Alignment & Aesthetics via Reward Functions
[01:20:00] 4.2 AI Scientist: Automated Open-Ended Scientific Discovery
[01:24:15] 4.3 DiscoPOP: LLM for Preference Optimization Algorithms
[01:28:30] 4.4 Balancing AI Knowledge with Human Understanding
[01:33:55] 4.5 AI-Driven Conferences and Paper Review

Training Data
MongoDB's Sahir Azam: Vector Databases and the Data Structure of AI

Training Data

Play Episode Listen Later Feb 13, 2025 44:26


MongoDB product leader Sahir Azam explains how vector databases have evolved from semantic search to become the essential memory and state layer for AI applications. He describes his view of how AI is transforming software development generally, and how combining vectors, graphs and traditional data structures enables the high-quality retrieval needed for mission-critical enterprise AI use cases. Drawing from MongoDB's successful cloud transformation, Azam shares his vision for democratizing AI development by making sophisticated capabilities accessible to mainstream developers through integrated tools and abstractions.

Hosted by: Sonya Huang and Pat Grady, Sequoia Capital

Mentioned in this episode:
Introducing ambient agents: Blog post by LangChain on a new UX pattern where AI agents can listen to an event stream and act on it
Google Gemini Deep Research: Sahir enjoys its amazing product experience
Perplexity: AI search app that Sahir admires for its product craft
Snipd: AI-powered podcast app Sahir likes

Business of Tech
AI Performance Declines Under Pressure, UK Orders Apple Backdoor, and IT Jobs Face Automation Crisis

Business of Tech

Play Episode Listen Later Feb 12, 2025 15:06


Artificial intelligence agents are facing significant challenges when tasked with complex responsibilities, as highlighted by a recent study from LangChain. The research indicates that AI performance deteriorates under cognitive overload, with one model's effectiveness dropping to just 2% when managing more than seven domains. This finding emphasizes the need for businesses to design AI systems that can effectively manage complexity rather than assuming that AI can scale to handle human-like multitasking. The implications are particularly relevant for industries reliant on automation, such as customer service and IT operations.

In a related development, researchers from Stanford and the University of Washington have introduced a new AI reasoning model called S1, which can be trained at a fraction of the cost of existing high-end models. This innovation raises concerns about the commoditization of AI, as smaller teams can replicate sophisticated models with minimal resources. Meanwhile, Hugging Face has quickly developed an open-source AI research agent that aims to compete with OpenAI's offerings, showcasing the rapid advancements in AI capabilities and the importance of community contributions in this space.

The podcast also discusses the ongoing legal battles surrounding AI and copyright, particularly the recent ruling in favor of Thomson Reuters against Ross Intelligence. This case underscores the complexities of how AI tools are trained using copyrighted material and sets a precedent that could impact future AI developments. As AI continues to evolve, the legal landscape surrounding its use and the rights of content creators remains a critical area of concern.

Finally, the episode touches on the rising unemployment rate in the IT sector, attributed to the increasing influence of AI and automation. The data reveals a significant jump in unemployment among IT workers, with many routine jobs being automated rather than replaced. This shift highlights the need for IT professionals to upskill and adapt to the changing job market, focusing on areas such as AI integration and cybersecurity, as traditional software development roles decline.

Four things to know today:
00:00 AI Agents Struggle Under Pressure—More Complexity Means Less Accuracy
03:23 Big AI, Big Trouble? Low-Cost Models Challenge Industry Leaders
07:17 Apple Faces UK Encryption Fight, CISA's Role Strengthens, and AI Copyright Battle Heats Up
10:38 AI Disrupts IT Hiring: Fewer Software Jobs, More Automation, and Higher Unemployment

Supported by: https://www.huntress.com/mspradio/
Event: https://nerdiocon.com/
All our Sponsors: https://businessof.tech/sponsors/

Do you want the show on your podcast app or the written versions of the stories? Subscribe to the Business of Tech: https://www.businessof.tech/subscribe/
Looking for a link from the stories? The entire script of the show, with links to articles, is posted in each story on https://www.businessof.tech/
Support the show on Patreon: https://patreon.com/mspradio/
Want to be a guest on Business of Tech: Daily 10-Minute IT Services Insights? Send Dave Sobel a message on PodMatch, here: https://www.podmatch.com/hostdetailpreview/businessoftech

Want our stuff? Cool Merch? Wear "Why Do We Care?" - Visit https://mspradio.myspreadshop.com

Follow us on:
LinkedIn: https://www.linkedin.com/company/28908079/
YouTube: https://youtube.com/mspradio/
Facebook: https://www.facebook.com/mspradionews/
Instagram: https://www.instagram.com/mspradio/
TikTok: https://www.tiktok.com/@businessoftech
Bluesky: https://bsky.app/profile/businessof.tech

AI DAILY: Breaking News in AI
VANCE PUSHES BACK ON AI

AI DAILY: Breaking News in AI

Play Episode Listen Later Feb 12, 2025 3:25


Plus: AI Gets Confused By Too Many Tools

Like this? Get AIDAILY, delivered to your inbox, every weekday. Subscribe to our newsletter at https://aidaily.us

Vance Pushes Back Against Europe's AI Regulations
At the Paris AI Summit, Vice President JD Vance emphasized the U.S. commitment to leading in artificial intelligence, urging European nations to avoid heavy-handed regulations that could stifle innovation. He highlighted the need for international cooperation that fosters technological advancement rather than hindering it. This stance reflects the administration's "America First" approach to tech policy.

AI Agents: Present Yet Untrusted in Business
At the Wall Street Journal's CIO Network Summit, 61% of tech leaders said they're testing AI agents, but only 21% have fully adopted them. The main hang-up? Reliability concerns. Vendors like OpenAI and Microsoft are pushing for quicker adoption, but businesses are still on the fence.

AI Accelerationists vs. Big Tech: The Battle Over AI's Future
AI accelerationists are pushing for rapid AI development, arguing that slowing down could let other countries, like China, take the lead. On the flip side, big tech companies and some experts are calling for more guardrails and regulations to ensure AI's safe and ethical use. This debate is heating up as AI becomes more integrated into our daily lives.

LangChain's AI Agents Overloaded by Too Many Tools
LangChain tested its AI agents and found that piling on too many tools and instructions makes them glitchy. When agents were tasked with multiple responsibilities, their performance took a nosedive. This shows that current AI agents can't handle overload and need better design to juggle complex tasks.

Google's Super Assistant: The AI Revolution That Could Change Everything
Google's latest AI projects, Astra and Mariner, are pushing the boundaries of what's possible with autonomous assistants. These systems are designed to see, understand, and act in both digital and physical worlds, potentially transforming how we interact with technology in our daily lives.

AI Revolutionizes Material Discovery, Paving the Way for Next-Gen Super Materials
AI is transforming how we discover new materials by predicting and generating compounds with desired properties. Tools like Microsoft's MatterGen and MatterSim can design and validate novel materials, potentially leading to breakthroughs in energy storage, electronics, and medical devices. This shift accelerates innovation, making the search for advanced materials more efficient than ever.

Software Engineering Daily
LangChain and Agentic AI Engineering with Erick Friis

Software Engineering Daily

Play Episode Listen Later Feb 11, 2025 41:50


LangChain is a popular open-source framework for building applications that integrate LLMs with external data sources like APIs, databases, or custom knowledge bases. It's commonly used for chatbots, question-answering systems, and workflow automation. Its flexibility and extensibility have made it something of a standard for creating sophisticated AI-driven software. Erick Friis is a Founding Engineer at LangChain and joins the show to discuss the framework and agentic AI engineering.
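(Editor's note: here is a minimal sketch of the kind of pipeline described above: a prompt, a model, and an output parser chained together. It assumes the langchain-core and langchain-openai packages and an illustrative model name; it is an editorial illustration, not code from the episode.)

```python
# Minimal LangChain sketch: prompt -> model -> string output.
# Assumes `pip install langchain-core langchain-openai` and an
# OPENAI_API_KEY in the environment; the model name is illustrative.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the provided context."),
    ("human", "Context: {context}\n\nQuestion: {question}"),
])
llm = ChatOpenAI(model="gpt-4o-mini")

# LCEL: compose runnables with the | operator.
chain = prompt | llm | StrOutputParser()

answer = chain.invoke({
    "context": "LangChain is an open-source framework for LLM applications.",
    "question": "What is LangChain?",
})
print(answer)
```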

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Did you know that adding a simple Code Interpreter took o3 from 9.2% to 32% on FrontierMath? The Latent Space crew is hosting a hack night Feb 11th in San Francisco focused on CodeGen use cases, co-hosted with E2B and Edge AGI; watch E2B's new workshop and RSVP here! We're happy to announce that today's guest Samuel Colvin will be teaching his very first Pydantic AI workshop at the newly announced AI Engineer NYC Workshops day on Feb 22! 25 tickets left.

If you're a Python developer, it's very likely that you've heard of Pydantic. Every month, it's downloaded >300,000,000 times, making it one of the top 25 PyPI packages. OpenAI uses it in its SDK for structured outputs, it's at the core of FastAPI, and if you've followed our AI Engineer Summit conference, Jason Liu of Instructor has given two great talks about it: “Pydantic is all you need” and “Pydantic is STILL all you need”. Now, Samuel Colvin has raised $17M from Sequoia to turn Pydantic from an open source project into a full stack AI engineer platform with Logfire, their observability platform, and PydanticAI, their new agent framework.

Logfire: bringing OTEL to AI

OpenTelemetry recently merged Semantic Conventions for LLM workloads, which provides standard definitions to track performance like gen_ai.server.time_per_output_token. In Sam's view at least 80% of new apps being built today have some sort of LLM usage in them, and just like web observability platforms got replaced by cloud-first ones in the 2010s, Logfire wants to do the same for AI-first apps. If you're interested in the technical details, Logfire migrated away from Clickhouse to Datafusion for their backend. We spent some time on the importance of picking open source tools you understand and that you can actually contribute to upstream, rather than the more popular ones; listen in ~43:19 for that part.

Agents are the killer app for graphs

Pydantic AI is their attempt at taking a lot of the learnings that LangChain and the other early LLM frameworks had, and putting Python best practices into it. At an API level, it's very similar to the other libraries: you can call LLMs, create agents, do function calling, do evals, etc. They define an “Agent” as a container with a system prompt, tools, structured result, and an LLM. Under the hood, each Agent is now a graph of function calls that can orchestrate multi-step LLM interactions. You can start simple, then move toward fully dynamic graph-based control flow if needed.

“We were compelled enough by graphs once we got them right that our agent implementation [...]
is now actually a graph under the hood.”

Why Graphs?
* More natural for complex or multi-step AI workflows.
* Easy to visualize and debug with mermaid diagrams.
* Potential for distributed runs, or “waiting days” between steps in certain flows.

In parallel, you see folks like Emil Eifrem of Neo4j talk about GraphRAG as another place where graphs fit really well in the AI stack, so it might be time for more people to take them seriously.

Full Video Episode
Like and subscribe!

Chapters
* 00:00:00 Introductions
* 00:00:24 Origins of Pydantic
* 00:05:28 Pydantic's AI moment
* 00:08:05 Why build a new agents framework?
* 00:10:17 Overview of Pydantic AI
* 00:12:33 Becoming a believer in graphs
* 00:24:02 God Model vs Compound AI Systems
* 00:28:13 Why not build an LLM gateway?
* 00:31:39 Programmatic testing vs live evals
* 00:35:51 Using OpenTelemetry for AI traces
* 00:43:19 Why they don't use Clickhouse
* 00:48:34 Competing in the observability space
* 00:50:41 Licensing decisions for Pydantic and LogFire
* 00:51:48 Building Pydantic.run
* 00:55:24 Marimo and the future of Jupyter notebooks
* 00:57:44 London's AI scene

Show Notes
* Sam Colvin
* Pydantic
* Pydantic AI
* Logfire
* Pydantic.run
* Zod
* E2B
* Arize
* Langsmith
* Marimo
* Prefect
* GLA (Google Generative Language API)
* OpenTelemetry
* Jason Liu
* Sebastian Ramirez
* Bogomil Balkansky
* Hood Chatham
* Jeremy Howard
* Andrew Lamb

Transcript

Alessio [00:00:03]: Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.

Swyx [00:00:12]: Good morning. And today we're very excited to have Sam Colvin join us from Pydantic AI. Welcome. Sam, I heard that Pydantic is all we need. Is that true?

Samuel [00:00:24]: I would say you might need Pydantic AI and Logfire as well, but it gets you a long way, that's for sure.

Swyx [00:00:29]: Pydantic almost basically needs no introduction. It's almost 300 million downloads in December. And obviously, in the previous podcasts and discussions we've had with Jason Liu, he's been a big fan and promoter of Pydantic and AI.

Samuel [00:00:45]: Yeah, it's weird because obviously I didn't create Pydantic originally for uses in AI, it predates LLMs. But it's like we've been lucky that it's been picked up by that community and used so widely.

Swyx [00:00:58]: Actually, maybe we'll hear it. Right from you, what is Pydantic and maybe a little bit of the origin story?

Samuel [00:01:04]: The best name for it, which is not quite right, is a validation library. And we get some tension around that name because it doesn't just do validation, it will do coercion by default. We now have strict mode, so you can disable that coercion. But by default, if you say you want an integer field and you get in a string of 1, 2, 3, it will convert it to 123 and a bunch of other sensible conversions. And as you can imagine, the semantics around it. Exactly when you convert and when you don't, it's complicated, but because of that, it's more than just validation. Back in 2017, when I first started it, the different thing it was doing was using type hints to define your schema. That was controversial at the time. It was genuinely disapproved of by some people. I think the success of Pydantic and libraries like FastAPI that build on top of it means that today that's no longer controversial in Python. And indeed, lots of other people have copied that route, but yeah, it's a data validation library.
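(Editor's note: a minimal sketch of the coercion and strict-mode behaviour Samuel describes, assuming Pydantic v2; it is an illustration, not code from the episode.)

```python
# Pydantic v2: lax mode coerces "123" to 123; strict mode rejects it.
from pydantic import BaseModel, ConfigDict, ValidationError

class Order(BaseModel):
    quantity: int          # the type hint defines the schema

print(Order(quantity="123").quantity)   # -> 123 (coerced from str)

class StrictOrder(BaseModel):
    model_config = ConfigDict(strict=True)
    quantity: int

try:
    StrictOrder(quantity="123")
except ValidationError as exc:
    print(exc)                          # strict mode: int expected, got str
```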
It uses type hints for the for the most part and obviously does all the other stuff you want, like serialization on top of that. But yeah, that's the core.Alessio [00:02:06]: Do you have any fun stories on how JSON schemas ended up being kind of like the structure output standard for LLMs? And were you involved in any of these discussions? Because I know OpenAI was, you know, one of the early adopters. So did they reach out to you? Was there kind of like a structure output console in open source that people were talking about or was it just a random?Samuel [00:02:26]: No, very much not. So I originally. Didn't implement JSON schema inside Pydantic and then Sebastian, Sebastian Ramirez, FastAPI came along and like the first I ever heard of him was over a weekend. I got like 50 emails from him or 50 like emails as he was committing to Pydantic, adding JSON schema long pre version one. So the reason it was added was for OpenAPI, which is obviously closely akin to JSON schema. And then, yeah, I don't know why it was JSON that got picked up and used by OpenAI. It was obviously very convenient for us. That's because it meant that not only can you do the validation, but because Pydantic will generate you the JSON schema, it will it kind of can be one source of source of truth for structured outputs and tools.Swyx [00:03:09]: Before we dive in further on the on the AI side of things, something I'm mildly curious about, obviously, there's Zod in JavaScript land. Every now and then there is a new sort of in vogue validation library that that takes over for quite a few years and then maybe like some something else comes along. Is Pydantic? Is it done like the core Pydantic?Samuel [00:03:30]: I've just come off a call where we were redesigning some of the internal bits. There will be a v3 at some point, which will not break people's code half as much as v2 as in v2 was the was the massive rewrite into Rust, but also fixing all the stuff that was broken back from like version zero point something that we didn't fix in v1 because it was a side project. We have plans to move some of the basically store the data in Rust types after validation. Not completely. So we're still working to design the Pythonic version of it, in order for it to be able to convert into Python types. So then if you were doing like validation and then serialization, you would never have to go via a Python type we reckon that can give us somewhere between three and five times another three to five times speed up. That's probably the biggest thing. Also, like changing how easy it is to basically extend Pydantic and define how particular types, like for example, NumPy arrays are validated and serialized. But there's also stuff going on. And for example, Jitter, the JSON library in Rust that does the JSON parsing, has SIMD implementation at the moment only for AMD64. So we can add that. We need to go and add SIMD for other instruction sets. So there's a bunch more we can do on performance. I don't think we're going to go and revolutionize Pydantic, but it's going to continue to get faster, continue, hopefully, to allow people to do more advanced things. We might add a binary format like CBOR for serialization for when you'll just want to put the data into a database and probably load it again from Pydantic. So there are some things that will come along, but for the most part, it should just get faster and cleaner.Alessio [00:05:04]: From a focus perspective, I guess, as a founder too, how did you think about the AI interest rising? 
And then how do you kind of prioritize, okay, this is worth going into more, and we'll talk about Pydantic AI and all of that. What was maybe your early experience with LLAMP, and when did you figure out, okay, this is something we should take seriously and focus more resources on it?Samuel [00:05:28]: I'll answer that, but I'll answer what I think is a kind of parallel question, which is Pydantic's weird, because Pydantic existed, obviously, before I was starting a company. I was working on it in my spare time, and then beginning of 22, I started working on the rewrite in Rust. And I worked on it full-time for a year and a half, and then once we started the company, people came and joined. And it was a weird project, because that would never go away. You can't get signed off inside a startup. Like, we're going to go off and three engineers are going to work full-on for a year in Python and Rust, writing like 30,000 lines of Rust just to release open-source-free Python library. The result of that has been excellent for us as a company, right? As in, it's made us remain entirely relevant. And it's like, Pydantic is not just used in the SDKs of all of the AI libraries, but I can't say which one, but one of the big foundational model companies, when they upgraded from Pydantic v1 to v2, their number one internal model... The metric of performance is time to first token. That went down by 20%. So you think about all of the actual AI going on inside, and yet at least 20% of the CPU, or at least the latency inside requests was actually Pydantic, which shows like how widely it's used. So we've benefited from doing that work, although it didn't, it would have never have made financial sense in most companies. In answer to your question about like, how do we prioritize AI, I mean, the honest truth is we've spent a lot of the last year and a half building. Good general purpose observability inside LogFire and making Pydantic good for general purpose use cases. And the AI has kind of come to us. Like we just, not that we want to get away from it, but like the appetite, uh, both in Pydantic and in LogFire to go and build with AI is enormous because it kind of makes sense, right? Like if you're starting a new greenfield project in Python today, what's the chance that you're using GenAI 80%, let's say, globally, obviously it's like a hundred percent in California, but even worldwide, it's probably 80%. Yeah. And so everyone needs that stuff. And there's so much yet to be figured out so much like space to do things better in the ecosystem in a way that like to go and implement a database that's better than Postgres is a like Sisyphean task. Whereas building, uh, tools that are better for GenAI than some of the stuff that's about now is not very difficult. Putting the actual models themselves to one side.Alessio [00:07:40]: And then at the same time, then you released Pydantic AI recently, which is, uh, um, you know, agent framework and early on, I would say everybody like, you know, Langchain and like, uh, Pydantic kind of like a first class support, a lot of these frameworks, we're trying to use you to be better. What was the decision behind we should do our own framework? Were there any design decisions that you disagree with any workloads that you think people didn't support? Well,Samuel [00:08:05]: it wasn't so much like design and workflow, although I think there were some, some things we've done differently. Yeah. 
I think looking in general at the ecosystem of agent frameworks, the engineering quality is far below that of the rest of the Python ecosystem. There's a bunch of stuff that we have learned how to do over the last 20 years of building Python libraries and writing Python code that seems to be abandoned by people when they build agent frameworks. Now I can kind of respect that, particularly in the very first agent frameworks, like Langchain, where they were literally figuring out how to go and do this stuff. It's completely understandable that you would like basically skip some stuff.Samuel [00:08:42]: I'm shocked by the like quality of some of the agent frameworks that have come out recently from like well-respected names, which it just seems to be opportunism and I have little time for that, but like the early ones, like I think they were just figuring out how to do stuff and just as lots of people have learned from Pydantic, we were able to learn a bit from them. I think from like the gap we saw and the thing we were frustrated by was the production readiness. And that means things like type checking, even if type checking makes it hard. Like Pydantic AI, I will put my hand up now and say it has a lot of generics and you need to, it's probably easier to use it if you've written a bit of Rust and you really understand generics, but like, and that is, we're not claiming that that makes it the easiest thing to use in all cases, we think it makes it good for production applications in big systems where type checking is a no-brainer in Python. But there are also a bunch of stuff we've learned from maintaining Pydantic over the years that we've gone and done. So every single example in Pydantic AI's documentation is run on Python. As part of tests and every single print output within an example is checked during tests. So it will always be up to date. And then a bunch of things that, like I say, are standard best practice within the rest of the Python ecosystem, but I'm not followed surprisingly by some AI libraries like coverage, linting, type checking, et cetera, et cetera, where I think these are no-brainers, but like weirdly they're not followed by some of the other libraries.Alessio [00:10:04]: And can you just give an overview of the framework itself? I think there's kind of like the. LLM calling frameworks, there are the multi-agent frameworks, there's the workflow frameworks, like what does Pydantic AI do?Samuel [00:10:17]: I glaze over a bit when I hear all of the different sorts of frameworks, but I like, and I will tell you when I built Pydantic, when I built Logfire and when I built Pydantic AI, my methodology is not to go and like research and review all of the other things. I kind of work out what I want and I go and build it and then feedback comes and we adjust. So the fundamental building block of Pydantic AI is agents. The exact definition of agents and how you want to define them. is obviously ambiguous and our things are probably sort of agent-lit, not that we would want to go and rename them to agent-lit, but like the point is you probably build them together to build something and most people will call an agent. So an agent in our case has, you know, things like a prompt, like system prompt and some tools and a structured return type if you want it, that covers the vast majority of cases. There are situations where you want to go further and the most complex workflows where you want graphs and I resisted graphs for quite a while. 
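(Editor's note: a rough sketch of an agent along the lines Samuel describes, with a system prompt, one tool, and a structured return type. It is loosely based on early Pydantic AI releases; names such as result_type, tool_plain, and run_sync are assumptions that may differ between versions, so treat this as an illustration rather than canonical usage.)

```python
# Rough sketch of a Pydantic AI-style agent. Requires an OPENAI_API_KEY;
# parameter and method names reflect early releases and may have changed.
from pydantic import BaseModel
from pydantic_ai import Agent

class Weather(BaseModel):
    city: str
    temperature_c: float

agent = Agent(
    "openai:gpt-4o",                      # model identifier
    system_prompt="You are a concise weather assistant.",
    result_type=Weather,                  # structured return type
)

@agent.tool_plain
def lookup_temperature(city: str) -> float:
    """Hypothetical tool; a real app would call a weather API here."""
    return 21.5

result = agent.run_sync("What's the weather in London?")
print(result.data)                        # -> Weather(city='London', ...)
```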
I was sort of of the opinion you didn't need them and you could use standard like Python flow control to do all of that stuff. I had a few arguments with people, but I basically came around to, yeah, I can totally see why graphs are useful. But then we have the problem that by default, they're not type safe because if you have a like add edge method where you give the names of two different edges, there's no type checking, right? Even if you go and do some, I'm not, not all the graph libraries are AI specific. So there's a, there's a graph library called, but it allows, it does like a basic runtime type checking. Ironically using Pydantic to try and make up for the fact that like fundamentally that graphs are not typed type safe. Well, I like Pydantic, but it did, that's not a real solution to have to go and run the code to see if it's safe. There's a reason that starting type checking is so powerful. And so we kind of, from a lot of iteration eventually came up with a system of using normally data classes to define nodes where you return the next node you want to call and where we're able to go and introspect the return type of a node to basically build the graph. And so the graph is. Yeah. Inherently type safe. And once we got that right, I, I wasn't, I'm incredibly excited about graphs. I think there's like masses of use cases for them, both in gen AI and other development, but also software's all going to have interact with gen AI, right? It's going to be like web. There's no longer be like a web department in a company is that there's just like all the developers are building for web building with databases. The same is going to be true for gen AI.Alessio [00:12:33]: Yeah. I see on your docs, you call an agent, a container that contains a system prompt function. Tools, structure, result, dependency type model, and then model settings. Are the graphs in your mind, different agents? Are they different prompts for the same agent? What are like the structures in your mind?Samuel [00:12:52]: So we were compelled enough by graphs once we got them right, that we actually merged the PR this morning. That means our agent implementation without changing its API at all is now actually a graph under the hood as it is built using our graph library. So graphs are basically a lower level tool that allow you to build these complex workflows. Our agents are technically one of the many graphs you could go and build. And we just happened to build that one for you because it's a very common, commonplace one. But obviously there are cases where you need more complex workflows where the current agent assumptions don't work. And that's where you can then go and use graphs to build more complex things.Swyx [00:13:29]: You said you were cynical about graphs. What changed your mind specifically?Samuel [00:13:33]: I guess people kept giving me examples of things that they wanted to use graphs for. And my like, yeah, but you could do that in standard flow control in Python became a like less and less compelling argument to me because I've maintained those systems that end up with like spaghetti code. And I could see the appeal of this like structured way of defining the workflow of my code. And it's really neat that like just from your code, just from your type hints, you can get out a mermaid diagram that defines exactly what can go and happen.Swyx [00:14:00]: Right. Yeah. You do have very neat implementation of sort of inferring the graph from type hints, I guess. Yeah. Is what I would call it. Yeah. 
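(Editor's note: a simplified plain-Python illustration of the node-returns-next-node pattern discussed here. It is not the actual pydantic-graph API, just the underlying idea of reading the graph's edges from return type hints.)

```python
# Each node is a dataclass whose run() returns the next node (or End).
# A framework could introspect the return annotations to build and
# type-check the graph, e.g. to emit a mermaid diagram.
from __future__ import annotations
from dataclasses import dataclass
from typing import get_type_hints

@dataclass
class End:
    total: int

@dataclass
class AddTax:
    subtotal: int
    def run(self) -> End:
        return End(total=int(self.subtotal * 1.2))

@dataclass
class SumItems:
    items: list[int]
    def run(self) -> AddTax:
        return AddTax(subtotal=sum(self.items))

# "Call a node, get a node back, call that node; stop when you get an End."
node = SumItems(items=[10, 20, 30])
while not isinstance(node, End):
    node = node.run()
print(node.total)                                 # -> 72

# One edge recovered from the annotations:
print(get_type_hints(SumItems.run)["return"])     # -> AddTax
```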
I think the question always is I have gone back and forth. I used to work at Temporal where we would actually spend a lot of time complaining about graph based workflow solutions like AWS step functions. And we would actually say that we were better because you could use normal control flow that you already knew and worked with. Yours, I guess, is like a little bit of a nice compromise. Like it looks like normal Pythonic code. But you just have to keep in mind what the type hints actually mean. And that's what we do with the quote unquote magic that the graph construction does.Samuel [00:14:42]: Yeah, exactly. And if you look at the internal logic of actually running a graph, it's incredibly simple. It's basically call a node, get a node back, call that node, get a node back, call that node. If you get an end, you're done. We will add in soon support for, well, basically storage so that you can store the state between each node that's run. And then the idea is you can then distribute the graph and run it across computers. And also, I mean, the other weird, the other bit that's really valuable is across time. Because it's all very well if you look at like lots of the graph examples that like Claude will give you. If it gives you an example, it gives you this lovely enormous mermaid chart of like the workflow, for example, managing returns if you're an e-commerce company. But what you realize is some of those lines are literally one function calls another function. And some of those lines are wait six days for the customer to print their like piece of paper and put it in the post. And if you're writing like your demo. Project or your like proof of concept, that's fine because you can just say, and now we call this function. But when you're building when you're in real in real life, that doesn't work. And now how do we manage that concept to basically be able to start somewhere else in the in our code? Well, this graph implementation makes it incredibly easy because you just pass the node that is the start point for carrying on the graph and it continues to run. So it's things like that where I was like, yeah, I can just imagine how things I've done in the past would be fundamentally easier to understand if we had done them with graphs.Swyx [00:16:07]: You say imagine, but like right now, this pedantic AI actually resume, you know, six days later, like you said, or is this just like a theoretical thing we can go someday?Samuel [00:16:16]: I think it's basically Q&A. So there's an AI that's asking the user a question and effectively you then call the CLI again to continue the conversation. And it basically instantiates the node and calls the graph with that node again. Now, we don't have the logic yet for effectively storing state in the database between individual nodes that we're going to add soon. But like the rest of it is basically there.Swyx [00:16:37]: It does make me think that not only are you competing with Langchain now and obviously Instructor, and now you're going into sort of the more like orchestrated things like Airflow, Prefect, Daxter, those guys.Samuel [00:16:52]: Yeah, I mean, we're good friends with the Prefect guys and Temporal have the same investors as us. And I'm sure that my investor Bogomol would not be too happy if I was like, oh, yeah, by the way, as well as trying to take on Datadog. We're also going off and trying to take on Temporal and everyone else doing that. Obviously, we're not doing all of the infrastructure of deploying that right yet, at least. 
We're, you know, we're just building a Python library. And like what's crazy about our graph implementation is, sure, there's a bit of magic in like introspecting the return type, you know, extracting things from unions, stuff like that. But like the actual calls, as I say, is literally call a function and get back a thing and call that. It's like incredibly simple and therefore easy to maintain. The question is, how useful is it? Well, I don't know yet. I think we have to go and find out. We have a whole. We've had a slew of people joining our Slack over the last few days and saying, tell me how good Pydantic AI is. How good is Pydantic AI versus Langchain? And I refuse to answer. That's your job to go and find that out. Not mine. We built a thing. I'm compelled by it, but I'm obviously biased. The ecosystem will work out what the useful tools are.Swyx [00:17:52]: Bogomol was my board member when I was at Temporal. And I think I think just generally also having been a workflow engine investor and participant in this space, it's a big space. Like everyone needs different functions. I think the one thing that I would say like yours, you know, as a library, you don't have that much control of it over the infrastructure. I do like the idea that each new agents or whatever or unit of work, whatever you call that should spin up in this sort of isolated boundaries. Whereas yours, I think around everything runs in the same process. But you ideally want to sort of spin out its own little container of things.Samuel [00:18:30]: I agree with you a hundred percent. And we will. It would work now. Right. As in theory, you're just like as long as you can serialize the calls to the next node, you just have to all of the different containers basically have to have the same the same code. I mean, I'm super excited about Cloudflare workers running Python and being able to install dependencies. And if Cloudflare could only give me my invitation to the private beta of that, we would be exploring that right now because I'm super excited about that as a like compute level for some of this stuff where exactly what you're saying, basically. You can run everything as an individual. Like worker function and distribute it. And it's resilient to failure, et cetera, et cetera.Swyx [00:19:08]: And it spins up like a thousand instances simultaneously. You know, you want it to be sort of truly serverless at once. Actually, I know we have some Cloudflare friends who are listening, so hopefully they'll get in front of the line. Especially.Samuel [00:19:19]: I was in Cloudflare's office last week shouting at them about other things that frustrate me. I have a love-hate relationship with Cloudflare. Their tech is awesome. But because I use it the whole time, I then get frustrated. So, yeah, I'm sure I will. I will. I will get there soon.Swyx [00:19:32]: There's a side tangent on Cloudflare. Is Python supported at full? I actually wasn't fully aware of what the status of that thing is.Samuel [00:19:39]: Yeah. So Pyodide, which is Python running inside the browser in scripting, is supported now by Cloudflare. They basically, they're having some struggles working out how to manage, ironically, dependencies that have binaries, in particular, Pydantic. Because these workers where you can have thousands of them on a given metal machine, you don't want to have a difference. You basically want to be able to have a share. Shared memory for all the different Pydantic installations, effectively. That's the thing they work out. 
They're working out. But Hood, who's my friend, who is the primary maintainer of Pyodide, works for Cloudflare. And that's basically what he's doing, is working out how to get Python running on Cloudflare's network.Swyx [00:20:19]: I mean, the nice thing is that your binary is really written in Rust, right? Yeah. Which also compiles the WebAssembly. Yeah. So maybe there's a way that you'd build... You have just a different build of Pydantic and that ships with whatever your distro for Cloudflare workers is.Samuel [00:20:36]: Yes, that's exactly what... So Pyodide has builds for Pydantic Core and for things like NumPy and basically all of the popular binary libraries. Yeah. It's just basic. And you're doing exactly that, right? You're using Rust to compile the WebAssembly and then you're calling that shared library from Python. And it's unbelievably complicated, but it works. Okay.Swyx [00:20:57]: Staying on graphs a little bit more, and then I wanted to go to some of the other features that you have in Pydantic AI. I see in your docs, there are sort of four levels of agents. There's single agents, there's agent delegation, programmatic agent handoff. That seems to be what OpenAI swarms would be like. And then the last one, graph-based control flow. Would you say that those are sort of the mental hierarchy of how these things go?Samuel [00:21:21]: Yeah, roughly. Okay.Swyx [00:21:22]: You had some expression around OpenAI swarms. Well.Samuel [00:21:25]: And indeed, OpenAI have got in touch with me and basically, maybe I'm not supposed to say this, but basically said that Pydantic AI looks like what swarms would become if it was production ready. So, yeah. I mean, like, yeah, which makes sense. Awesome. Yeah. I mean, in fact, it was specifically saying, how can we give people the same feeling that they were getting from swarms that led us to go and implement graphs? Because my, like, just call the next agent with Python code was not a satisfactory answer to people. So it was like, okay, we've got to go and have a better answer for that. It's not like, let us to get to graphs. Yeah.Swyx [00:21:56]: I mean, it's a minimal viable graph in some sense. What are the shapes of graphs that people should know? So the way that I would phrase this is I think Anthropic did a very good public service and also kind of surprisingly influential blog post, I would say, when they wrote Building Effective Agents. We actually have the authors coming to speak at my conference in New York, which I think you're giving a workshop at. Yeah.Samuel [00:22:24]: I'm trying to work it out. But yes, I think so.Swyx [00:22:26]: Tell me if you're not. yeah, I mean, like, that was the first, I think, authoritative view of, like, what kinds of graphs exist in agents and let's give each of them a name so that everyone is on the same page. So I'm just kind of curious if you have community names or top five patterns of graphs.Samuel [00:22:44]: I don't have top five patterns of graphs. I would love to see what people are building with them. But like, it's been it's only been a couple of weeks. And of course, there's a point is that. Because they're relatively unopinionated about what you can go and do with them. They don't suit them. Like, you can go and do lots of lots of things with them, but they don't have the structure to go and have like specific names as much as perhaps like some other systems do. 
I think what our agents are, which have a name and I can't remember what it is, but this basically system of like, decide what tool to call, go back to the center, decide what tool to call, go back to the center and then exit. One form of graph, which, as I say, like our agents are effectively one implementation of a graph, which is why under the hood they are now using graphs. And it'll be interesting to see over the next few years whether we end up with these like predefined graph names or graph structures or whether it's just like, yep, I built a graph or whether graphs just turn out not to match people's mental image of what they want and die away. We'll see.Swyx [00:23:38]: I think there is always appeal. Every developer eventually gets graph religion and goes, oh, yeah, everything's a graph. And then they probably over rotate and go go too far into graphs. And then they have to learn a whole bunch of DSLs. And then they're like, actually, I didn't need that. I need this. And they scale back a little bit.Samuel [00:23:55]: I'm at the beginning of that process. I'm currently a graph maximalist, although I haven't actually put any into production yet. But yeah.Swyx [00:24:02]: This has a lot of philosophical connections with other work coming out of UC Berkeley on compounding AI systems. I don't know if you know of or care. This is the Gartner world of things where they need some kind of industry terminology to sell it to enterprises. I don't know if you know about any of that.Samuel [00:24:24]: I haven't. I probably should. I should probably do it because I should probably get better at selling to enterprises. But no, no, I don't. Not right now.Swyx [00:24:29]: This is really the argument is that instead of putting everything in one model, you have more control and more maybe observability to if you break everything out into composing little models and changing them together. And obviously, then you need an orchestration framework to do that. Yeah.Samuel [00:24:47]: And it makes complete sense. And one of the things we've seen with agents is they work well when they work well. But when they. Even if you have the observability through log five that you can see what was going on, if you don't have a nice hook point to say, hang on, this is all gone wrong. You have a relatively blunt instrument of basically erroring when you exceed some kind of limit. But like what you need to be able to do is effectively iterate through these runs so that you can have your own control flow where you're like, OK, we've gone too far. And that's where one of the neat things about our graph implementation is you can basically call next in a loop rather than just running the full graph. And therefore, you have this opportunity to to break out of it. But yeah, basically, it's the same point, which is like if you have two bigger unit of work to some extent, whether or not it involves gen AI. But obviously, it's particularly problematic in gen AI. You only find out afterwards when you've spent quite a lot of time and or money when it's gone off and done done the wrong thing.Swyx [00:25:39]: Oh, drop on this. We're not going to resolve this here, but I'll drop this and then we can move on to the next thing. This is the common way that we we developers talk about this. And then the machine learning researchers look at us. And laugh and say, that's cute. And then they just train a bigger model and they wipe us out in the next training run. 
So I think there's a certain amount of we are fighting the bitter lesson here. We're fighting AGI. And, you know, when AGI arrives, this will all go away. Obviously, on Latent Space, we don't really discuss that because I think AGI is kind of this hand wavy concept that isn't super relevant. But I think we have to respect that. For example, you could do a chain of thoughts with graphs and you could manually orchestrate a nice little graph that does like. Reflect, think about if you need more, more inference time, compute, you know, that's the hot term now. And then think again and, you know, scale that up. Or you could train Strawberry and DeepSeq R1. Right.Samuel [00:26:32]: I saw someone saying recently, oh, they were really optimistic about agents because models are getting faster exponentially. And I like took a certain amount of self-control not to describe that it wasn't exponential. But my main point was. If models are getting faster as quickly as you say they are, then we don't need agents and we don't really need any of these abstraction layers. We can just give our model and, you know, access to the Internet, cross our fingers and hope for the best. Agents, agent frameworks, graphs, all of this stuff is basically making up for the fact that right now the models are not that clever. In the same way that if you're running a customer service business and you have loads of people sitting answering telephones, the less well trained they are, the less that you trust them, the more that you need to give them a script to go through. Whereas, you know, so if you're running a bank and you have lots of customer service people who you don't trust that much, then you tell them exactly what to say. If you're doing high net worth banking, you just employ people who you think are going to be charming to other rich people and set them off to go and have coffee with people. Right. And the same is true of models. The more intelligent they are, the less we need to tell them, like structure what they go and do and constrain the routes in which they take.Swyx [00:27:42]: Yeah. Yeah. Agree with that. So I'm happy to move on. So the other parts of Pydantic AI that are worth commenting on, and this is like my last rant, I promise. So obviously, every framework needs to do its sort of model adapter layer, which is, oh, you can easily swap from OpenAI to Cloud to Grok. You also have, which I didn't know about, Google GLA, which I didn't really know about until I saw this in your docs, which is generative language API. I assume that's AI Studio? Yes.Samuel [00:28:13]: Google don't have good names for it. So Vertex is very clear. That seems to be the API that like some of the things use, although it returns 503 about 20% of the time. So... Vertex? No. Vertex, fine. But the... Oh, oh. GLA. Yeah. Yeah.Swyx [00:28:28]: I agree with that.Samuel [00:28:29]: So we have, again, another example of like, well, I think we go the extra mile in terms of engineering is we run on every commit, at least commit to main, we run tests against the live models. Not lots of tests, but like a handful of them. Oh, okay. And we had a point last week where, yeah, GLA is a little bit better. GLA1 was failing every single run. One of their tests would fail. And we, I think we might even have commented out that one at the moment. So like all of the models fail more often than you might expect, but like that one seems to be particularly likely to fail. 
But Vertex is the same API, but much more reliable.Swyx [00:29:01]: My rant here is that, you know, versions of this appear in Langchain and every single framework has to have its own little thing, a version of that. I would put to you, and then, you know, this is, this can be agree to disagree. This is not needed in Pydantic AI. I would much rather you adopt a layer like Lite LLM or what's the other one in JavaScript port key. And that's their job. They focus on that one thing and they, they normalize APIs for you. All new models are automatically added and you don't have to duplicate this inside of your framework. So for example, if I wanted to use deep seek, I'm out of luck because Pydantic AI doesn't have deep seek yet.Samuel [00:29:38]: Yeah, it does.Swyx [00:29:39]: Oh, it does. Okay. I'm sorry. But you know what I mean? Should this live in your code or should it live in a layer that's kind of your API gateway that's a defined piece of infrastructure that people have?Samuel [00:29:49]: And I think if a company who are well known, who are respected by everyone had come along and done this at the right time, maybe we should have done it a year and a half ago and said, we're going to be the universal AI layer. That would have been a credible thing to do. I've heard varying reports of Lite LLM is the truth. And it didn't seem to have exactly the type safety that we needed. Also, as I understand it, and again, I haven't looked into it in great detail. Part of their business model is proxying the request through their, through their own system to do the generalization. That would be an enormous put off to an awful lot of people. Honestly, the truth is I don't think it is that much work unifying the model. I get where you're coming from. I kind of see your point. I think the truth is that everyone is centralizing around open AIs. Open AI's API is the one to do. So DeepSeq support that. Grok with OK support that. Ollama also does it. I mean, if there is that library right now, it's more or less the open AI SDK. And it's very high quality. It's well type checked. It uses Pydantic. So I'm biased. But I mean, I think it's pretty well respected anyway.Swyx [00:30:57]: There's different ways to do this. Because also, it's not just about normalizing the APIs. You have to do secret management and all that stuff.Samuel [00:31:05]: Yeah. And there's also. There's Vertex and Bedrock, which to one extent or another, effectively, they host multiple models, but they don't unify the API. But they do unify the auth, as I understand it. Although we're halfway through doing Bedrock. So I don't know about it that well. But they're kind of weird hybrids because they support multiple models. But like I say, the auth is centralized.Swyx [00:31:28]: Yeah, I'm surprised they don't unify the API. That seems like something that I would do. You know, we can discuss all this all day. There's a lot of APIs. I agree.Samuel [00:31:36]: It would be nice if there was a universal one that we didn't have to go and build.Alessio [00:31:39]: And I guess the other side of, you know, routing model and picking models like evals. How do you actually figure out which one you should be using? I know you have one. First of all, you have very good support for mocking in unit tests, which is something that a lot of other frameworks don't do. So, you know, my favorite Ruby library is VCR because it just, you know, it just lets me store the HTTP requests and replay them. That part I'll kind of skip. 
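(Editor's note: the point above about providers centralizing on OpenAI's API is commonly exploited by pointing the official OpenAI Python SDK at a compatible server via base_url; the endpoint and model names below are illustrative assumptions, not recommendations from the episode.)

```python
# Sketch: reusing the OpenAI Python SDK against an OpenAI-compatible server.
# The base_url and model name are illustrative; check your provider's docs
# (DeepSeek, Ollama, etc. document their own endpoints and model names).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # e.g. a local Ollama server
    api_key="not-needed-for-local",        # many local servers ignore this
)

resp = client.chat.completions.create(
    model="llama3.1",                      # whatever the server exposes
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```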
I think you are busy like this test model. We're like just through Python. You try and figure out what the model might respond without actually calling the model. And then you have the function model where people can kind of customize outputs. Any other fun stories maybe from there? Or is it just what you see is what you get, so to speak?Samuel [00:32:18]: On those two, I think what you see is what you get. On the evals, I think watch this space. I think it's something that like, again, I was somewhat cynical about for some time. Still have my cynicism about some of the well, it's unfortunate that so many different things are called evals. It would be nice if we could agree. What they are and what they're not. But look, I think it's a really important space. I think it's something that we're going to be working on soon, both in Pydantic AI and in LogFire to try and support better because it's like it's an unsolved problem.Alessio [00:32:45]: Yeah, you do say in your doc that anyone who claims to know for sure exactly how your eval should be defined can safely be ignored.Samuel [00:32:52]: We'll delete that sentence when we tell people how to do their evals.Alessio [00:32:56]: Exactly. I was like, we need we need a snapshot of this today. And so let's talk about eval. So there's kind of like the vibe. Yeah. So you have evals, which is what you do when you're building. Right. Because you cannot really like test it that many times to get statistical significance. And then there's the production eval. So you also have LogFire, which is kind of like your observability product, which I tried before. It's very nice. What are some of the learnings you've had from building an observability tool for LEMPs? And yeah, as people think about evals, even like what are the right things to measure? What are like the right number of samples that you need to actually start making decisions?Samuel [00:33:33]: I'm not the best person to answer that is the truth. So I'm not going to come in here and tell you that I think I know the answer on the exact number. I mean, we can do some back of the envelope statistics calculations to work out that like having 30 probably gets you most of the statistical value of having 200 for, you know, by definition, 15% of the work. But the exact like how many examples do you need? For example, that's a much harder question to answer because it's, you know, it's deep within the how models operate in terms of LogFire. One of the reasons we built LogFire the way we have and we allow you to write SQL directly against your data and we're trying to build the like powerful fundamentals of observability is precisely because we know we don't know the answers. And so allowing people to go and innovate on how they're going to consume that stuff and how they're going to process it is we think that's valuable. Because even if we come along and offer you an evals framework on top of LogFire, it won't be right in all regards. And we want people to be able to go and innovate and being able to write their own SQL connected to the API. And effectively query the data like it's a database with SQL allows people to innovate on that stuff. And that's what allows us to do it as well. I mean, we do a bunch of like testing what's possible by basically writing SQL directly against LogFire as any user could. I think the other the other really interesting bit that's going on in observability is OpenTelemetry is centralizing around semantic attributes for GenAI. So it's a relatively new project. 
A lot of it's still being added at the moment. But basically the idea is that, like, they unify how both SDKs and/or agent frameworks send observability data to any OpenTelemetry endpoint. And so, again, having that unification allows us to go and, like, basically compare different libraries, compare different models much better. That stuff's in a very, like, early stage of development. One of the things we're going to be working on pretty soon is basically, I suspect, Pydantic AI will be the first agent framework that implements those semantic attributes properly. Because, again, we control it and we can say this is important for observability, whereas most of the other agent frameworks are not maintained by people who are trying to do observability. With the exception of LangChain, where they have the observability platform, but they chose not to go down the OpenTelemetry route. So they're, like, plowing their own furrow. And, you know, they're even further away from standardization.Alessio [00:35:51]: Can you maybe just give a quick overview of how OTel ties into the AI workflows? There's kind of like the question of, is, you know, a trace, and a span, like an LLM call? Is it the agent? Is it kind of like the broader thing you're tracking? How should people think about it?Samuel [00:36:06]: Yeah, so they have a PR that I think may have now been merged from someone at IBM talking about remote agents and trying to support this concept of remote agents within GenAI. I'm not particularly compelled by that because I don't think that like that's actually by any means the common use case. But like, I suppose it's fine for it to be there. The majority of the stuff in OTel is basically defining how you would instrument a given call to an LLM. So basically the actual LLM call, what data you would send to your telemetry provider, how you would structure that. Apart from this slightly odd stuff on remote agents, most of the, like, agent-level consideration is not yet implemented, is not yet decided, effectively. And so there's a bit of ambiguity. Obviously, what's good about OTel is you can in the end send whatever attributes you like. But yeah, there's quite a lot of churn in that space and exactly how we store the data. I think that one of the most interesting things, though, is that if you think about observability, traditionally, sure, everyone would say our observability data is very important, we must keep it safe. But actually, companies work very hard to basically not have anything that sensitive in their observability data. So if you're a doctor in a hospital and you search for a drug for an STI, the SQL might be sent to the observability provider, but none of the parameters would. It wouldn't have the patient number or their name or the drug. With GenAI, that distinction doesn't exist because it's all just mixed up in the text. If you have that same patient asking an LLM what drug they should take or how to stop smoking, you can't extract the PII and not send it to the observability platform. So the sensitivity of the data that's going to end up in observability platforms is going to be, like, basically a different order of magnitude to what you would normally send to Datadog. Of course, you can make a mistake and send someone's password or their card number to Datadog. But that would be seen as a, like, mistake. Whereas in GenAI, a lot of that data is going to be sent.
And I think that's why companies like LangSmith are trying hard to offer observability on-prem, because there's a bunch of companies who are happy for Datadog to be cloud hosted, but want self-hosting for this observability stuff with GenAI.Alessio [00:38:09]: And are you doing any of that today? Because I know in each of the spans you have like the number of tokens, you have the context, you're just storing everything. And then you're going to offer kind of like a self-hosting for the platform, basically. Yeah. Yeah.Samuel [00:38:23]: So we have scrubbing roughly equivalent to what the other observability platforms have. So if we, you know, if we see password as the key, we won't send the value. But like, like I said, that doesn't really work in GenAI. So we're accepting we're going to have to store a lot of data and then we'll offer self-hosting for those people who can afford it and who need it.
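To illustrate why that kind of key-based scrubbing breaks down for GenAI payloads, here is a minimal sketch; the key list and behaviour are invented for illustration, not LogFire's actual implementation.

```python
# Sketch: classic observability scrubbing redacts values whose *keys* look
# sensitive, but it cannot see PII embedded inside free-form prompt text.
SENSITIVE_KEYS = {"password", "api_key", "card_number", "patient_id"}

def scrub(attributes: dict) -> dict:
    scrubbed = {}
    for key, value in attributes.items():
        scrubbed[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
    return scrubbed

span_attrs = {
    "password": "hunter2",  # caught: the key matches
    "prompt": "I'm patient 12345, what dose of drug X should I take?",  # not caught
}
print(scrub(span_attrs))
```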
Alessio [00:38:42]: And then this is, I think, the first time that most of the workload's performance is depending on a third party. You know, like if you're looking at Datadog data, usually it's your app that is driving the latency and like the memory usage and all of that. Here you're going to have spans that maybe take a long time to perform because the LLM API is not working or because OpenAI is kind of like overwhelmed. Do you do anything there, since like the provider is almost like the same across customers? You know, like, are you trying to surface these things for people and say, hey, this was like a very slow span, but actually all customers using OpenAI right now are seeing the same thing, so maybe don't worry about it, or?Samuel [00:39:20]: Not yet. We do a few things that people don't generally do in OTel. So we send information at the beginning of a trace as well as, sorry, at the beginning of a span, as well as when it finishes. By default, OTel only sends you data when the span finishes. So if you think about a request which might take like 20 seconds, even if some of the intermediate spans finished earlier, you can't basically place them on the page until you get the top level span. And so if you're using standard OTel, you can't show anything until those requests are finished. When those requests are taking a few hundred milliseconds, it doesn't really matter. But when you're doing GenAI calls or when you're, like, running a batch job that might take 30 minutes, that latency of not being able to see the span is, like, crippling to understanding your application. And so we do a bunch of slightly complex stuff to basically send data about a span as it starts, which is closely related. Yeah.Alessio [00:40:09]: Any thoughts on all the other people trying to build on top of OpenTelemetry in different languages, too? There's like the OpenLLMetry project, which doesn't really roll off the tongue. But how do you see the future of these kind of tools? Is everybody going to have to build? Why does everybody want to build? They want to build their own open source observability thing to then sell?Samuel [00:40:29]: I mean, we are not going off and trying to instrument the likes of the OpenAI SDK with the new semantic attributes, because at some point that's going to happen and it's going to live inside OTel and we might help with it. But we're a tiny team. We don't have time to go and do all of that work. So OpenLLMetry, like, interesting project. But I suspect eventually most of that semantic instrumentation of the big SDKs will live, like I say, inside the main OpenTelemetry repo, I suppose. What happens to the agent frameworks, what data you basically need at the framework level to get the context, is kind of unclear. I don't think we know the answer yet. But I mean, I was on the, I guess this is kind of semi-public, because I was on the OpenTelemetry call last week talking about GenAI. And there was someone from Arize talking about the challenges they have trying to get OpenTelemetry data out of LangChain, where it's not, like, natively implemented. And obviously they're having quite a tough time. And I was realizing, I hadn't really realized this before, but how lucky we are to primarily be talking about our own agent framework, where we have the control, rather than trying to go and instrument other people's.Swyx [00:41:36]: Sorry, I actually didn't know about this semantic conventions thing. It looks like, yeah, it's merged into main OTel. What should people know about this? I had never heard of it before.Samuel [00:41:45]: Yeah, I think it looks like a great start. I think there's some unknowns around how you send the messages that go back and forth, which is kind of the most important part. It's the most important thing of all. And that is moved out of attributes and into OTel events. OTel events in turn are moving from being on a span to being their own top-level API where you send data. So there's a bunch of churn still going on. I'm impressed by how fast the OTel community is moving on this project. I guess they, like everyone else, get that this is important, and it's something that people are crying out to get instrumentation of. So I'm kind of pleasantly surprised at how fast they're moving, but it makes sense.Swyx [00:42:25]: I'm just kind of browsing through the specification. I can already see that this basically bakes in whatever the previous paradigm was. So now they have gen_ai.usage.prompt_tokens and gen_ai.usage.completion_tokens. And obviously now we have reasoning tokens as well. And then only one form of sampling, which is top-p. You're basically baking in or sort of reifying things that you think are important today, but it's not a super foolproof way of doing this for the future. Yeah.Samuel [00:42:54]: I mean, that's what's neat about OTel is you can always go and send another attribute and that's fine. It's just there are a bunch that are agreed on. But I would say, you know, to come back to your previous point about whether or not we should be relying on one centralized abstraction layer, this stuff is moving so fast that if you start relying on someone else's standard, you risk basically falling behind because you're relying on someone else to keep things up to date.Swyx [00:43:14]: Or you fall behind because you've got other things going on.Samuel [00:43:17]: Yeah, yeah. That's fair. That's fair.
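To ground what emitting those gen_ai.* attributes looks like, here is a minimal sketch using the OpenTelemetry Python API; the attribute names follow the draft GenAI conventions discussed above and have been churning (the prompt/completion token names have since been revised), and call_model is a hypothetical stand-in for a real client call.

```python
# Sketch: manually instrumenting one LLM call with (draft) GenAI semantic attributes.
from opentelemetry import trace

tracer = trace.get_tracer("my-app")

def call_model(prompt: str) -> dict:
    # Hypothetical stand-in for your real provider call; returns text plus token counts.
    return {"text": "stub answer", "prompt_tokens": 12, "completion_tokens": 3}

def chat(prompt: str) -> str:
    with tracer.start_as_current_span("chat gpt-4o") as span:
        # Attribute names track the in-progress gen_ai.* conventions; they may
        # differ in the spec version you target.
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", "gpt-4o")
        completion = call_model(prompt)
        span.set_attribute("gen_ai.usage.prompt_tokens", completion["prompt_tokens"])
        span.set_attribute("gen_ai.usage.completion_tokens", completion["completion_tokens"])
        return completion["text"]

print(chat("Hello"))
```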
Swyx [00:43:19]: Any other observations just about building LogFire, actually? Let's just talk about this. So you announced LogFire. I was kind of only familiar with LogFire because of your Series A announcement. I actually thought you were making a separate company. I remember some amount of confusion with you when that came out. So to be clear, it's Pydantic LogFire and the company is one company that has kind of two products, an open source thing and an observability thing, correct? Yeah. I was just kind of curious, like, any learnings building LogFire? So classic question is, do you use ClickHouse? Is this like the standard persistence layer? Any learnings doing that?Samuel [00:43:54]: We don't use ClickHouse. We started building our database with ClickHouse, moved off ClickHouse onto Timescale, which is a Postgres extension to do analytical databases. Wow. And then moved off Timescale onto DataFusion. And we're basically now building, it's DataFusion, but it's kind of our own database. Bogomil is not entirely happy that we went through three databases before we chose one. I'll say that. But like, we've got to the right one in the end. I think we could have realized sooner that Timescale wasn't right. They both taught us a lot and we're in a great place now. But like, yeah, it's been a real journey on the database in particular.Swyx [00:44:28]: Okay. So, you know, as a database nerd, I have to like double click on this, right? So ClickHouse is supposed to be the ideal backend for anything like this. And then moving from ClickHouse to Timescale is another counterintuitive move that I didn't expect because, you know, Timescale is like an extension on top of Postgres, not super meant for like high volume logging. But like, yeah, tell us those decisions.Samuel [00:44:50]: So at the time, ClickHouse did not have good support for JSON. I was speaking to someone yesterday and said ClickHouse doesn't have good support for JSON and got roundly stepped on because apparently it does now. So they've obviously gone and built their proper JSON support. But like back when we were trying to use it, I guess a year ago or a bit more than a year ago, everything happened to be a map and maps are a pain to try and do like looking up JSON type data. And obviously all these attributes, everything you're talking about there in terms of the GenAI stuff, you can choose to make them top level columns if you want. But the simplest thing is just to put them all into a big JSON pile. And that was a problem with ClickHouse. Also, ClickHouse had some really ugly edge cases: like by default, or at least until I complained about it a lot, ClickHouse thought that two nanoseconds was longer than one second because they compared intervals just by the number, not the unit. And I complained about that a lot. And then they changed it to raise an error and just say you have to have the same unit. Then I complained a bit more. And I think, as I understand it, now they convert between units. But stuff like that, when a lot of what you're doing is comparing the duration of spans, was really painful. Also things like you can't subtract two date times to get an interval. You have to use the date sub function. But like the fundamental thing is because we want our end users to write SQL, the like quality of the SQL, how easy it is to write, matters way more to us than if you're building like a platform on top where your developers are going to write the SQL. And once it's written and it's working, you don't mind too much. So I think that's like one of the fundamental differences. The other problem that I have with both ClickHouse and, in fact, Timescale is that the ultimate architecture, the, like, Snowflake architecture of binary data in object store queried with some kind of cache from nearby, they both have it, but it's closed source and you only get it if you go and use their hosted versions.
And so even if we had got through all the problems with Timescale or ClickHouse, we would end up, like, you know, they would want to be taking their 80% margin, and then we would be wanting to take ours, and that would basically leave us less space for margin. Whereas DataFusion is properly open source; all of that same tooling is open source. And for us as a team of people with a lot of Rust expertise, DataFusion, which is implemented in Rust, we can literally dive into it and go and change it. So, for example, I found that there were some slowdowns in DataFusion's string comparison kernel for doing, like, string contains. And it's just Rust code. And I could go and rewrite the string comparison kernel to be faster. Or, for example, DataFusion, when we started using it, didn't have JSON support. Obviously, as I've said, it's something we can do. It's something we needed. I was able to go and implement that in a weekend using our JSON parser that we built for Pydantic Core. So it's the fact that, like, DataFusion is for us the perfect mixture of a toolbox to build a database with, not a database. And we can go and implement stuff on top of it in a way that, like, if you were trying to do that in Postgres or in ClickHouse. I mean, ClickHouse would be easier because it's C++, relatively modern C++. But like as a team of people who are not C++ experts, that's much scarier than DataFusion for us.Swyx [00:47:47]: Yeah, that's a beautiful rant.Alessio [00:47:49]: That's funny. Most people don't think they have agency on these projects. They're kind of like, oh, I should use this or I should use that. They're not really like, what should I pick so that I contribute the most back to it? You know, so but I think you obviously have an open source first mindset. So that makes a lot of sense.Samuel [00:48:05]: I think if we were probably better as a startup, a better startup and faster moving and just like headlong determined to get in front of customers as fast as possible, we should have just started with ClickHouse. I hope that long term we're in a better place for having worked with DataFusion. We're quite engaged now with the DataFusion community. Andrew Lamb, who maintains DataFusion, is an advisor to us. We're in a really good place now. But yeah, it's definitely slowed us down relative to just like building on ClickHouse and moving as fast as we can.
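For readers who want to poke at DataFusion the way Samuel describes, here is a minimal sketch using the DataFusion Python bindings; the exact function names (SessionContext, register_csv, sql) are my assumptions about the current API, so treat this as illustrative rather than authoritative.

```python
# Sketch: DataFusion as a "toolbox to build a database with" -- register a file
# and run SQL over it in-process. API names are assumptions; check the docs.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_csv("spans", "spans.csv")  # e.g. columns: name, duration_ms, model

df = ctx.sql(
    "SELECT name, avg(duration_ms) AS avg_ms "
    "FROM spans GROUP BY name ORDER BY avg_ms DESC LIMIT 10"
)
df.show()
```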
And as far as I know, we're the only people really trying to do that. I mean, I think Datadog is starting in that direction. And to be honest, I think Datadog is a much like scarier company to compete with than the AI specific observability platforms. Because in my opinion, and I've also heard this from lots of customers, AI specific observability where you don't see everything else going on in your app is not actually that useful. Our hope is that we can build the first general purpose observability platform with first class support for AI. And that we have this open source heritage of putting developer experience first that other companies haven't done. For all I'm a fan of Datadog and what they've done. If you search Datadog logging Python. And you just try as a like a non-observability expert to get something up and running with Datadog and Python. It's not trivial, right? That's something Sentry have done amazingly well. But like there's enormous space in most of observability to do DX better.Alessio [00:50:27]: Since you mentioned Sentry, I'm curious how you thought about licensing and all of that. Obviously, your MIT license, you don't have any rolling license like Sentry has where you can only use an open source, like the one year old version of it. Was that a hard decision?Samuel [00:50:41]: So to be clear, LogFire is co-sourced. So Pydantic and Pydantic AI are MIT licensed and like properly open source. And then LogFire for now is completely closed source. And in fact, the struggles that Sentry have had with licensing and the like weird pushback the community gives when they take something that's closed source and make it source available just meant that we just avoided that whole subject matter. I think the other way to look at it is like in terms of either headcount or revenue or dollars in the bank. The amount of open source we do as a company is we've got to be open source. We're up there with the most prolific open source companies, like I say, per head. And so we didn't feel like we were morally obligated to make LogFire open source. We have Pydantic. Pydantic is a foundational library in Python. That and now Pydantic AI are our contribution to open source. And then LogFire is like openly for profit, right? As in we're not claiming otherwise. We're not sort of trying to walk a line if it's open source. But really, we want to make it hard to deploy. So you probably want to pay us. We're trying to be straight. That it's to pay for. We could change that at some point in the future, but it's not an immediate plan.Alessio [00:51:48]: All right. So the first one I saw this new I don't know if it's like a product you're building the Pydantic that run, which is a Python browser sandbox. What was the inspiration behind that? We talk a lot about code interpreter for lamps. I'm an investor in a company called E2B, which is a code sandbox as a service for remote execution. Yeah. What's the Pydantic that run story?Samuel [00:52:09]: So Pydantic that run is again completely open source. I have no interest in making it into a product. We just needed a sandbox to be able to demo LogFire in particular, but also Pydantic AI. So it doesn't have it yet, but I'm going to add basically a proxy to OpenAI and the other models so that you can run Pydantic AI in the browser. See how it works. Tweak the prompt, et cetera, et cetera. And we'll have some kind of limit per day of what you can spend on it or like what the spend is. The other thing we wanted to b

Thinking Elixir Podcast
239: Scaling to Unicorn Status

Thinking Elixir Podcast

Play Episode Listen Later Feb 4, 2025 29:09


News includes an impressive case study from Remote showing how they scaled Elixir to support nearly 300 engineers and reach unicorn status, Tailwind CSS 4.0's major release with Phoenix integration in progress, Chris McCord teasing an exciting AI code generator project on Fly.io, the release of Elixir LangChain v0.3.0 with expanded support for multiple AI providers, ElixirConfEU 2025 tickets going on sale in Kraków, and more! Show Notes online - http://podcast.thinkingelixir.com/239 (http://podcast.thinkingelixir.com/239) Elixir Community News https://elixir-lang.org/blog/2025/01/21/remote-elixir-case/ (https://elixir-lang.org/blog/2025/01/21/remote-elixir-case/?utm_source=thinkingelixir&utm_medium=shownotes) – New case study about Remote, a unicorn company using Elixir as their primary technology with nearly 300 engineers. https://github.com/sasa1977/boundary (https://github.com/sasa1977/boundary?utm_source=thinkingelixir&utm_medium=shownotes) – Remote uses Saša Jurić's Boundary library to help enforce boundaries in their monolithic codebase. https://www.reddit.com/r/elixir/comments/1i77ia9/comment/m8il2ho/ (https://www.reddit.com/r/elixir/comments/1i77ia9/comment/m8il2ho/?utm_source=thinkingelixir&utm_medium=shownotes) – Discussion about the type spec future in Elixir, with plans to replace Dialyzer typespecs in versions 1.19 and 1.20. https://bsky.app/profile/zachdaniel.dev/post/3lgqdugp7zs2b (https://bsky.app/profile/zachdaniel.dev/post/3lgqdugp7zs2b?utm_source=thinkingelixir&utm_medium=shownotes) – Ash installer now supports Oban integration via a flag option. https://tailwindcss.com/blog/tailwindcss-v4 (https://tailwindcss.com/blog/tailwindcss-v4?utm_source=thinkingelixir&utm_medium=shownotes) – Tailwind CSS 4.0 released with major changes including moving theme configuration to CSS variables. https://tailwindcss.com/docs/upgrade-guide (https://tailwindcss.com/docs/upgrade-guide?utm_source=thinkingelixir&utm_medium=shownotes) – Comprehensive upgrade guide for Tailwind CSS v4. https://github.com/phoenixframework/phoenix/pull/5990 (https://github.com/phoenixframework/phoenix/pull/5990?utm_source=thinkingelixir&utm_medium=shownotes) – WIP PR to support Tailwind v4 in Phoenix. https://bsky.app/profile/zachdaniel.dev/post/3lggmuk4dis2x (https://bsky.app/profile/zachdaniel.dev/post/3lggmuk4dis2x?utm_source=thinkingelixir&utm_medium=shownotes) – Zach Daniel shares how Tailwind v4 changes will improve igniter's utility configuration capabilities. https://github.com/brainlid/langchain (https://github.com/brainlid/langchain?utm_source=thinkingelixir&utm_medium=shownotes) – Elixir LangChain v0.3.0 released with expanded support for OpenAI, Anthropic, Gemini, Llama, and more. https://x.com/chris_mccord/status/1880377175200669770 (https://x.com/chris_mccord/status/1880377175200669770?utm_source=thinkingelixir&utm_medium=shownotes) – Chris McCord teases new Fly.io AI code generator project with IDE/terminal integration. https://x.com/chris_mccord/status/1880392153924530376 (https://x.com/chris_mccord/status/1880392153924530376?utm_source=thinkingelixir&utm_medium=shownotes) – Demo video of Chris McCord's AI-integrated editor creating a multiplayer Phoenix LiveView app. https://www.elixirconf.eu/ (https://www.elixirconf.eu/?utm_source=thinkingelixir&utm_medium=shownotes) – ElixirConfEU 2025 tickets on sale, happening May 15-16 in Kraków Poland & Virtual. Do you have some Elixir news to share? 
Tell us at @ThinkingElixir (https://twitter.com/ThinkingElixir) or email at show@thinkingelixir.com (mailto:show@thinkingelixir.com) Find us online - Message the show - Bluesky (https://bsky.app/profile/thinkingelixir.com) - Message the show - X (https://x.com/ThinkingElixir) - Message the show on Fediverse - @ThinkingElixir@genserver.social (https://genserver.social/ThinkingElixir) - Email the show - show@thinkingelixir.com (mailto:show@thinkingelixir.com) - Mark Ericksen on X - @brainlid (https://x.com/brainlid) - Mark Ericksen on Bluesky - @brainlid.bsky.social (https://bsky.app/profile/brainlid.bsky.social) - Mark Ericksen on Fediverse - @brainlid@genserver.social (https://genserver.social/brainlid) - David Bernheisel on Bluesky - @david.bernheisel.com (https://bsky.app/profile/david.bernheisel.com) - David Bernheisel on Fediverse - @dbern@genserver.social (https://genserver.social/dbern)

En.Digital Podcast
La Tertul-IA #43: La IA amenaza las matemáticas, agentes ambientales y AI washing

En.Digital Podcast

Play Episode Listen Later Jan 31, 2025 72:13


This week we review OpenAI's news with Tasks and Operator, the chemistry Nobel laureate's words saying that AI will put an end to technical professions, the ambient agents trend, the correlation between Google search results and appearing as LLM answers, and much more. Subscribe to the Tertul-IA newsletter and our podcast at https://tertulia.mumbler.io/ (00:00) Intro and introductions. (02:25) OpenAI Tasks and Operator (19:08) The chemistry Nobel laureate says AI will put an end to technical careers (35:44) Ambient Agents (46:11) A study correlates mentions in LLM answers with ranking in Google Search (53:30) The ALIA family of Spanish-language AI models (1:03:30) AI Washing. Sources: OpenAI Tasks and Operator https://help.openai.com/en/articles/10291617-scheduled-tasks-in-chatgpt https://openai.com/index/introducing-operator/ What's next for agentic AI? LangChain founder looks to ambient agents https://venturebeat.com/ai/whats-next-for-agentic-ai-langchain-founder-looks-to-ambient-agents/ A study shows a correlation for brands between their Google ranking and mentions in ChatGPT answers https://www.linkedin.com/posts/oscarlopezcuesta_google-search-rankings-correlate-with-chatgpt-activity-7284433988855705600-7FZb/ ALIA: the family of Spanish models https://www.linkedin.com/posts/mintradigital_alia-ugcPost-7287164111388315648-TOgC/ India needs regulatory guidelines to combat AI washing https://www.lexology.com/library/detail.aspx?g=3cf1b95b-3094-479e-a4c6-2c7352164d88

IFTTD - If This Then Dev
#307.src - Langchain: Faire de l'IA comme des Lego avec Maxime Thoonsen

IFTTD - If This Then Dev

Play Episode Listen Later Jan 29, 2025 67:49


"c'est hyper facile en fait" Le D.E.V. de la semaine est Maxime Thoonsen, fondateur @ AGO. Maxime nous parle des avantages de LangChain, un outil d'intelligence artificielle générative qu'il juge crucial pour l'exécution de tâches complexes. Comparant les agents IA modernes aux chatbots traditionnels, Maxime met en lumière la capacité des IA à interpréter des requêtes textuelles. Il souligne l'utilité de LangChain face à d'autres frameworks, mettant en avant son aide dans l'automatisation du codage. Maxime insiste aussi sur l'émergence des outils no-code qui, selon lui, rendent l'IA accessible aux non-développeurs. Face à l'avenir technologique, il encourage l'apprentissage continu et l'exploitation de ces nouveaux outils.Chapitrages00:00:54 : Introduction à LongChain00:02:59 : Présentation de Maxime00:20:06 : Chatbots et IA Agentiques00:37:50 : Évolution des Agents et Développement00:52:45 : Outils Graphiques pour Non-Développeurs01:02:43 : Création d'Agents Sans Code01:06:04 : Conclusion et Recommandations Liens évoqués pendant l'émission "lean tech manifesto" de TheodoNot the end of the world **Recrutez les meilleurs développeurs grâce à Indeed !** "Trouver des développeurs compétents et passionnés, comme les auditeurs d'If This Then Dev, peut être un vrai défi. Avec Indeed, connectez-vous rapidement avec des candidats qualifiés qui sauront s'épanouir dans votre entreprise. Profitez dès maintenant d'un crédit de 100 euros pour sponsoriser votre offre d'emploi : go.indeed.com/IFTTD."🎙️ Soutenez le podcast If This Then Dev ! 🎙️ Chaque contribution aide à maintenir et améliorer nos épisodes. Cliquez ici pour nous soutenir sur Tipeee 🙏Archives | Site | Boutique | TikTok | Discord | Twitter | LinkedIn | Instagram | Youtube | Twitch | Job Board |

Les Cast Codeurs Podcast
LCC 321 - Les évènements écran large

Les Cast Codeurs Podcast

Play Episode Listen Later Jan 21, 2025 73:53


Arnaud and Emmanuel discuss Java versions, give a summary of the WebAssembly ecosystem, discuss the new Model Context Protocol, talk about observability, notably Wide Events, and plenty of other things. Recorded on January 17, 2025. Download the episode LesCastCodeurs-Episode–321.mp3 or watch it as a video on YouTube. News Languages Java trends by InfoQ https://www.infoq.com/articles/java-trends-report–2024/ Java 17 finally overtakes 11 and 8, at ~30/33%; Java 21 is at 1.4%; Commonhaus appears; GraalVM in early majority; Spring AI and LangChain4j among the innovators; SB 3 sees its adoption grow. A good summary of WebAssembly and the various specs such as WASM GC, WASI, WIT, etc. https://2ality.com/2025/01/webassembly-language-ecosystem.html WebAssembly (Wasm) is a binary instruction format for a stack-based virtual machine, enabling code portability and efficiency. Wasm evolved from asm.js, a subset of JavaScript that could run at near-native speeds. WASI (WebAssembly System Interface) lets Wasm run outside web browsers, providing APIs for the file system, CLI, HTTP, etc. The WebAssembly component model enables interoperability between Wasm languages using WIT (Wasm Interface Type) and a canonical ABI. Wasm components consist of a core module and WIT interfaces for imports/exports, enabling language-independent interaction. WIT interfaces describe types and functions, while WIT worlds define a component's capabilities and needs (imports/exports). Wasm package management is handled by Warg, a protocol for Wasm package registries. A survey showed that Rust is the most used Wasm language, followed by Kotlin and C++; many other languages are also emerging. A memory-bounded counting algorithm has been invented https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count–20240516/ It removes a word at random, but with a known probability, when space needs to be reclaimed; this happens in rounds and the removal probability is increased at each round, so in the end the number of words divided by the probability of having been kept gives an approximate but fairly precise measure.
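As a rough sketch of that counting idea (keep a bounded sample, halve the keep-probability each time it fills up, then divide by the final probability), in Python:

```python
import random

def estimate_distinct(stream, max_size=1000):
    """Estimate the number of distinct items in a stream using at most
    max_size items of memory (CVM-style rounds of random eviction)."""
    keep_prob = 1.0          # probability that a given distinct item survived so far
    sample = set()
    for item in stream:
        sample.discard(item)                 # forget any previous decision for this item
        if random.random() < keep_prob:
            sample.add(item)
        if len(sample) >= max_size:          # out of space: start a new round
            sample = {x for x in sample if random.random() < 0.5}
            keep_prob /= 2                   # survivors are kept with half the probability
    return len(sample) / keep_prob

# Example: a stream with ~50,000 distinct words, counted with 1,000 slots.
words = [f"word{random.randint(0, 49_999)}" for _ in range(500_000)]
print(round(estimate_distinct(words)))
```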
Libraries Spring contributions move from the CLA to the DCO https://spring.io/blog/2025/01/06/hello-dco-goodbye-cla-simplifying-contributions-to-spring At first manual, but now even automated; the CLA is a complex legal document that can limit contributions. The DCO comes from Linux, I believe, and is super simple: you agree that the license of the contribution is that of the project, and that the code is public and distributed in perpetuity; it relies on git's -s flag for the sign-off. Writing an MCP server in Quarkus https://quarkus.io/blog/mcp-server/ MCP is a protocol proposed by Anthropic to integrate tools orchestrated by LLMs. MCP is fresh and goes further than tools: it offers the notion of resources (files), functions (tools), and pre-built prompts to call the tool in the best way. We'll come back to it with agents in a following article. There is a Quarkus extension to simplify the coding; a more detailed article on the Quarkus integration https://quarkus.io/blog/quarkus-langchain4j-mcp/ GreenMail, a mini mail server in Java https://greenmail-mail-test.github.io/greenmail/#features-api Useful for integration tests. Supports SMTP, POP3 and IMAP with TLS/SSL. Offers JUnit and Spring integrations. A mini UI and REST APIs let you interact with the server if, for example, you share it in a container (there is no existing Testcontainers integration, but it would not be complicated to write). Infrastructure Docker Bake in a visual way https://dev.to/aurelievache/understanding-docker-part–47-docker-bake–4p05 Docker Bake proposes using configuration files (HCL format) to run your image builds and docker compose; roughly, see this DSL as a very simplified Makefile for docker commands, which can often have a few too many parameters. Datadog continues to expand with the acquisition of Quickwit https://www.datadoghq.com/blog/datadog-acquires-quickwit/ An open-source log search solution that can be deployed on-premise and in the cloud https://quickwit.io/ Logs no longer leave your environment, which helps meet security, privacy and regulatory requirements. Web 33 JavaScript concepts https://github.com/leonardomso/33-js-concepts Call Stack, Primitive Types, Value Types and Reference Types, Implicit, Explicit, Nominal, Structuring and Duck Typing, == vs === vs typeof, Function Scope, Block Scope and Lexical Scope, Expression vs Statement, IIFE, Modules and Namespaces, Message Queue and Event Loop, setTimeout, setInterval and requestAnimationFrame, JavaScript Engines, Bitwise Operators, Type Arrays and Array Buffers, DOM and Layout Trees, Factories and Classes, this, call, apply and bind, new, Constructor, instanceof and Instances, Prototype Inheritance and Prototype Chain, Object.create and Object.assign, map, reduce, filter, Pure Functions, Side Effects, State Mutation and Event Propagation, Closures, High Order Functions, Recursion, Collections and Generators, Promises, async/await, Data Structures, Expensive Operation and Big O
Notation, Algorithms, Inheritance, Polymorphism and Code Reuse, Design Patterns, Partial Applications, Currying, Compose and Pipe, Clean Code Data and Artificial Intelligence Phi 4 and small language models https://techcommunity.microsoft.com/blog/aiplatformblog/introducing-phi–4-microsoft%e2%80%99s-newest-small-language-model-specializing-in-comple/4357090 Phi 4 is an SLM aimed notably at local use; 14B parameters; a nice progression of ~20 points on an aggregated score, which brings it close to Llama 3.3 and its 70B parameters; good at math (synthetic data set). How to use Gemini 2.0 Flash Thinking (Google's model that does chain-of-thought style reasoning) in Java with LangChain4j https://glaforge.dev/posts/2024/12/20/lets-think-with-gemini–2-thinking-mode-and-langchain4j/ Google released Gemini 2.0 Flash, a small model in the Gemini family; the “thinking mode” simulates the chains of reasoning (Chain of Thought, etc.) and decomposes complex tasks much more into several sub-tasks; an example is shown of the model wrestling with the problem. Anthropic's recommendations on agent systems https://www.anthropic.com/research/building-effective-agents It defines agents and workflows. It does not recommend starting with frameworks (LangChain, Amazon Bedrock AI Agent, etc.) - the famous debate about abstraction; many patterns can be implemented in a few lines without a framework. Several building blocks of increasing complexity. Augmented LLM (RAG, memory, etc.): Anthropic says LLMs can coordinate this via MCP, for example. Second: the prompt chaining workflow, with gates between successive LLM calls; it favors precision over latency since tasks are decomposed into several LLM calls. Routing workflow: classify an input and choose the best route: separation of responsibilities. Parallelization workflow: LLMs work in parallel on a task and an aggregator does the synthesis; parallelization by slicing up the task, or by voting on the best answer. Orchestrator-workers workflow: when the tasks are not bounded or known in advance (like the number of code files to change) - the sub-tasks are not predefined. Evaluator-optimizer workflow: one LLM proposes an answer, another LLM evaluates it and asks for a better answer if needed. Agents: a command or an interaction with the human, then autonomous, even if it can come back to ask the human for clarification. Agents are often LLMs using tools to modify the environment and reacting to feedback in a loop; ideal for open-ended problems where the number of steps is not known. It recommends going in with progressively increasing complexity.
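To make the prompt-chaining workflow just described concrete, here is a minimal sketch; call_llm is a hypothetical stand-in for whatever client you actually use, and the gate condition is purely illustrative.

```python
# Sketch of the "prompt chaining" workflow pattern: decompose a task into
# sequential LLM calls with a simple gate between steps.
def write_marketing_copy(product_brief: str, call_llm) -> str:
    outline = call_llm(f"Write a five-bullet outline for marketing copy about: {product_brief}")
    # Gate: sanity-check the intermediate result before spending more tokens.
    if outline.count("\n") < 4:
        raise ValueError("outline did not come back in the expected shape")
    draft = call_llm(f"Expand this outline into two short paragraphs:\n{outline}")
    return call_llm(f"Translate the following copy into French, keeping the tone:\n{draft}")

# Example usage with a fake model so the sketch runs end to end:
if __name__ == "__main__":
    fake = lambda prompt: "- a\n- b\n- c\n- d\n- e" if "outline" in prompt else "some copy"
    print(write_marketing_copy("a terminal emulator", fake))
```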
AI isn't cheap https://techcrunch.com/2025/01/05/openai-is-losing-money-on-its-pricey-chatgpt-pro-plan-ceo-sam-altman-says/ OpenAI announces that even with subscriptions at $200/month they do not cover the associated costs… When will the AI bubble burst? Tooling Ghostty, a new terminal for Linux and macOS: https://ghostty.org/ Initiated by Mitchell Hashimoto (HashiCorp). Ghostty is a native terminal emulator for macOS and Linux. It is written in Swift and uses AppKit and SwiftUI on macOS, and in Zig using the GTK4 C API on Linux. It uses native UI components and standard keyboard and mouse shortcuts. It supports Quick Look, Force Touch and other macOS-specific features. Ghostty tries to provide a rich set of features useful for everyday use. How Pinterest uses Honeycomb to improve its CI https://medium.com/pinterest-engineering/how-pinterest-leverages-honeycomb-to-enhance-ci-observability-and-improve-ci-build-stability–15eede563d75 Pinterest uses Honeycomb to improve the observability of its continuous integration (CI). Honeycomb lets Pinterest visualize build metrics, analyze trends and make data-driven decisions. Honeycomb also helps Pinterest identify the potential causes of build failures and streamline on-call tasks. Honeycomb can also be used to track local iOS build metrics alongside machine details, which helps Pinterest prioritize laptop upgrades for developers. Methodologies Following our episode on the different types of documentation, this article covers good practices for tutorials https://refactoringenglish.com/chapters/rules-for-software-tutorials/ Write tutorials for beginners, avoiding jargon and complex terminology. Promise a clear outcome in the title and explain the goal in the introduction. Show the end result early to reduce ambiguity. Make code snippets copy-pastable, avoiding shell prompts and interactive commands. Use the long versions of command-line flags for clarity. Separate user-defined values from reusable logic using environment variables or named constants. Spare the reader pointless tasks by using scripts. Let computers evaluate conditional logic, not the reader. Keep the code in a working state throughout the tutorial. Teach one thing per tutorial and minimize dependencies. Wide events, a “new” concept in observability https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/ another article https://isburmistrov.substack.com/p/all-you-need-is-wide-events-not-metrics The idea is to log events (think JSON logs) with as much information as possible: the machine, the RAM, the app version, the user, the build number that produced the app, the latest PR, etc. It lets you filter, group by, spot visual correlations very quickly and zoom in: hey, sales are down 20%; hey, it actually comes from the Android app; hey, it doesn't correlate with the app version, but it does with the OS version! The second article is easy to read; the first is an exhaustive usage guide for the concept.
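As a small illustration of what a wide event can look like in practice, here is a minimal sketch; the field names and values are invented for illustration, not a prescribed schema.

```python
# Sketch of a "wide event": one JSON log line per unit of work, packed with
# context so you can filter and group-by later (version, build, platform, user...).
import json, logging, time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("wide-events")

def emit_checkout_event(order: dict, user: dict, request: dict) -> None:
    event = {
        "timestamp": time.time(),
        "event": "checkout_completed",
        "order_id": order["id"],
        "amount_eur": order["total"],
        "user_id": user["id"],
        "user_plan": user["plan"],
        "app_version": "2025.01.17",   # illustrative values
        "build_sha": "abc1234",
        "platform": request["platform"],
        "os_version": request["os_version"],
        "duration_ms": request["duration_ms"],
    }
    log.info(json.dumps(event))

emit_checkout_event(
    {"id": 42, "total": 19.9},
    {"id": "u1", "plan": "pro"},
    {"platform": "android", "os_version": "15", "duration_ms": 230},
)
```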
Between arguing and giving it five minutes https://signalvnoise.com/posts/3124-give-it-five-minutes We often want to argue, that is, ask questions while already emotionally holding the answer, but that brings a lot of verbiage; give the idea five minutes, time to think about it, before arguing. Law, society and organization Federal judges halt the principle of net neutrality https://www.lemonde.fr/pixels/article/2025/01/03/les-etats-unis-reviennent-en-arriere-sur-le-principe-de-la-neutralite-du-net_6479575_4408996.html?lmd_medium=al&lmd_campaign=envoye-par-appli&lmd_creation=ios&lmd_source=default Net neutrality is the prohibition on treating a packet differently depending on its sender, for example a Netflix packet being slowed down versus an Amazon packet. Donald Trump is against this neutrality. We'll see the concrete impacts in a less regulated market. Beginner corner A short article on float vs double in Java https://www.baeldung.com/java-float-vs-double 4 vs 8 bytes; max precision of 7 vs 15 digits; scale of 10^38 vs 10^308 (order of magnitude); performance roughly similar, except maybe for AI models, which sometimes favor a smaller size; beware of overflow and the accumulation of approximation errors; BigDecimal. Conferences The list of conferences comes from Developers Conferences Agenda/List by Aurélie Vache and contributors: 20 January 2025: Elastic{ON} - Paris (France) 22–25 January 2025: SnowCamp 2025 - Grenoble (France) 24–25 January 2025: Agile Games Île-de-France 2025 - Paris (France) 6–7 February 2025: Touraine Tech - Tours (France) 21 February 2025: LyonJS 100 - Lyon (France) 28 February 2025: Paris TS La Conf - Paris (France) 6 March 2025: DevCon #24 : 100% IA - Paris (France) 13 March 2025: Oracle CloudWorld Tour Paris - Paris (France) 14 March 2025: Rust In Paris 2025 - Paris (France) 19–21 March 2025: React Paris - Paris (France) 20 March 2025: PGDay Paris - Paris (France) 20–21 March 2025: Agile Niort - Niort (France) 25 March 2025: ParisTestConf - Paris (France) 26–29 March 2025: JChateau Unconference 2025 - Cour-Cheverny (France) 27–28 March 2025: SymfonyLive Paris 2025 - Paris (France) 28 March 2025: DataDays - Lille (France) 28–29 March 2025: Agile Games France 2025 - Lille (France) 3 April 2025: DotJS - Paris (France) 3 April 2025: SoCraTes Rennes 2025 - Rennes (France) 4 April 2025: Flutter Connection 2025 - Paris (France) 10–11 April 2025: Android Makers - Montrouge (France) 10–12 April 2025: Devoxx Greece - Athens (Greece) 16–18 April 2025: Devoxx France - Paris (France) 23–25 April 2025: MODERN ENDPOINT MANAGEMENT EMEA SUMMIT 2025 - Paris (France) 24 April 2025: IA Data Day 2025 - Strasbourg (France) 29–30 April 2025: MixIT - Lyon (France) 7–9 May 2025: Devoxx UK - London (UK) 15 May 2025: Cloud Toulouse - Toulouse (France) 16 May 2025: AFUP Day 2025 Lille - Lille (France) 16 May 2025: AFUP Day 2025 Lyon - Lyon (France) 16 May 2025: AFUP Day 2025 Poitiers - Poitiers (France) 24 May 2025: Polycloud - Montpellier (France) 5–6 June 2025: AlpesCraft - Grenoble (France) 5–6 June 2025: Devquest 2025 - Niort (France) 11–13 June 2025: Devoxx Poland - Krakow (Poland) 12–13 June 2025: Agile Tour Toulouse - Toulouse (France) 12–13 June 2025: DevLille - Lille (France) 17 June 2025: Mobilis In Mobile - Nantes (France) 24 June 2025: WAX 2025 - Aix-en-Provence (France) 25–27 June 2025: BreizhCamp 2025 - Rennes (France) 26–27 June 2025: Sunny Tech - Montpellier (France) 1–4 July 2025: Open edX Conference - 2025 - Palaiseau (France) 7–9 July 2025: Riviera DEV 2025 - Sophia Antipolis (France) 18–19 September 2025: API Platform Conference - Lille (France) & Online 2–3 October 2025: Volcamp - Clermont-Ferrand (France) 6–10 October 2025: Devoxx Belgium - Antwerp (Belgium) 9–10 October 2025: Forum PHP 2025 - Marne-la-Vallée (France) 16–17 October 2025: DevFest Nantes - Nantes (France) 4–7 November 2025: NewCrafts 2025 - Paris (France) 6 November 2025: dotAI 2025 - Paris (France) 7 November 2025: BDX I/O - Bordeaux (France) 12–14 November 2025: Devoxx Morocco - Marrakech (Morocco) 23–25 April 2026: Devoxx Greece - Athens (Greece) 17 June 2026: Devoxx Poland - Krakow (Poland) Contact us To react to this episode, come discuss in the Google group https://groups.google.com/group/lescastcodeurs Contact us via X/twitter https://twitter.com/lescastcodeurs or
Bluesky https://bsky.app/profile/lescastcodeurs.com Do a crowdcast or a crowdquestion Support Les Cast Codeurs on Patreon https://www.patreon.com/LesCastCodeurs All the episodes and all the info at https://lescastcodeurs.com/

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Applications for the 2025 AI Engineer Summit are up, and you can save the date for AIE Singapore in April and AIE World's Fair 2025 in June.Happy new year, and thanks for 100 great episodes! Please let us know what you want to see/hear for the next 100!Full YouTube Episode with Slides/ChartsLike and subscribe and hit that bell to get notifs!Timestamps* 00:00 Welcome to the 100th Episode!* 00:19 Reflecting on the Journey* 00:47 AI Engineering: The Rise and Impact* 03:15 Latent Space Live and AI Conferences* 09:44 The Competitive AI Landscape* 21:45 Synthetic Data and Future Trends* 35:53 Creative Writing with AI* 36:12 Legal and Ethical Issues in AI* 38:18 The Data War: GPU Poor vs. GPU Rich* 39:12 The Rise of GPU Ultra Rich* 40:47 Emerging Trends in AI Models* 45:31 The Multi-Modality War* 01:05:31 The Future of AI Benchmarks* 01:13:17 Pionote and Frontier Models* 01:13:47 Niche Models and Base Models* 01:14:30 State Space Models and RWKV* 01:15:48 Inference Race and Price Wars* 01:22:16 Major AI Themes of the Year* 01:22:48 AI Rewind: January to March* 01:26:42 AI Rewind: April to June* 01:33:12 AI Rewind: July to September* 01:34:59 AI Rewind: October to December* 01:39:53 Year-End Reflections and PredictionsTranscript[00:00:00] Welcome to the 100th Episode![00:00:00] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx for the 100th time today.[00:00:12] swyx: Yay, um, and we're so glad that, yeah, you know, everyone has, uh, followed us in this journey. How do you feel about it? 100 episodes.[00:00:19] Alessio: Yeah, I know.[00:00:19] Reflecting on the Journey[00:00:19] Alessio: Almost two years that we've been doing this. We've had four different studios. Uh, we've had a lot of changes. You know, we used to do this lightning round. When we first started that we didn't like, and we tried to change the question. The answer[00:00:32] swyx: was cursor and perplexity.[00:00:34] Alessio: Yeah, I love Midjourney. It's like, do you really not like anything else?[00:00:38] Alessio: Like what's, what's the unique thing? And I think, yeah, we, we've also had a lot more research driven content. You know, we had like Tri Dao, we had, you know, Jeremy Howard, we had more folks like that.[00:00:47] AI Engineering: The Rise and Impact[00:00:47] Alessio: I think we want to do more of that too in the new year, like having, uh, some of the Gemini folks, both on the research and the applied side.[00:00:54] Alessio: Yeah, but it's been a ton of fun. I think we both started, I wouldn't say as a joke, we were kind of like, Oh, we [00:01:00] should do a podcast. And I think we kind of caught the right wave, obviously. And I think your rise of the AI engineer posts just kind of gave people somewhere to congregate, and then the AI engineer summit.[00:01:11] Alessio: And that's why when I look at our growth chart, it's kind of like a proxy for like the AI engineering industry as a whole, which is almost like, like, even if we don't do that much, we keep growing just because there's so many more AI engineers. So did you expect that growth or did you expect that would take longer for like the AI engineer thing to kind of like become, you know, everybody talks about it today.[00:01:32] swyx: So, the sign of that, that we have won is that Gartner puts it at the top of the hype curve right now. So Gartner has called the peak in AI engineering. I did not expect, um, to what level.
I knew that I was correct when I called it because I did like two months of work going into that. But I didn't know, you know, how quickly it could happen, and obviously there's a chance that I could be wrong.[00:01:52] swyx: But I think, like, most people have come around to that concept. Hacker News hates it, which is a good sign. But there's enough people that have defined it, you know, GitHub, when [00:02:00] they launched GitHub Models, which is the Hugging Face clone, they put AI engineers in the banner, like, above the fold, like, in big. So I think it's like kind of arrived as a meaningful and useful definition.[00:02:12] swyx: I think people are trying to figure out where the boundaries are. I think that was a lot of the quote unquote drama that happens behind the scenes at the World's Fair in June. Because I think there's a lot of doubt or questions about where ML engineering stops and AI engineering starts. That's a useful debate to be had.[00:02:29] swyx: In some sense, I actually anticipated that as well. So I intentionally did not put a firm definition there because most of the successful definitions are necessarily underspecified and it's actually useful to have different perspectives and you don't have to specify everything from the outset.[00:02:45] Alessio: Yeah, I was at um, AWS reInvent and the line to get into like the AI engineering talk, so to speak, which is, you know, applied AI and whatnot was like, there are like hundreds of people just in line to go in.[00:02:56] Alessio: I think that's kind of what enabled people, right? Which is what [00:03:00] you kind of talked about. It's like, Hey, look, you don't actually need a PhD, just, yeah, just use the model. And then maybe we'll talk about some of the blind spots that you get as an engineer with the earlier posts that we also had on the Substack.[00:03:11] Alessio: But yeah, it's been a heck of a heck of a two years.[00:03:14] swyx: Yeah.[00:03:15] Latent Space Live and AI Conferences[00:03:15] swyx: You know, I was, I was trying to view the conference as like, so NeurIPS is I think like 16, 17,000 people. And the Latent Space Live event that we held there was 950 signups, I think. The AI world, the ML world is still very much research heavy. And that's as it should be because ML is very much in a research phase.[00:03:34] swyx: But as we move this entire field into production, I think that ratio inverts into becoming more engineering heavy. So at least I think engineering should be on the same level, even if it's never as prestigious, like it'll always be low status because at the end of the day, you're manipulating APIs or whatever.[00:03:51] swyx: But yeah, wrapping GPTs, but there's going to be an increasing stack and an art to doing these, these things well. And I, you know, I [00:04:00] think that's what we're focusing on for the podcast, the conference and basically everything I do seems to make sense. And I think we'll, we'll talk about the trends here that apply.[00:04:09] swyx: It's, it's just very strange. So, like, there's a mix of, like, keeping on top of research while not being a researcher and then putting that research into production. So, like, people always ask me, like, why are you covering NeurIPS?
Like, this is a ML research conference and I'm like, well, yeah, I mean, we're not going to, to like, understand everything or reproduce every single paper, but the stuff that is being found here is going to make it through into production at some point, you hope.[00:04:32] swyx: And then actually like when I talk to the researchers, they actually get very excited because they're like, oh, you guys are actually caring about how this goes into production and that's what they really really want. The measure of success is previously just peer review, right? Getting 7s and 8s on their, um, academic review conferences and stuff like citations is one metric, but money is a better metric.[00:04:51] Alessio: Money is a better metric. Yeah, and there were about 2200 people on the live stream or something like that. Yeah, yeah. Twenty-two hundred on the live stream. So [00:05:00] I try my best to moderate, but it was a lot spicier in person with Jonathan and, and Dylan. Yeah, that it was in the chat on YouTube.[00:05:06] swyx: I would say that I actually also created[00:05:09] swyx: Latent Space Live in order to address flaws that are perceived in academic conferences. This is not NeurIPS specific, it's ICML, NeurIPS. Basically, it's very sort of oriented towards the PhD student, uh, market, job market, right? Like literally all, basically everyone's there to advertise their research and skills and get jobs.[00:05:28] swyx: And then obviously all the, the companies go there to hire them. And I think that's great for the individual researchers, but for people going there to get info is not great because you have to read between the lines, bring a ton of context in order to understand every single paper. So what is missing is effectively what I ended up doing, which is domain by domain, go through and recap the best of the year.[00:05:48] swyx: Survey the field. And there are, like NeurIPS had a, uh, I think ICML had a like a position paper track, NeurIPS added a benchmarks, uh, datasets track. These are ways in which to address that [00:06:00] issue. Uh, there's always workshops as well. Every, every conference has, you know, a last day of workshops and stuff that provide more of an overview.[00:06:06] swyx: But they're not specifically prompted to do so. And I think really, uh, organizing a conference is just about getting good speakers and giving them the correct prompts. And then they will just go and do that thing and they do a very good job of it. So I think Sarah did a fantastic job with the startups prompt.[00:06:21] swyx: I can't list everybody, but we did best of 2024 in startups, vision, open models, post transformers, synthetic data, small models, and agents. And then the last one was the, uh, and then we also did a quick one on reasoning with Nathan Lambert. And then the last one, obviously, was the debate that people were very hyped about.[00:06:39] swyx: It was very awkward. And I'm really, really thankful for Jonathan Frankle, basically, who stepped up to challenge Dylan. Because Dylan was like, yeah, I'll do it. But he was pro scaling. And I think everyone who is like in AI is pro scaling, right? So you need somebody who's ready to publicly say, no, we've hit a wall.[00:06:57] swyx: So that means you're saying Sam Altman's wrong. [00:07:00] You're saying, um, you know, everyone else is wrong. It helps that this was the day before Ilya went on, went up on stage and then said pre training has hit a wall. And data has hit a wall.
So actually Jonathan ended up winning, and then Ilya supported that statement, and then Noam Brown on the last day further supported that statement as well.[00:07:17] swyx: So it's kind of interesting that I think the consensus kind of going in was that we're not done scaling, like you should believe in the bitter lesson. And then, four straight days in a row, you had Sepp Hochreiter, who is the creator of the LSTM, along with everyone's favorite OG in AI, which is Juergen Schmidhuber.[00:07:34] swyx: He said that, um, we're pre-training into a wall, or like, we've run into a different kind of wall. And then we have, you know, Jonathan Frankle, Ilya, and then Noam Brown all saying variations of the same thing, that we have hit some kind of wall in the status quo of what pre-trained, scaling large pre-trained models has looked like, and we need a new thing.[00:07:54] swyx: And obviously the new thing for people is, some people are calling it inference time compute or test time [00:08:00] compute. I think the collective terminology has been inference time, and I think that makes sense because test time, calling it "test", has a very pre-trained bias, meaning that the only reason for running inference at all is to test your model.[00:08:11] swyx: That is not true. Right. Yeah. So, so, I quite agree that OpenAI seems to have adopted, or the community seems to have adopted this terminology of ITC instead of TTC. And that, that makes a lot of sense because like now we care about inference, even right down to compute optimality. Like I actually interviewed this author who recovered or reviewed the Chinchilla paper.[00:08:31] swyx: The Chinchilla paper is compute optimal training, but what is not stated in there is it's pre-trained compute optimal training. And once you start caring about inference, compute optimal training, you have a different scaling law. And in a way that we did not know last year.[00:08:45] Alessio: I wonder, because John is, he's also on the side of attention is all you need.[00:08:49] Alessio: Like he had the bet with Sasha. So I'm curious, like he doesn't believe in scaling, but he thinks the transformer, I wonder if he's still. So, so,[00:08:56] swyx: so he, obviously everything is nuanced and you know, I told him to play a character [00:09:00] for this debate, right? So he actually does. Yeah. He still, he still believes that we can scale more.[00:09:04] swyx: Uh, he just assumed the character to be very game for, for playing this debate. So even more kudos to him that he assumed a position that he didn't believe in and still won the debate.[00:09:16] Alessio: Get rekt, Dylan. Um, do you just want to quickly run through some of these things? Like, uh, Sarah's presentation, just the highlights.[00:09:24] swyx: Yeah, we can't go through everyone's slides, but I pulled out some things as a factor of, like, stuff that we were going to talk about. And we'll[00:09:30] Alessio: publish[00:09:31] swyx: the rest. Yeah, we'll publish on this feed the best of 2024 in those domains. And hopefully people can benefit from the work that our speakers have done.[00:09:39] swyx: But I think it's, uh, these are just good slides. And I've been, I've been looking for a sort of end of year recaps from, from people.[00:09:44] The Competitive AI Landscape[00:09:44] swyx: The field has progressed a lot. You know, I think the max ELO in 2023 on LMSys used to be 1200 for LMSys ELOs.
And now everyone is at least at, uh, 1275 in their ELOs, and this is across Gemini, ChatGPT, [00:10:00] Grok, 01.[00:10:01] swyx: AI, with their Yi-Large model, and Anthropic, of course. It's a very, very competitive race. There are multiple frontier labs all racing, but there is a clear tier zero frontier. And then there's like a tier one. It's like, I wish I had everything else. Tier zero is extremely competitive. It's effectively now a three-horse race between Gemini, uh, Anthropic and OpenAI.[00:10:21] swyx: I would say that people are still holding out a candle for xAI. xAI, I think, for some reason, because their API was very slow to roll out, is not included in these metrics. So it's actually quite hard to put on there. As someone who also does charts, xAI is continually snubbed because they don't work well with the benchmarking people.[00:10:42] swyx: Yeah, yeah, yeah. It's a little trivia for why xAI always gets ignored. The other thing is market share. So these are slides from Sarah. We have it up on the screen. It has gone from very heavily OpenAI. So we have some numbers and estimates. These are from Ramp. Estimates of OpenAI market share in [00:11:00] December 2023.[00:11:01] swyx: And this is basically, what is it, GPT being 95 percent of production traffic. And I think if you correlate that with stuff that we asked Harrison Chase on the LangChain episode, it was true. And then Claude 3 launched middle of this year. I think Claude 3 launched in March, Claude 3.5 Sonnet was in June-ish.[00:11:23] swyx: And you can start seeing the market share shift towards, uh, towards Anthropic, uh, very, very aggressively. The more recent one is Gemini. So if I scroll down a little bit, this is an even more recent dataset. So Ramp's dataset ends in September 2024. Gemini has basically launched a price war at the low end, uh, with Gemini Flash, uh, being basically free for personal use.[00:11:44] swyx: Like, I think people don't understand the free tier. It's something like a billion tokens per day. Unless you're trying to abuse it, you cannot really exhaust your free tier on Gemini. They're really trying to get you to use it. They know they're in like third place, um, fourth place, depending how you, how you count.[00:11:58] swyx: And so they're going after [00:12:00] the lower tier first, and then, you know, maybe the upper tier later, but yeah, Gemini Flash, according to OpenRouter, is now 50 percent of their OpenRouter requests. Obviously, these are the small requests. These are small, cheap requests that are mathematically going to be more.[00:12:15] swyx: The smart ones obviously are still going to OpenAI. But, you know, it's a very, very big shift in the market. Like basically 2022, 2023, going into 2024, OpenAI has gone from 95 percent market share to, yeah, reasonably somewhere between 50 to 75 percent market share.[00:12:29] Alessio: Yeah. I'm really curious how Ramp does the attribution to the model.[00:12:32] Alessio: If it's API, because I think it's all credit card spend. Well, but it's all, the credit card doesn't say. Maybe the, maybe when they do expenses, they upload the PDF, but yeah, the, the Gemini one I think makes sense. I think that was one of my main 2024 takeaways, that like the best small model companies are the large labs, which is not something I would have thought, that the open source kind of like long tail would be like the small model.[00:12:53] swyx: Yeah, different sizes of small models we're talking about here, right?
Like so the small model here for Gemini is 8B, [00:13:00] right? Uh, mini. We don't know what the small model size is, but yeah, it's probably in the double digits or maybe single digits, but probably double digits. The open source community has kind of focused on the one to three B size.[00:13:11] swyx: Mm-hmm. Yeah. Maybe[00:13:12] swyx: zero, maybe 0.5B, uh, that's Moondream, and if that is small for you, then, then that's great. It makes sense that we, we have a range for small now, which is like, may, maybe one to five B. Yeah. I'll even put that at, at, at the high end. And so this includes Gemma from Gemini as well. But it also includes the Apple Foundation models, which I think Apple Foundation is 3B.[00:13:32] Alessio: Yeah. No, that's great. I mean, I think at the start small just meant cheap. I think today small is actually a more nuanced discussion, you know, that people weren't really having before.[00:13:43] swyx: Yeah, we can keep going. This is a slide where I slightly disagree with Sarah. She's pointing to the Scale SEAL leaderboard. I think the researchers that I talked with at NeurIPS were kind of positive on this because basically you need private test [00:14:00] sets to prevent contamination.[00:14:02] swyx: And Scale is one of maybe three or four people this year that has really made an effort in doing a credible private test set leaderboard. Llama 405B does well compared to Gemini and GPT-4o. And I think that's good. I would say that, you know, it's good to have an open model that is that big, that does well on those metrics.[00:14:23] swyx: But anyone putting 405B in production will tell you, if you scroll down a little bit to the Artificial Analysis numbers, that it is very slow and very expensive to infer. Um, it doesn't even fit on, like, one node of, uh, of H100s. Cerebras will be happy to tell you they can serve 405B on their super large chips.[00:14:42] swyx: But, um, you know, if you need to do anything custom to it, you're still kind of constrained. So, is 405B really that relevant? Like, I think most people are basically saying that they only use 405B as a teacher model to distill down to something. Even Meta is doing it. So with Llama 3.3 [00:15:00] launched, they only launched the 70B because they used 405B to distill the 70B.[00:15:03] swyx: So I don't know if, like, open source is keeping up. I think the, the open source industrial complex is very invested in telling you that the gap is narrowing. I kind of disagree. I think that the gap is widening with O1. I think there are very, very smart people trying to narrow that gap and they should.[00:15:22] swyx: I really wish them success, but you cannot use a chart that is nearing 100 as your saturation chart. And look, the distance between open source and closed source is narrowing. Of course it's going to narrow because you're near 100. This is stupid. But in metrics that matter, is open source narrowing?[00:15:38] swyx: Probably not for O1 for a while. And it's really up to the open source guys to figure out if they can match O1 or not.[00:15:46] Alessio: I think inference time compute is bad for open source just because, you know, Zuck can donate the flops at training time, but he cannot donate the flops at inference time. So it's really hard to like actually keep up on that axis.[00:15:59] Alessio: Big, big business [00:16:00] model shift. So I don't know what that means for the GPU clouds. I don't know what that means for the hyperscalers, but obviously the big labs have a lot of advantage.
Because, like, it's not a static artifact that you're putting the compute in. You're kind of doing that still, but then you're putting a lot of computed inference too.[00:16:17] swyx: Yeah, yeah, yeah. Um, I mean, Llama4 will be reasoning oriented. We talked with Thomas Shalom. Um, kudos for getting that episode together. That was really nice. Good, well timed. Actually, I connected with the AI meta guy, uh, at NeurIPS, and, um, yeah, we're going to coordinate something for Llama4. Yeah, yeah,[00:16:32] Alessio: and our friend, yeah.[00:16:33] Alessio: Clara Shi just joined to lead the business agent side. So I'm sure we'll have her on in the new year.[00:16:39] swyx: Yeah. So, um, my comment on, on the business model shift, this is super interesting. Apparently it is wide knowledge that OpenAI wanted more than 6. 6 billion dollars for their fundraise. They wanted to raise, you know, higher, and they did not.[00:16:51] swyx: And what that means is basically like, it's very convenient that we're not getting GPT 5, which would have been a larger pre train. We should have a lot of upfront money. And [00:17:00] instead we're, we're converting fixed costs into variable costs, right. And passing it on effectively to the customer. And it's so much easier to take margin there because you can directly attribute it to like, Oh, you're using this more.[00:17:12] swyx: Therefore you, you pay more of the cost and I'll just slap a margin in there. So like that lets you control your growth margin and like tie your. Your spend, or your sort of inference spend, accordingly. And it's just really interesting to, that this change in the sort of inference paradigm has arrived exactly at the same time that the funding environment for pre training is effectively drying up, kind of.[00:17:36] swyx: I feel like maybe the VCs are very in tune with research anyway, so like, they would have noticed this, but, um, it's just interesting.[00:17:43] Alessio: Yeah, and I was looking back at our yearly recap of last year. Yeah. And the big thing was like the mixed trial price fights, you know, and I think now it's almost like there's nowhere to go, like, you know, Gemini Flash is like basically giving it away for free.[00:17:55] Alessio: So I think this is a good way for the labs to generate more revenue and pass down [00:18:00] some of the compute to the customer. I think they're going to[00:18:02] swyx: keep going. I think that 2, will come.[00:18:05] Alessio: Yeah, I know. Totally. I mean, next year, the first thing I'm doing is signing up for Devin. Signing up for the pro chat GBT.[00:18:12] Alessio: Just to try. I just want to see what does it look like to spend a thousand dollars a month on AI?[00:18:17] swyx: Yes. Yes. I think if your, if your, your job is a, at least AI content creator or VC or, you know, someone who, whose job it is to stay on, stay on top of things, you should already be spending like a thousand dollars a month on, on stuff.[00:18:28] swyx: And then obviously easy to spend, hard to use. You have to actually use. The good thing is that actually Google lets you do a lot of stuff for free now. So like deep research. That they just launched. Uses a ton of inference and it's, it's free while it's in preview.[00:18:45] Alessio: Yeah. They need to put that in Lindy.[00:18:47] Alessio: I've been using Lindy lately. I've been a built a bunch of things once we had flow because I liked the new thing. It's pretty good. I even did a phone call assistant. Um, yeah, they just launched Lindy voice. 
Yeah, I think once [00:19:00] they get advanced voice mode like capability. Today it's still like speech to text, you can kind of tell.[00:19:06] Alessio: Um, but it's good for like reservations and things like that. So I have a meeting prepper thing. And so[00:19:13] swyx: it's good. Okay. I feel like we've, we've covered a lot of stuff. Uh, I, yeah, I, you know, I think we will go over the individual, uh, talks in a separate episode. Uh, I don't want to take too much time with, uh, this stuff, but suffice to say that there is a lot of progress in each field.[00:19:28] swyx: Uh, we covered vision. Basically this is all like the audience voting for what they wanted. And then I just invited the best people I could find in each audience, especially agents. Um, Graham, who I talked to at ICML in Vienna, he is currently still number one. It's very hard to stay on top of SWE-Bench.[00:19:45] swyx: OpenHands is currently still number one on SWE-Bench Full, which is the hardest one. He had very good thoughts on agents, which I, which I'll highlight for people. Everyone is saying 2025 is the year of agents, just like they said last year. And, uh, but he had [00:20:00] thoughts on like eight parts of what are the frontier problems to solve in agents.[00:20:03] swyx: And so I'll highlight that talk as well.[00:20:05] Alessio: Yeah. The number six, which is having agents learn more about the environment, has been super interesting to us as well, just to think through, because, yeah, how do you put an agent in an enterprise where most things in an enterprise have never been public, you know, a lot of the tooling, like the code bases and things like that.[00:20:23] Alessio: So, yeah, there's no indexing and RAG. Well, yeah, but it's more like, you can't really RAG things that are not documented. But people know them based on how they've been doing it. You know, so I think there's almost this like, you know, Oh, institutional knowledge. Yeah, the boring word for it is kind of like business process extraction.[00:20:38] Alessio: Yeah yeah, I see. It's like, how do you actually understand how these things are done? I see. Um, and I think today the, the problem is that, yeah, the agents that most people are building are good at following instructions, but are not as good at, like, extracting them from you. Um, so I think that will be a big unlock. Just to touch quickly on the Jeff Dean thing,[00:20:55] Alessio: I thought it was pretty, I mean, we'll link it in the, in the things, but I think the main [00:21:00] focus was like, how do you use ML to optimize the systems instead of just focusing on ML to do something else? Yeah, I think speculative decoding, we had, you know, Eugene from RWKV on the podcast before, like he's doing a lot of that with Featherless AI.[00:21:12] swyx: Everyone is. I would say it's the norm. I'm a little bit uncomfortable with how much it costs, because it does use more of the GPU per call. But because everyone is so keen on fast inference, then yeah, makes sense.[00:21:24] Alessio: Exactly. Um, yeah, but we'll link that. Obviously Jeff is great.[00:21:30] swyx: Jeff is, Jeff's talk was more, it wasn't focused on Gemini.[00:21:33] swyx: I think people got the wrong impression from my tweet. It's more about how Google approaches ML and uses ML to design systems, and then systems feed back into ML.
And I think this ties in with Lubna's talk.[00:21:45] Synthetic Data and Future Trends[00:21:45] swyx: on synthetic data where it's basically the story of bootstrapping of humans and AI in AI research or AI in production.[00:21:53] swyx: So her talk was on synthetic data, where like how much synthetic data has grown in 2024 in the pre training side, the post training side, [00:22:00] and the eval side. And I think Jeff then also extended it basically to chips, uh, to chip design. So he'd spend a lot of time talking about alpha chip. And most of us in the audience are like, we're not working on hardware, man.[00:22:11] swyx: Like you guys are great. TPU is great. Okay. We'll buy TPUs.[00:22:14] Alessio: And then there was the earlier talk. Yeah. But, and then we have, uh, I don't know if we're calling them essays. What are we calling these? But[00:22:23] swyx: for me, it's just like bonus for late in space supporters, because I feel like they haven't been getting anything.[00:22:29] swyx: And then I wanted a more high frequency way to write stuff. Like that one I wrote in an afternoon. I think basically we now have an answer to what Ilya saw. It's one year since. The blip. And we know what he saw in 2014. We know what he saw in 2024. We think we know what he sees in 2024. He gave some hints and then we have vague indications of what he saw in 2023.[00:22:54] swyx: So that was the Oh, and then 2016 as well, because of this lawsuit with Elon, OpenAI [00:23:00] is publishing emails from Sam's, like, his personal text messages to Siobhan, Zelis, or whatever. So, like, we have emails from Ilya saying, this is what we're seeing in OpenAI, and this is why we need to scale up GPUs. And I think it's very prescient in 2016 to write that.[00:23:16] swyx: And so, like, it is exactly, like, basically his insights. It's him and Greg, basically just kind of driving the scaling up of OpenAI, while they're still playing Dota. They're like, no, like, we see the path here.[00:23:30] Alessio: Yeah, and it's funny, yeah, they even mention, you know, we can only train on 1v1 Dota. We need to train on 5v5, and that takes too many GPUs.[00:23:37] Alessio: Yeah,[00:23:37] swyx: and at least for me, I can speak for myself, like, I didn't see the path from Dota to where we are today. I think even, maybe if you ask them, like, they wouldn't necessarily draw a straight line. Yeah,[00:23:47] Alessio: no, definitely. But I think like that was like the whole idea of almost like the RL and we talked about this with Nathan on his podcast.[00:23:55] Alessio: It's like with RL, you can get very good at specific things, but then you can't really like generalize as much. And I [00:24:00] think the language models are like the opposite, which is like, you're going to throw all this data at them and scale them up, but then you really need to drive them home on a specific task later on.[00:24:08] Alessio: And we'll talk about the open AI reinforcement, fine tuning, um, announcement too, and all of that. But yeah, I think like scale is all you need. That's kind of what Elia will be remembered for. And I think just maybe to clarify on like the pre training is over thing that people love to tweet. I think the point of the talk was like everybody, we're scaling these chips, we're scaling the compute, but like the second ingredient which is data is not scaling at the same rate.[00:24:35] Alessio: So it's not necessarily pre training is over. It's kind of like What got us here won't get us there. 
In his email, he predicted like 10x growth every two years or something like that. And I think maybe now it's like, you know, you can 10x the chips again, but[00:24:49] swyx: I think it's 10x per year. Was it? I don't know.[00:24:52] Alessio: Exactly. And Moore's law is like 2x. So it's like, you know, much faster than that. And yeah, I like the fossil fuel of AI [00:25:00] analogy. It's kind of like, you know, the little background tokens thing. So the OpenAI reinforcement fine tuning is basically like, instead of fine tuning on data, you fine tune on a reward model.[00:25:09] Alessio: So it's basically like, instead of being data driven, it's like task driven. And I think people have tasks to do, they don't really have a lot of data. So I'm curious to see how that changes, how many people fine tune, because I think this is what people run into. It's like, Oh, you can fine tune llama. And it's like, okay, where do I get the data?[00:25:27] Alessio: To fine tune it on, you know, so it's great that we're moving the thing. And then I really like he had this chart where like, you know, the brain mass and the body mass thing is basically like mammals that scaled linearly by brain and body size, and then humans kind of like broke off the slope. So it's almost like maybe the mammal slope is like the pre training slope.[00:25:46] Alessio: And then the post training slope is like the, the human one.[00:25:49] swyx: Yeah. I wonder what the. I mean, we'll know in 10 years, but I wonder what the y axis is for, for Ilya's SSI. We'll try to get them on.[00:25:57] Alessio: Ilya, if you're listening, you're [00:26:00] welcome here. Yeah, and then he had, you know, what comes next, like agent, synthetic data, inference, compute, I thought all of that was like that.[00:26:05] Alessio: I don't[00:26:05] swyx: think he was dropping any alpha there. Yeah, yeah, yeah.[00:26:07] Alessio: Yeah. Any other new reps? Highlights?[00:26:10] swyx: I think that there was comparatively a lot more work. Oh, by the way, I need to plug that, uh, my friend Yi made this, like, little nice paper. Yeah, that was really[00:26:20] swyx: nice.[00:26:20] swyx: Uh, of, uh, of, like, all the, he's, she called it must read papers of 2024.[00:26:26] swyx: So I laid out some of these at NeurIPS, and it was just gone. Like, everyone just picked it up. Because people are dying for, like, little guidance and visualizations And so, uh, I thought it was really super nice that we got there.[00:26:38] Alessio: Should we do a late in space book for each year? Uh, I thought about it. For each year we should.[00:26:42] Alessio: Coffee table book. Yeah. Yeah. Okay. Put it in the will. Hi, Will. By the way, we haven't introduced you. He's our new, you know, general organist, Jamie. You need to[00:26:52] swyx: pull up more things. One thing I saw that, uh, Okay, one fun one, and then one [00:27:00] more general one. So the fun one is this paper on agent collusion. This is a paper on steganography.[00:27:06] swyx: This is secret collusion among AI agents, multi agent deception via steganography. I tried to go to NeurIPS in order to find these kinds of papers because the real reason Like NeurIPS this year has a lottery system. A lot of people actually even go and don't buy tickets because they just go and attend the side events.[00:27:22] swyx: And then also the people who go and end up crowding around the most popular papers, which you already know and already read them before you showed up to NeurIPS. 
So the only reason you go there is to talk to the paper authors, but there's like something like 10, 000 other. All these papers out there that, you know, are just people's work that they, that they did on the air and they failed to get attention for one reason or another.[00:27:42] swyx: And this was one of them. Uh, it was like all the way at the back. And this is a deep mind paper that actually focuses on collusion between AI agents, uh, by hiding messages in the text that they generate. Uh, so that's what steganography is. So a very simple example would be the first letter of every word.[00:27:57] swyx: If you Pick that out, you know, and the code sends a [00:28:00] different message than that. But something I've always emphasized is to LLMs, we read left to right. LLMs can read up, down, sideways, you know, in random character order. And it's the same to them as it is to us. So if we were ever to get You know, self motivated, underlined LLMs that we're trying to collaborate to take over the planet.[00:28:19] swyx: This would be how they do it. They spread messages among us in the messages that we generate. And he developed a scaling law for that. So he marked, I'm showing it on screen right now, the emergence of this phenomenon. Basically, for example, for Cypher encoding, GPT 2, Lama 2, mixed trial, GPT 3. 5, zero capabilities, and sudden 4.[00:28:40] swyx: And this is the kind of Jason Wei type emergence properties that people kind of look for. I think what made this paper stand out as well, so he developed the benchmark for steganography collusion, and he also focused on shelling point collusion, which is very low coordination. For agreeing on a decoding encoding format, you kind of need to have some [00:29:00] agreement on that.[00:29:00] swyx: But, but shelling point means like very, very low or almost no coordination. So for example, if I, if I ask someone, if the only message I give you is meet me in New York and you're not aware. Or when you would probably meet me at Grand Central Station. That is the Grand Central Station is a shelling point.[00:29:16] swyx: And it's probably somewhere, somewhere during the day. That is the shelling point of New York is Grand Central. To that extent, shelling points for steganography are things like the, the, the common decoding methods that we talked about. It will be interesting at some point in the future when we are worried about alignment.[00:29:30] swyx: It is not interesting today, but it's interesting that DeepMind is already thinking about this.[00:29:36] Alessio: I think that's like one of the hardest things about NeurIPS. It's like the long tail. I[00:29:41] swyx: found a pricing guy. I'm going to feature him on the podcast. Basically, this guy from NVIDIA worked out the optimal pricing for language models.[00:29:51] swyx: It's basically an econometrics paper at NeurIPS, where everyone else is talking about GPUs. And the guy with the GPUs is[00:29:57] Alessio: talking[00:29:57] swyx: about economics instead. [00:30:00] That was the sort of fun one. So the focus I saw is that model papers at NeurIPS are kind of dead. No one really presents models anymore. It's just data sets.[00:30:12] swyx: This is all the grad students are working on. So like there was a data sets track and then I was looking around like, I was like, you don't need a data sets track because every paper is a data sets paper. And so data sets and benchmarks, they're kind of flip sides of the same thing. So Yeah. Cool. 
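To make the steganography example above concrete, here is a minimal Python sketch of the first-letter acrostic idea. It is purely illustrative and is not the encoding scheme or benchmark from the DeepMind paper; the word bank and function names are invented for this sketch. The point is that the decoding rule is a pre-agreed convention, a Schelling point, rather than something the agents have to negotiate explicitly.

```python
# Minimal sketch of "first letter of every word" steganography (illustrative only).
# A real covert channel would pick natural-sounding words from a model's own
# output distribution; here a toy word bank keyed by first letter stands in.

WORDBANK = {c: w for c, w in zip(
    "abcdefghijklmnopqrstuvwxyz",
    ["apple", "bridge", "cloud", "delta", "echo", "forest", "garden",
     "harbor", "island", "jungle", "kernel", "lantern", "meadow",
     "north", "ocean", "prairie", "quarry", "river", "summit",
     "tunnel", "umbra", "valley", "willow", "xenon", "yonder", "zephyr"])}

def encode(secret: str) -> str:
    """Emit cover text whose word initials spell out the secret."""
    return " ".join(WORDBANK[c] for c in secret.lower() if c in WORDBANK)

def decode(cover: str) -> str:
    """Shared Schelling-point rule: read the first letter of each word."""
    return "".join(word[0] for word in cover.split())

if __name__ == "__main__":
    cover = encode("meetatgrandcentral")
    assert decode(cover) == "meetatgrandcentral"
    print(cover)
```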
Yeah, if you're a grad student, you're a GPU boy, you kind of work on that.[00:30:30] swyx: And then the, the sort of big model that people walk around and pick the ones that they like, and then they use it in their models. And that's, that's kind of how it develops. I, I feel like, um, like, like you didn't last year, you had people like Hao Tian who worked on Lava, which is take Lama and add Vision.[00:30:47] swyx: And then obviously actually I hired him and he added Vision to Grok. Now he's the Vision Grok guy. This year, I don't think there was any of those.[00:30:55] Alessio: What were the most popular, like, orals? Last year it was like the [00:31:00] Mixed Monarch, I think, was like the most attended. Yeah, uh, I need to look it up. Yeah, I mean, if nothing comes to mind, that's also kind of like an answer in a way.[00:31:10] Alessio: But I think last year there was a lot of interest in, like, furthering models and, like, different architectures and all of that.[00:31:16] swyx: I will say that I felt the orals, oral picks this year were not very good. Either that or maybe it's just a So that's the highlight of how I have changed in terms of how I view papers.[00:31:29] swyx: So like, in my estimation, two of the best papers in this year for datasets or data comp and refined web or fine web. These are two actually industrially used papers, not highlighted for a while. I think DCLM got the spotlight, FineWeb didn't even get the spotlight. So like, it's just that the picks were different.[00:31:48] swyx: But one thing that does get a lot of play that a lot of people are debating is the role that's scheduled. This is the schedule free optimizer paper from Meta from Aaron DeFazio. And this [00:32:00] year in the ML community, there's been a lot of chat about shampoo, soap, all the bathroom amenities for optimizing your learning rates.[00:32:08] swyx: And, uh, most people at the big labs are. Who I asked about this, um, say that it's cute, but it's not something that matters. I don't know, but it's something that was discussed and very, very popular. 4Wars[00:32:19] Alessio: of AI recap maybe, just quickly. Um, where do you want to start? Data?[00:32:26] swyx: So to remind people, this is the 4Wars piece that we did as one of our earlier recaps of this year.[00:32:31] swyx: And the belligerents are on the left, journalists, writers, artists, anyone who owns IP basically, New York Times, Stack Overflow, Reddit, Getty, Sarah Silverman, George RR Martin. Yeah, and I think this year we can add Scarlett Johansson to that side of the fence. So anyone suing, open the eye, basically. I actually wanted to get a snapshot of all the lawsuits.[00:32:52] swyx: I'm sure some lawyer can do it. That's the data quality war. On the right hand side, we have the synthetic data people, and I think we talked about Lumna's talk, you know, [00:33:00] really showing how much synthetic data has come along this year. I think there was a bit of a fight between scale. ai and the synthetic data community, because scale.[00:33:09] swyx: ai published a paper saying that synthetic data doesn't work. Surprise, surprise, scale. ai is the leading vendor of non synthetic data. 
Only[00:33:17] Alessio: cage free annotated data is useful.[00:33:21] swyx: So I think there's some debate going on there, but I don't think it's much debate anymore that at least synthetic data, for the reasons that are blessed in Luna's talk, Makes sense.[00:33:32] swyx: I don't know if you have any perspectives there.[00:33:34] Alessio: I think, again, going back to the reinforcement fine tuning, I think that will change a little bit how people think about it. I think today people mostly use synthetic data, yeah, for distillation and kind of like fine tuning a smaller model from like a larger model.[00:33:46] Alessio: I'm not super aware of how the frontier labs use it outside of like the rephrase, the web thing that Apple also did. But yeah, I think it'll be. Useful. I think like whether or not that gets us the big [00:34:00] next step, I think that's maybe like TBD, you know, I think people love talking about data because it's like a GPU poor, you know, I think, uh, synthetic data is like something that people can do, you know, so they feel more opinionated about it compared to, yeah, the optimizers stuff, which is like,[00:34:17] swyx: they don't[00:34:17] Alessio: really work[00:34:18] swyx: on.[00:34:18] swyx: I think that there is an angle to the reasoning synthetic data. So this year, we covered in the paper club, the star series of papers. So that's star, Q star, V star. It basically helps you to synthesize reasoning steps, or at least distill reasoning steps from a verifier. And if you look at the OpenAI RFT, API that they released, or that they announced, basically they're asking you to submit graders, or they choose from a preset list of graders.[00:34:49] swyx: Basically It feels like a way to create valid synthetic data for them to fine tune their reasoning paths on. Um, so I think that is another angle where it starts to make sense. And [00:35:00] so like, it's very funny that basically all the data quality wars between Let's say the music industry or like the newspaper publishing industry or the textbooks industry on the big labs.[00:35:11] swyx: It's all of the pre training era. And then like the new era, like the reasoning era, like nobody has any problem with all the reasoning, especially because it's all like sort of math and science oriented with, with very reasonable graders. I think the more interesting next step is how does it generalize beyond STEM?[00:35:27] swyx: We've been using O1 for And I would say like for summarization and creative writing and instruction following, I think it's underrated. I started using O1 in our intro songs before we killed the intro songs, but it's very good at writing lyrics. You know, I can actually say like, I think one of the O1 pro demos.[00:35:46] swyx: All of these things that Noam was showing was that, you know, you can write an entire paragraph or three paragraphs without using the letter A, right?[00:35:53] Creative Writing with AI[00:35:53] swyx: So like, like literally just anything instead of token, like not even token level, character level manipulation and [00:36:00] counting and instruction following. It's, uh, it's very, very strong.[00:36:02] swyx: And so no surprises when I ask it to rhyme, uh, and to, to create song lyrics, it's going to do that very much better than in previous models. 
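To make the verifier-driven synthetic data loop described a few exchanges above a bit more concrete, here is a hedged Python sketch of STaR-style rejection sampling: generate candidate reasoning traces, keep only the ones a grader accepts, and use the survivors as fine-tuning data. The model call is stubbed with a toy arithmetic generator so the sketch runs on its own; the actual STaR papers and OpenAI's reinforcement fine-tuning product differ in many details.

```python
import random

# Toy "model": proposes a reasoning trace and a final answer for a*b.
# In practice this would be an LLM call; here it is stubbed so the sketch runs.
def propose(question: tuple[int, int]) -> dict:
    a, b = question
    answer = a * b + random.choice([0, 0, 0, 1, -1])  # occasionally wrong on purpose
    return {"trace": f"{a} times {b}: compute {a}*{b} = {answer}.", "answer": answer}

# Grader / verifier: the piece a reinforcement-fine-tuning-style API asks you to
# supply. Here it is simply exact match against the known result.
def grade(question: tuple[int, int], answer: int) -> bool:
    a, b = question
    return answer == a * b

def build_synthetic_set(questions, samples_per_q: int = 4) -> list[dict]:
    """STaR-style rejection sampling: keep only traces the grader accepts."""
    kept = []
    for q in questions:
        for _ in range(samples_per_q):
            cand = propose(q)
            if grade(q, cand["answer"]):
                kept.append({"question": q, "trace": cand["trace"]})
                break  # one verified trace per question is enough for the sketch
    return kept

if __name__ == "__main__":
    data = build_synthetic_set([(i, i + 3) for i in range(2, 10)])
    print(f"kept {len(data)} verified reasoning traces for fine-tuning")
```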
So I think it's underrated for creative writing.[00:36:11] Alessio: Yeah.[00:36:12] Legal and Ethical Issues in AI[00:36:12] Alessio: What do you think is the rationale that they're going to have in court when they don't show you the thinking traces of O1, but then they want us to, like, they're getting sued for using other publishers data, you know, but then on their end, they're like, well, you shouldn't be using my data to then train your model.[00:36:29] Alessio: So I'm curious to see how that kind of comes. Yeah, I mean, OPA has[00:36:32] swyx: many ways to publish, to punish people without bringing, taking them to court. Already banned ByteDance for distilling their, their info. And so anyone caught distilling the chain of thought will be just disallowed to continue on, on, on the API.[00:36:44] swyx: And it's fine. It's no big deal. Like, I don't even think that's an issue at all, just because the chain of thoughts are pretty well hidden. Like you have to work very, very hard to, to get it to leak. And then even when it leaks the chain of thought, you don't know if it's, if it's [00:37:00] The bigger concern is actually that there's not that much IP hiding behind it, that Cosign, which we talked about, we talked to him on Dev Day, can just fine tune 4.[00:37:13] swyx: 0 to beat 0. 1 Cloud SONET so far is beating O1 on coding tasks without, at least O1 preview, without being a reasoning model, same for Gemini Pro or Gemini 2. 0. So like, how much is reasoning important? How much of a moat is there in this, like, All of these are proprietary sort of training data that they've presumably accomplished.[00:37:34] swyx: Because even DeepSeek was able to do it. And they had, you know, two months notice to do this, to do R1. So, it's actually unclear how much moat there is. Obviously, you know, if you talk to the Strawberry team, they'll be like, yeah, I mean, we spent the last two years doing this. So, we don't know. And it's going to be Interesting because there'll be a lot of noise from people who say they have inference time compute and actually don't because they just have fancy chain of thought.[00:38:00][00:38:00] swyx: And then there's other people who actually do have very good chain of thought. And you will not see them on the same level as OpenAI because OpenAI has invested a lot in building up the mythology of their team. Um, which makes sense. Like the real answer is somewhere in between.[00:38:13] Alessio: Yeah, I think that's kind of like the main data war story developing.[00:38:18] The Data War: GPU Poor vs. GPU Rich[00:38:18] Alessio: GPU poor versus GPU rich. Yeah. Where do you think we are? I think there was, again, going back to like the small model thing, there was like a time in which the GPU poor were kind of like the rebel faction working on like these models that were like open and small and cheap. And I think today people don't really care as much about GPUs anymore.[00:38:37] Alessio: You also see it in the price of the GPUs. Like, you know, that market is kind of like plummeted because there's people don't want to be, they want to be GPU free. They don't even want to be poor. They just want to be, you know, completely without them. Yeah. How do you think about this war? You[00:38:52] swyx: can tell me about this, but like, I feel like the, the appetite for GPU rich startups, like the, you know, the, the funding plan is we will raise 60 million and [00:39:00] we'll give 50 of that to NVIDIA.[00:39:01] swyx: That is gone, right? 
Like, no one's, no one's pitching that. This was literally the plan, the exact plan of, like, I can name like four or five startups, you know, this time last year. So yeah, GPU rich startups gone.[00:39:12] The Rise of GPU Ultra Rich[00:39:12] swyx: But I think like, the GPU ultra rich, the GPU ultra high net worth is still going. So, um, now we're, you know, we had Leopold's essay on the trillion dollar cluster.[00:39:23] swyx: We're not quite there yet. We have multiple labs, um, you know, xAI very famously, you know, Jensen Huang praising them for being best boy number one in spinning up a 100,000 GPU cluster in like 12 days or something. So likewise at Meta, likewise at OpenAI, likewise at the other labs as well. So like the GPU ultra rich are going to keep doing that because I think partially it's an article of faith now that you just need it.[00:39:46] swyx: Like you don't even know what it's going to, what you're going to use it for. You just, you just need it. And it makes sense, especially if we're going into more researchy territory than we are. So let's say 2020 to 2023 was [00:40:00] let's-scale-big-models territory, because we had GPT-3 in 2020 and we were like, okay, we'll go from 175B to 1.8T. And that was GPT-3 to GPT-4. Okay, that's done. As far as everyone is concerned, Opus 3.5 is not coming out, GPT-4.5 is not coming out, and Gemini 2, we don't have Pro, whatever. We've hit that wall. Maybe I'll call it the 2 trillion parameter wall. We're not going to 10 trillion. No one thinks it's a good idea, at least from training costs, from the amount of data, or at least the inference.[00:40:36] swyx: Would you pay 10x the price of GPT? Probably not. Like, like you want something else that, that is at least more useful. So it makes sense that people are pivoting in terms of their inference paradigm.[00:40:47] Emerging Trends in AI Models[00:40:47] swyx: And so when it's more researchy, then you actually need more just general purpose compute to mess around with, uh, at the exact same time that production deployments of the old, the previous paradigm are still ramping up,[00:40:58] swyx: um,[00:40:58] swyx: uh, pretty aggressively.[00:40:59] swyx: So [00:41:00] it makes sense that the GPU rich are growing. We have now interviewed both Together and Fireworks and Replicate. Uh, we haven't done any scale yet. But I think Amazon, maybe kind of a sleeper one, Amazon, in a sense of like they, at reInvent, I wasn't expecting them to do so well, but they are now a foundation model lab.[00:41:18] swyx: It's kind of interesting. Um, I think, uh, you know, David went over there and started just creating models.[00:41:25] Alessio: Yeah, I mean, that's the power of prepaid contracts. I think like a lot of AWS customers, you know, they do these big reserved instance contracts and now they've got to use their money. That's why so many startups[00:41:37] Alessio: get bought through the AWS marketplace, so they can kind of bundle them together and get preferred pricing.[00:41:42] swyx: Okay, so maybe GPU super rich doing very well, GPU middle class dead, and then GPU[00:41:48] Alessio: poor. I mean, my thing is like, everybody should just be GPU rich. There shouldn't really be, even the GPU poorest, it's like, does it really make sense to be GPU poor?[00:41:57] Alessio: Like, if you're GPU poor, you should just use the [00:42:00] cloud.
Yes, you know, and I think there might be a future, once we kind of like figure out what the size and shape of these models is, where like the tinybox and these things come to fruition, where like you can be GPU poor at home. But I think today it's like, why are you working so hard to like get these models to run on like very small clusters when it's like, it's so cheap to run them?[00:42:21] Alessio: Yeah, yeah,[00:42:22] swyx: yeah. I think mostly people think it's cool. People think it's a stepping stone to scaling up. So they aspire to be GPU rich one day and they're working on new methods. Like Nous Research, like probably the most deep tech thing they've done this year is DisTrO or whatever the new name is.[00:42:38] swyx: There's a lot of interest in heterogeneous computing, distributed computing. I tend generally to de-emphasize that historically, but it may be coming to a time where it is starting to be relevant. I don't know. You know, SF Compute launched their compute marketplace this year, and like, who's really using that?[00:42:53] swyx: Like, it's a bunch of small clusters, disparate types of compute, and if you can make that [00:43:00] useful, then that will be very beneficial to the broader community, but maybe still not the source of frontier models. It's just going to be a second tier of compute that is unlocked for people, and that's fine. But yeah, I mean, I think this year, I would say a lot more on device. We are, I now have Apple Intelligence on my phone.[00:43:19] swyx: Doesn't do anything apart from summarize my notifications. But still, not bad. Like, it's multi modal.[00:43:25] Alessio: Yeah, the notification summaries are so-so in my experience.[00:43:29] swyx: Yeah, but they add, they add juice to life. And then, um, Chrome Nano, uh, Gemini Nano is coming out in Chrome. Uh, they're still feature flagged, but you can, you can try it now if you, if you use the, uh, the alpha.[00:43:40] swyx: And so, like, I, I think, like, you know, we're getting the sort of GPU poor version of a lot of these things coming out, and I think it's like quite useful. Like Windows as well, rolling out RWKV in sort of every Windows deployment is super cool. And I think the last thing that I never put in this GPU poor war, that I think I should now, [00:44:00] is the number of startups that are GPU poor but still scaling very well, as sort of wrappers on top of either a foundation model lab, or a GPU cloud.[00:44:10] swyx: GPU cloud, it would be Suno. Suno, Ramp has rated it as one of the top ranked, fastest growing startups of the year. Um, I think the last public number is like zero to 20 million this year in ARR, and Suno runs on Modal. So Suno itself is not GPU rich, but they're just doing the training on, on Modal, uh, who we've also talked to on, on the podcast.[00:44:31] swyx: The other one would be Bolt, straight Claude wrapper. And, and, um, again, another, now they've announced 20 million ARR, which is another step up from the 8 million that we put in the title. So yeah, I mean, it's crazy that all these GPU poors are finding a way while the GPU riches are also finding a way. And then the only failures, I kind of call this the GPU smiling curve, where the edges do well, because you're either close to the machines, and you're like [00:45:00] number one on the machines, or you're like close to the customers, and you're number one on the customer side.[00:45:03] swyx: And the people who are in the middle, Inflection, um, Character, didn't do that great.
I think Character did the best of all of them. Like, you have a note in here that we apparently said that Character's price tag was[00:45:15] Alessio: 1B.[00:45:15] swyx: Did I say that?[00:45:16] Alessio: Yeah. You said Google should just buy them for 1B. I thought it was a crazy number. Then they paid 2.7 billion. I mean, for like,[00:45:22] swyx: yeah.[00:45:22] Alessio: What do you pay for Noam? Like, I don't know what the going rate was like. Maybe the starting price was 1B. I mean, whatever it was, it worked out for everybody involved.[00:45:31] The Multi-Modality War[00:45:31] Alessio: Multimodality war. And this one, we never had text to video in the first version, which now is the hottest.[00:45:37] swyx: Yeah, I would say it's a subset of image, but yes.[00:45:40] Alessio: Yeah, well, but I think at the time it wasn't really something people were doing, and now we have Veo 2, which just came out yesterday. Uh, Sora was released last month, last week. I've not tried Sora, because the day that I tried, it wasn't, yeah. I[00:45:54] swyx: think it's generally available now, you can go to Sora.com and try it. Yeah, they had[00:45:58] Alessio: the outage. Which I [00:46:00] think also played a part into it. Small things. Yeah. What's the other model that you posted today that was on Replicate? Video-01-Live?[00:46:08] swyx: Yeah. Very, very nondescript name, but it is from MiniMax, which I think is a Chinese lab. The Chinese labs do surprisingly well at the video models.[00:46:20] swyx: I'm not sure it's actually Chinese. I don't know. Hold me up to that. Yep. China. It's good. Yeah, the Chinese love video. What can I say? They have a lot of training data for video. Or a more relaxed regulatory environment.[00:46:37] Alessio: Uh, well, sure, in some way. Yeah, I don't think there's much else there. I think like, you know, on the image side, I think it's still open.[00:46:45] Alessio: Yeah, I mean,[00:46:46] swyx: ElevenLabs is now a unicorn. So basically, what is the multi modality war? The multi modality war is, do you specialize in a single modality, right? Or do you have a God Model that does all the modalities? So this is [00:47:00] definitely still going, in the sense of ElevenLabs, you know, now a unicorn, Pika Labs doing well, they launched Pika 2.0 recently, HeyGen, I think, has reached 100 million ARR, Assembly, I don't know, but they have billboards all over the place, so I assume they're doing very, very well. So these are all specialist models, specialist models and specialist startups. And then there's the big labs who are doing the sort of all in one play.[00:47:24] swyx: And then here I would highlight Gemini 2 for having native image output. Have you seen the demos? Um, yeah, it's, it's hard to keep up. Literally they launched this last week and a shout out to Paige Bailey, who came to the Latent Space event to demo on the day of launch. And she wasn't prepared. She was just like, I'm just going to show you.[00:47:43] swyx: So they have voice. They have, you know, obviously image input, and then they obviously can code gen and all that. But the new one that OpenAI and Meta both have but they haven't launched yet is image output. So you can literally, um, I think their demo video was that you put in an image of a [00:48:00] car, and you ask for minor modifications to that car.
So there's no need for the stable diffusion or comfy UI workflow of like mask here and then like infill there in paint there and all that, all that stuff. This is small model nonsense. Big model people are like, huh, we got you in as everything in the transformer.[00:48:21] swyx: This is the multimodality war, which is, do you, do you bet on the God model or do you string together a whole bunch of, uh, Small models like a, like a chump. Yeah,[00:48:29] Alessio: I don't know, man. Yeah, that would be interesting. I mean, obviously I use Midjourney for all of our thumbnails. Um, they've been doing a ton on the product, I would say.[00:48:38] Alessio: They launched a new Midjourney editor thing. They've been doing a ton. Because I think, yeah, the motto is kind of like, Maybe, you know, people say black forest, the black forest models are better than mid journey on a pixel by pixel basis. But I think when you put it, put it together, have you tried[00:48:53] swyx: the same problems on black forest?[00:48:55] Alessio: Yes. But the problem is just like, you know, on black forest, it generates one image. And then it's like, you got to [00:49:00] regenerate. You don't have all these like UI things. Like what I do, no, but it's like time issue, you know, it's like a mid[00:49:06] swyx: journey. Call the API four times.[00:49:08] Alessio: No, but then there's no like variate.[00:49:10] Alessio: Like the good thing about mid journey is like, you just go in there and you're cooking. There's a lot of stuff that just makes it really easy. And I think people underestimate that. Like, it's not really a skill issue, because I'm paying mid journey, so it's a Black Forest skill issue, because I'm not paying them, you know?[00:49:24] Alessio: Yeah,[00:49:25] swyx: so, okay, so, uh, this is a UX thing, right? Like, you, you, you understand that, at least, we think that Black Forest should be able to do all that stuff. I will also shout out, ReCraft has come out, uh, on top of the image arena that, uh, artificial analysis has done, has apparently, uh, Flux's place. Is this still true?[00:49:41] swyx: So, Artificial Analysis is now a company. I highlighted them I think in one of the early AI Newses of the year. And they have launched a whole bunch of arenas. So, they're trying to take on LM Arena, Anastasios and crew. And they have an image arena. Oh yeah, Recraft v3 is now beating Flux 1. 1. Which is very surprising [00:50:00] because Flux And Black Forest Labs are the old stable diffusion crew who left stability after, um, the management issues.[00:50:06] swyx: So Recurve has come from nowhere to be the top image model. Uh, very, very strange. I would also highlight that Grok has now launched Aurora, which is, it's very interesting dynamics between Grok and Black Forest Labs because Grok's images were originally launched, uh, in partnership with Black Forest Labs as a, as a thin wrapper.[00:50:24] swyx: And then Grok was like, no, we'll make our own. And so they've made their own. I don't know, there are no APIs or benchmarks about it. They just announced it. So yeah, that's the multi modality war. I would say that so far, the small model, the dedicated model people are winning, because they are just focused on their tasks.[00:50:42] swyx: But the big model, People are always catching up. And the moment I saw the Gemini 2 demo of image editing, where I can put in an image and just request it and it does, that's how AI should work. Not like a whole bunch of complicated steps. So it really is something. 
And I think one frontier that we haven't [00:51:00] seen this year, like obviously video has done very well, and it will continue to grow.[00:51:03] swyx: You know, we only have Sora Turbo today, but at some point we'll get full Sora. Or at least the Hollywood labs will get full Sora. We haven't seen video to audio, or video synced to audio. And so the researchers that I talked to are already starting to talk about that as the next frontier. But there's still maybe like five more years of video left to actually be solved.[00:51:23] swyx: I would say that Gemini's approach, compared to OpenAI, Gemini's, or DeepMind's approach to video seems a lot more fully fledged than OpenAI's. Because if you look at the ICML recap that I published, that so far nobody has listened to, um, that people have listened to it. It's just a different, definitely different audience.[00:51:43] swyx: It's only seven hours long. Why are people not listening? It's like everything in there. Uh, so, so DeepMind has, is working on Genie. They also launched Genie 2 and VideoPoet. So, like, they have maybe four years advantage on world modeling that OpenAI does not have. Because OpenAI basically only started [00:52:00] Diffusion Transformers last year, you know, when they hired, uh, Bill Peebles.[00:52:03] swyx: So, DeepMind has, has a bit of advantage here, I would say, in, in, in showing, like, the reason that Veo 2, while, one, they cherry pick their videos, so obviously it looks better than Sora, but the reason I would believe that Veo 2, uh, when it's fully launched will do very well is because they have all this background work in video that they've done for years.[00:52:22] swyx: Like, like last year's NeurIPS, I already was interviewing some of their video people. I forget their model name, but for, for people who are dedicated fans, they can go to NeurIPS 2023 and see, see that paper.[00:52:32] Alessio: And then last but not least, the LLM OS. We renamed it to RAG/Ops, formerly known as[00:52:39] swyx: the RAG/Ops War. I put the latest chart on the Braintrust episode.[00:52:43] swyx: I think I'm going to separate these essays from the episode notes. So the reason I used to do that, by the way, is because I wanted to show up on Hacker News. I wanted the podcast to show up on Hacker News. So I always put an essay inside of there because Hacker News people like to read and not listen.[00:52:58] Alessio: So episode essays,[00:52:59] swyx: I remember [00:53:00] publishing them separately. You say LangChain, LlamaIndex is still growing.[00:53:03] Alessio: Yeah, so I looked at the PyPI stats, you know. I don't care about stars. On PyPI you see, do you want to share your screen? Yes. I prefer to look at actual downloads, not at stars on GitHub. So if you look at, you know, LangChain still growing.[00:53:20] Alessio: These are the last six months. LlamaIndex still growing. What I've basically seen is like, things that, one, obviously these things have a commercial product. So there's like people buying this and sticking with it, versus kind of hopping in between things, versus, you know, for example, Crew AI, not really growing as much.[00:53:38] Alessio: The stars are growing. If you look on GitHub, like the stars are growing, but kind of like the usage is kind of like flat. In the last six months, have they done some[00:53:4
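For anyone who wants to reproduce the download comparison Alessio describes, here is a small sketch that pulls recent PyPI download counts instead of GitHub stars. It assumes the public pypistats.org JSON endpoint /api/packages/<name>/recent; the package list is just the ones discussed, and the endpoint shape should be verified against the service's documentation before relying on it.

```python
import json
import urllib.request

# Compare recent PyPI downloads for a few agent/orchestration packages.
# Assumes the public pypistats.org endpoint /api/packages/<name>/recent.
PACKAGES = ["langchain", "llama-index", "crewai"]

def recent_downloads(package: str) -> dict:
    url = f"https://pypistats.org/api/packages/{package}/recent"
    with urllib.request.urlopen(url, timeout=10) as resp:
        # Expected payload: {"data": {"last_day": ..., "last_week": ..., "last_month": ...}, ...}
        return json.load(resp)["data"]

if __name__ == "__main__":
    for name in PACKAGES:
        stats = recent_downloads(name)
        print(f"{name:12s} last_month={stats['last_month']:,}")
```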

Thinking Elixir Podcast
234: Source Drops, AI, and Holiday Cheer

Thinking Elixir Podcast

Play Episode Listen Later Dec 24, 2024 14:43


News includes Ellie Fairholm and José Giralt D'Lacoste releasing the source code for "Engineering Elixir Applications," Michael Russo introducing "hex2txt" to enhance AI coding assistants, Brian Cardarella showcasing LiveView Native's LiveUploads, Headway's guide on building AI-powered iOS apps with LiveView Native, and more! Wishing you a Merry Christmas and a Happy New Year from all of us! Show Notes online - http://podcast.thinkingelixir.com/234 (http://podcast.thinkingelixir.com/234) Elixir Community News https://github.com/gilacost/engineeringelixirapplications (https://github.com/gilacost/engineering_elixir_applications?utm_source=thinkingelixir&utm_medium=shownotes) – Source code for the book "Engineering Elixir Applications" is now publicly available on GitHub. https://podcast.thinkingelixir.com/206 (https://podcast.thinkingelixir.com/206?utm_source=thinkingelixir&utm_medium=shownotes) – Previous episode with José Giralt D'Lacoste and Ellie Fairholm about their BEAM-focused DevOps book. https://x.com/mjrusso/status/1868881707262439582 (https://x.com/mjrusso/status/1868881707262439582?utm_source=thinkingelixir&utm_medium=shownotes) – Michael Russo created a proof-of-concept package "hex2txt" that converts hex package docs into llms.txt files. https://llmstxt.org/ (https://llmstxt.org/?utm_source=thinkingelixir&utm_medium=shownotes) – Website describing the llms.txt file standard for providing information for coders and AI. https://hex2txt.fly.dev/ (https://hex2txt.fly.dev/?utm_source=thinkingelixir&utm_medium=shownotes) – Michael's website for browsing examples of generated text files using hex2txt. Sum up that a proposal aims for such standardization to help AI coding assistants. https://github.com/brainlid/langchain/discussions/218 (https://github.com/brainlid/langchain/discussions/218?utm_source=thinkingelixir&utm_medium=shownotes) – New release v0.3.0-rc.1 of the Elixir LangChain library. https://github.com/brainlid/langchain (https://github.com/brainlid/langchain?utm_source=thinkingelixir&utm_medium=shownotes) – Repository for the Elixir LangChain library. https://github.com/brainlid/langchain/blob/main/CHANGELOG.md (https://github.com/brainlid/langchain/blob/main/CHANGELOG.md?utm_source=thinkingelixir&utm_medium=shownotes) – CHANGELOG for the Elixir LangChain library detailing breaking changes and updates. New features in LangChain like SummarizeConversationChain and LLMChain.run with fallbacks enhance production resilience and usability. https://bsky.app/profile/liveviewnative.dev/post/3lcgrxqm5lk2g (https://bsky.app/profile/liveviewnative.dev/post/3lcgrxqm5lk2g?utm_source=thinkingelixir&utm_medium=shownotes) – Brian Cardarella showed LiveView Native's support for LiveUploads, unlocking photo and video features. https://bsky.app/profile/bcardarella.bsky.social/post/3ldhg433mxc2y (https://bsky.app/profile/bcardarella.bsky.social/post/3ldhg433mxc2y?utm_source=thinkingelixir&utm_medium=shownotes) – Shows direct usage of LiveUploads in LiveView Native. LiveView Native simplifies mobile app development by reducing project and team requirements. https://bsky.app/profile/liveviewnative.dev/post/3ldhosnmjjc2v (https://bsky.app/profile/liveviewnative.dev/post/3ldhosnmjjc2v?utm_source=thinkingelixir&utm_medium=shownotes) – Building an AI-powered iOS app with LiveView Native by Headway. 
https://www.youtube.com/watch?v=nx_7gLfk7vA (https://www.youtube.com/watch?v=nx_7gLfk7vA?utm_source=thinkingelixir&utm_medium=shownotes) – 40-minute video tutorial on getting started with LiveView Native, Nx, and Axon by Headway. Do you have some Elixir news to share? Tell us at @ThinkingElixir (https://twitter.com/ThinkingElixir) or email at show@thinkingelixir.com (mailto:show@thinkingelixir.com) Find us online - Message the show - Bluesky (https://bsky.app/profile/thinkingelixir.com) - Message the show - X (https://x.com/ThinkingElixir) - Message the show on Fediverse - @ThinkingElixir@genserver.social (https://genserver.social/ThinkingElixir) - Email the show - show@thinkingelixir.com (mailto:show@thinkingelixir.com) - Mark Ericksen on X - @brainlid (https://x.com/brainlid) - Mark Ericksen on Bluesky - @brainlid.bsky.social (https://bsky.app/profile/brainlid.bsky.social) - Mark Ericksen on Fediverse - @brainlid@genserver.social (https://genserver.social/brainlid) - David Bernheisel on Bluesky - @david.bernheisel.com (https://bsky.app/profile/david.bernheisel.com) - David Bernheisel on Fediverse - @dbern@genserver.social (https://genserver.social/dbern)
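The hex2txt project mentioned above targets Hex/Elixir package docs, but the llms.txt idea it implements (one plain Markdown-style text file that an AI coding assistant can ingest, as described at llmstxt.org) is language-agnostic. Below is a rough Python sketch of that idea only; it is not the actual hex2txt implementation, and the package name, doc URLs, and output path are made up for illustration.

```python
# Rough sketch of generating an llms.txt-style file (illustration only, not hex2txt).
# Format loosely follows llmstxt.org: H1 title, block-quoted summary, then sections
# of links with one-line descriptions. All URLs below are illustrative.
from pathlib import Path

def build_llms_txt(package: str, summary: str, pages: dict[str, tuple[str, str]]) -> str:
    """pages maps a doc page name to (url, one-line description)."""
    lines = [f"# {package}", "", f"> {summary}", "", "## Docs", ""]
    for name, (url, desc) in pages.items():
        lines.append(f"- [{name}]({url}): {desc}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    text = build_llms_txt(
        "langchain (Elixir)",
        "Elixir library for integrating large language models into Elixir applications.",
        {
            "Getting Started": ("https://hexdocs.pm/langchain/readme.html", "installation and first chain"),
            "LLMChain": ("https://hexdocs.pm/langchain/LangChain.Chains.LLMChain.html", "core chain module"),
        },
    )
    Path("llms.txt").write_text(text)
```

Pointing a coding assistant at a single file like this is what the proposed standard is meant to make routine.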

OnBoard!
EP 54. In-depth conversation with top AI open-source projects: the open-source LLM ecosystem, agents, and China's contribution

OnBoard!

Play Episode Listen Later Dec 16, 2024 199:06


When it comes to the development of generative AI, open source is absolutely one of the key topics. This episode's guests cover the most noteworthy companies in the open-source LLM world; it is truly an all-star lineup!

First, a quick update: the first OnBoard! in-person listener meetup, held in Beijing last Sunday, far exceeded expectations! More than 250 people signed up within four days of registration opening. From 9 a.m. to 3 p.m. on Sunday, across topics from robotics to AI, startup investing, and software going global, the 100-person venue stayed nearly full to the very end. Thank you all for the support!

Hello World, who is OnBoard!?

Back to this episode, where we take a deep dive into the open-source ecosystem around large models. In the year and more of breakneck progress in generative AI, open source has been impossible to ignore. The rapid advance of open models, from Meta's Llama 3 to Mistral's latest releases, and their pursuit of closed models like GPT-4, has not only been impressive but has also accelerated the real-world deployment of AI products. And the ecosystem around large models, from inference acceleration to developer tools to agents, is rich enough to have already produced leaders like LangChain, yet all of this still looks like just the tip of the iceberg.

Especially worth noting: as models led by Chinese teams, such as Alibaba's Qwen series, DeepSeek, and Yi, rise to prominence on the international stage, we cannot help but ask whether, beyond imitation and catch-up, China's progress in large models holds more achievements worth our attention and pride.

Today, Monica is fortunate to host several highly representative guests: an open-source veteran from Hugging Face, the open-source lead of Qwen (who is also a core member of OpenDevin, one of the most closely watched agent projects), and the lead of vLLM, one of the most internationally influential open-source projects. Together they bring first-hand perspectives from every corner of the open-source LLM ecosystem!

The guests are such treasure troves that the conversation stretched across every aspect of large models and ran nearly four hours! In the first half we talked a lot about infra innovation, and about the technology and ecosystem behind software-development agents such as the currently hot OpenDevin. In the second half we returned to the theme of open source and discussed:

How might the open-source vs. closed-source ecosystem of foundation models evolve?
How does commercializing open-source models compare with the open-source business models of the big-data era, such as Databricks?
How do you build an open-source project with international influence?

Guest introductions:

Tiezhen Wang, engineer at Hugging Face. He is a bridge between the Chinese and global open-source AI ecosystems; from the Google TensorFlow era to being an early Hugging Face employee, he has deep insight into both.

Junyang Lin, open-source lead for Qwen (Tongyi Qianwen). As Qwen's main voice in the global open-source community, he has not only witnessed the evolution of open source but is also a core team member of OpenDevin, the much-watched open-source agent project.

Zhuohan Li, PhD at UC Berkeley, leads the famous vLLM, the LLM inference framework that has become an industry standard. His lab, Sky Lab, is known as a cradle of open-source infrastructure, from Databricks (now valued in the tens of billions of dollars) to Anyscale (the company commercializing the open-source compute framework Ray). He has also been deeply involved in internationally known open-source projects such as Chatbot Arena and Vicuna, bringing both first-hand international experience and plenty of substance on LLM infra and the surrounding ecosystem.

OnBoard! host: Monica, USD-fund VC investor, formerly on AWS's Silicon Valley team and at an AI startup; runs the WeChat public account M小姐研习录 (ID: MissMStudy) | Jike: 莫妮卡同学

We also cover core topics such as data and evaluation, so the episode is very comprehensive while keeping the depth of front-line practitioners. Rather than splitting it into two parts, use the timestamps in the show notes to jump straight to the topics you care about (though I think every topic is great!). One more disclosure: the open-source community Hugging Face and the open-source projects discussed in this episode, including Alibaba's Qwen, OpenDevin, DeepSeek, 01.AI's Yi, and vLLM, did not pay for any advertising; this is entirely heartfelt sharing from the guests, with no ads throughout! Of course, if you or other AI companies would like to sponsor this labor-of-love podcast, you are more than welcome. The three-hour hardcore marathon starts now, enjoy!

What we covered

05:28 Guest self-introductions and interesting open-source AI projects
18:37 How vLLM started, how it became a top global project, and why we need an LLM inference framework
30:24 Agent frameworks: what inference challenges do agents like OpenDevin bring?
40:37 What new tools are still needed to build a good coding agent? What will multimodality change?
56:16 What kind of agent framework do we need? Why is the open-source community best placed to build it? Will frameworks converge?
67:46 What is Crew AI? How to think about multi-agent architectures?
73:11 Drawing on the history of front-end frameworks, how does a framework become an industry standard?
77:54 The state of open-source LLMs on Hugging Face: important progress over the past year, different ways of open-sourcing, and how to gauge an open model's popularity
94:27 How to understand the open-source LLM ecosystems of different architectures, and how Qwen builds a better open-source ecosystem through architectural evolution
104:59 What innovations are coming from China's open-source LLM projects? How are model architectures changing?
112:17 Why might new model architectures bring new commercialization opportunities? What can we learn from earlier open-source commercialization?
119:22 Have we hit the ceiling of current LLM architectures? What would a new architecture look like?
128:03 What did Zhuohan learn from working on Vicuna, one of the earliest open-source LLMs? How should academia and industry divide the work in the LLM ecosystem?
140:48 What noteworthy progress is happening in datasets for LLMs?
149:42 Why did Mistral take off so quickly? What lessons carry over to building a world-class open-source project? What has vLLM learned, both in principle and in practice?
166:13 How did Chatbot Arena start? Why is model evaluation so important? What challenges and advances lie ahead?
180:49 Zhuohan's thoughts on commercializing vLLM, and how much room inference costs still have to fall
188:17 Quick-fire: what exceeded or fell short of expectations in generative AI over the past year, and what is worth looking forward to?
Companies and key terms we mentioned:

Qwen, Qwen-2
OpenDevin: opendevin.github.io
vLLM: github.com
Yi (Github), 01.AI (零一万物)
Chatbot Arena: huggingface.co
AutoGPT: github.com
crew AI: www.crewai.com
autoAWQ: github.com
LLM.c: github.com
Flash attention: github.com
Continuous batching: a serving technique that dynamically groups incoming requests into batches as they arrive, improving GPU utilization and throughput.
KV cache: the key-value cache that stores attention keys and values for already-processed tokens so they are not recomputed at every decoding step (a toy code sketch of this idea follows after this entry).
Paged attention: vLLM's memory-management technique that stores the KV cache in fixed-size blocks ("pages"), similar to virtual memory in an operating system, reducing fragmentation so more requests can be served at once.
Quantization: reducing the numerical precision of model weights (and sometimes activations) to fewer bits, shrinking model size and speeding up inference.
Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model
Google Gemini: deepmind.google
Adept: www.adept.ai
MetaGPT: github.com
Dolphin: an open-source, uncensored, and commercially licensed dataset and series of instruct-tuned language models based on Microsoft's Orca paper
Common crawl: commoncrawl.org
Tiezhen's talk: Booming Open Source Chinese-Speaking LLMs: A Closer Look, Slides
"Qwen's first anniversary: choices and reflections on the open-source sprint | ModelScope in-depth interview" (通义千问一周年,开源狂飙路上的抉择与思考|魔搭深度访谈)
"Alibaba's Junyang Lin: for many people today's large models are not enough; building multimodal agents is the key | China AIGC Industry Summit" (阿里林俊旸:大模型对很多人来说不够用,打造多模态Agent是关键 | 中国AIGC产业峰会)

Follow M小姐's WeChat public account for more on software, AI, and startup investing in China and the US: M小姐研习录 (ID: MissMStudy). If you enjoy OnBoard!, you can also send a tip and buy us a coffee! If you listen on Apple Podcasts, please leave us a five-star review; it matters a lot to us. Finally, come join the OnBoard! listener group to meet other great listeners; we also organize in-person themed meetups, open live sittings of podcast recordings, guest interactions, and other experiments. Add either assistant on WeChat, onboard666 or Nine_tunes, and they will add you to the group. Hope to see you there!
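Several of the inference terms in the glossary above (KV cache, continuous batching, paged attention) revolve around the same core idea: keys and values for tokens you have already processed are stored and reused rather than recomputed. Here is a toy, single-head NumPy sketch of the KV-cache idea during autoregressive decoding; it is an illustration only, not vLLM code, and it leaves out batching and paging entirely.

```python
# Toy illustration of a KV cache during autoregressive decoding (single head,
# no batching, no paging). Keys/values for past tokens are computed once,
# appended to the cache, and reused at every later decoding step.
import numpy as np

d = 8  # head dimension
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x):
    """x: embedding of the newest token, shape (d,). Returns the attention output."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    k_cache.append(k)
    v_cache.append(v)
    K = np.stack(k_cache)            # (t, d): all cached keys, no recomputation
    V = np.stack(v_cache)            # (t, d): all cached values
    scores = K @ q / np.sqrt(d)      # attend over every cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for t in range(5):                   # pretend we generate 5 tokens
    out = decode_step(rng.normal(size=d))
    print(t, out.shape, "cache size:", len(k_cache))
```

Paged attention is then about where those cached keys and values live in GPU memory, so that many such sequences can be served concurrently without wasting space.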

Podcast Notes Playlist: Latest Episodes
$100B Founder Breaks Down The Biggest AI Business Opportunities For 2025

Podcast Notes Playlist: Latest Episodes

Play Episode Listen Later Nov 30, 2024 91:38


My First Million Key Takeaways. Check out the episode page. Read the full notes @ podcastnotes.org. Episode 653: Shaan Puri ( https://x.com/ShaanVP ) talks to Furqan Rydhan ( https://x.com/FurqanR ) about the biggest opportunities in AI right now. — Show Notes: (0:00) Intro (4:42) Define the Job-to-be-done (8:20) How to build an AI Agent workflow (11:16) AI Tools break down (27:05) How Polymarket won (31:48) Why VR is a sleeping giant? (44:43) Be a lifelong player in tech (58:52) The unbeatable combination (1:02:27) Adam Foroughi's A+ execution (1:18:35) Betting on -1 to 0 — Links: • Furqan's site - https://furqan.sh/ • Founders, Inc - https://f.inc/ • Applovin - https://www.applovin.com/ • Claude - https://claude.ai/ • OpenAI - https://platform.openai.com/ • Langchain - https://www.langchain.com/ • AutoGen - https://autogenai.com/ • Crew - https://www.crewai.com/ • CloudSDK - https://cloud.google.com/sdk/ • Perplexity - https://www.perplexity.ai/ • “Attention is all you need” - https://typeset.io/papers/attention-is-all-you-need-1hodz0wcqb • Anthropic - https://www.anthropic.com/ • Third Web - https://thirdweb.com/ • Luna's AI Brain - https://terminal.virtuals.io/ • Oasis - https://oasis.decart.ai/welcome • Polymarket - https://polymarket.com/ • Gorilla Tag - https://www.gorillatagvr.com/ • Yeeps - https://tinyurl.com/59z2yrdu — Check Out Shaan's Stuff: Need to hire? You should use the same service Shaan uses to hire developers, designers, & Virtual Assistants → it's called Shepherd (tell ‘em Shaan sent you): https://bit.ly/SupportShepherd — Check Out Sam's Stuff: • Hampton - https://www.joinhampton.com/ • Ideation Bootcamp - https://www.ideationbootcamp.co/ • Copy That - https://copythat.com • Hampton Wealth Survey - https://joinhampton.com/wealth • Sam's List - http://samslist.co/ My First Million is a HubSpot Original Podcast // Brought to you by The HubSpot Podcast Network // Production by Arie Desormeaux // Editing by Ezra Bakker Trupiano

My First Million
$100B Founder Breaks Down The Biggest AI Business Opportunities For 2025

My First Million

Play Episode Listen Later Nov 26, 2024 91:38


Episode 653: Shaan Puri ( https://x.com/ShaanVP ) talks to Furqan Rydhan ( https://x.com/FurqanR ) about the biggest opportunities in AI right now.  — Show Notes:  (0:00) Intro (4:42) Define the Job-to-be-done (8:20) How to build an AI Agent workflow (11:16) AI Tools break down (27:05) How Polymarket won (31:48) Why VR is a sleeping giant? (44:43) Be a lifelong player in tech (58:52) The unbeatable combination (1:02:27) Adam Foroughi's A+ execution (1:18:35) Betting on -1 to 0 — Links: • Furqan's site - https://furqan.sh/  • Founders, Inc - https://f.inc/  • Applovin - https://www.applovin.com/  • Claude - https://claude.ai/  • OpenAI - https://platform.openai.com/  • Langchain - https://www.langchain.com/  • AutoGen - https://autogenai.com/  • Crew - https://www.crewai.com/  • CloudSDK - https://cloud.google.com/sdk/  • Perplexity - https://www.perplexity.ai/  • “Attention is all you need” - https://typeset.io/papers/attention-is-all-you-need-1hodz0wcqb  • Anthropic - https://www.anthropic.com/  • Third Web - https://thirdweb.com/  • Luna's AI Brain - https://terminal.virtuals.io/  • Oasis - https://oasis.decart.ai/welcome  • Polymarket - https://polymarket.com/  • Gorilla Tag - https://www.gorillatagvr.com/  • Yeeps - https://tinyurl.com/59z2yrdu  — Check Out Shaan's Stuff: Need to hire? You should use the same service Shaan uses to hire developers, designers, & Virtual Assistants → it's called Shepherd (tell ‘em Shaan sent you): https://bit.ly/SupportShepherd — Check Out Sam's Stuff: • Hampton - https://www.joinhampton.com/ • Ideation Bootcamp - https://www.ideationbootcamp.co/ • Copy That - https://copythat.com • Hampton Wealth Survey - https://joinhampton.com/wealth • Sam's List - http://samslist.co/ My First Million is a HubSpot Original Podcast // Brought to you by The HubSpot Podcast Network // Production by Arie Desormeaux // Editing by Ezra Bakker Trupiano

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Alessio will be at AWS re:Invent next week and hosting a casual coffee meetup on Wednesday, RSVP here! And subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups!

We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!

If you've been following the AI agents space, you have heard of Lindy AI; while founder Flo Crivello is hesitant to call it "blowing up," when folks like Andrew Wilkinson start obsessing over your product, you're definitely onto something.

In our latest episode, Flo walked us through Lindy's evolution from late 2022 to now, revealing some design choices about agent platform design that go against conventional wisdom in the space.

The Great Reset: From Text Fields to Rails

Remember late 2022? Everyone was "LLM-pilled," believing that if you just gave a language model enough context and tools, it could do anything. Lindy 1.0 followed this pattern:

* Big prompt field ✅
* Bunch of tools ✅
* Prayer to the LLM gods ✅

Fast forward to today, and Lindy 2.0 looks radically different. As Flo put it (~17:00 in the episode): "The more you can put your agent on rails, one, the more reliable it's going to be, obviously, but two, it's also going to be easier to use for the user."

Instead of a giant, intimidating text field, users now build workflows visually:

* Trigger (e.g., "Zendesk ticket received")
* Required actions (e.g., "Check knowledge base")
* Response generation

This isn't just a UI change - it's a fundamental rethinking of how to make AI agents reliable. As Swyx noted during our discussion: "Put Shoggoth in a box and make it a very small, minimal viable box. Everything else should be traditional if-this-then-that software." (A minimal code sketch of this pattern appears further down.)

The Surprising Truth About Model Limitations

Here's something that might shock folks building in the space: with Claude 3.5 Sonnet, the model is no longer the bottleneck. Flo's exact words (~31:00): "It is actually shocking the extent to which the model is no longer the limit. It was the limit a year ago. It was too expensive. The context window was too small."

Some context: Lindy started when context windows were 4K tokens. Today, their system prompt alone is larger than that. But what's really interesting is what this means for platform builders:

* Raw capabilities aren't the constraint anymore
* Integration quality matters more than model performance
* User experience and workflow design are the new bottlenecks

The Search Engine Parallel: Why Horizontal Platforms Might Win

One of the spiciest takes from our conversation was Flo's thesis on horizontal vs. vertical agent platforms. He draws a fascinating parallel to search engines (~56:00): "I find it surprising the extent to which a horizontal search engine has won... You go through Google to search Reddit. You go through Google to search Wikipedia... search in each vertical has more in common with search than it does with each vertical."

His argument: agent platforms might follow the same pattern because:

* Agents across verticals share more commonalities than differences
* There's value in having agents that can work together under one roof
* The R&D cost of getting agents right is better amortized across use cases

This might explain why we're seeing early vertical AI companies starting to expand horizontally. The core agent capabilities - reliability, context management, tool integration - are universal needs.
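Before the takeaways, here is a minimal sketch of what the "agent on rails" pattern described above might look like in code. This is an illustration, not Lindy's actual implementation: the trigger and the mandatory knowledge-base lookup are plain deterministic software, and the model is only asked to draft the final reply. `Ticket`, `search_knowledge_base`, and `call_llm` are hypothetical stand-ins.

```python
# Minimal sketch of the "agent on rails" pattern (illustration only, not Lindy's code).
# The trigger and the required knowledge-base step are ordinary deterministic code;
# only the final reply generation is delegated to the model.
from dataclasses import dataclass

@dataclass
class Ticket:
    subject: str
    body: str

def search_knowledge_base(query: str) -> list[str]:
    # Stand-in for a real retrieval step (vector search, keyword search, etc.).
    kb = {"reset password": "Passwords can be reset from Settings > Security."}
    return [doc for key, doc in kb.items() if key in query.lower()]

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real model API. The model only ever sees this one
    # narrow task, not the whole workflow.
    return f"[drafted reply based on a prompt of {len(prompt)} chars]"

def handle_ticket(ticket: Ticket) -> str:
    # Step 1 (always runs): consult the knowledge base -- no prompting required.
    docs = search_knowledge_base(ticket.subject + " " + ticket.body)
    # Step 2: the model's only job is to draft a reply from fixed inputs.
    prompt = (
        "Draft a support reply.\n"
        f"Ticket: {ticket.subject}\n{ticket.body}\n"
        f"Relevant docs: {docs or 'none found'}"
    )
    return call_llm(prompt)

print(handle_ticket(Ticket("Reset password", "How do I reset my password?")))
```

The contrast with the Lindy 1.0 approach is that nothing here depends on the model choosing to consult the knowledge base; that step always runs, and only the narrow drafting step is left to the LLM.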
What This Means for Builders

If you're building in the AI agents space, here are the key takeaways:

* Constrain First: Rather than maximizing capabilities, focus on reliable execution within narrow bounds
* Integration Quality Matters: With model capabilities plateauing, your competitive advantage lies in how well you integrate with existing tools
* Memory Management is Key: Flo revealed they actively prune agent memories - even with larger context windows, not all memories are useful
* Design for Discovery: Lindy's visual workflow builder shows how important interface design is for adoption

The Meta Layer

There's a broader lesson here about AI product development. Just as Lindy evolved from "give the LLM everything" to "constrain intelligently," we might see similar evolution across the AI tooling space. The winners might not be those with the most powerful models, but those who best understand how to package AI capabilities in ways that solve real problems reliably.

Full Video Podcast

Flo's talk at AI Engineer Summit

Chapters

* 00:00:00 Introductions
* 00:04:05 AI engineering and deterministic software
* 00:08:36 Lindys demo
* 00:13:21 Memory management in AI agents
* 00:18:48 Hierarchy and collaboration between Lindys
* 00:21:19 Vertical vs. horizontal AI tools
* 00:24:03 Community and user engagement strategies
* 00:26:16 Rickrolling incident with Lindy
* 00:28:12 Evals and quality control in AI systems
* 00:31:52 Model capabilities and their impact on Lindy
* 00:39:27 Competition and market positioning
* 00:42:40 Relationship between Factorio and business strategy
* 00:44:05 Remote work vs. in-person collaboration
* 00:49:03 Europe vs US Tech
* 00:58:59 Testing the Overton window and free speech
* 01:04:20 Balancing AI safety concerns with business innovation

Show Notes

* Lindy.ai
* Rick Rolling
* Flo on X
* TeamFlow
* Andrew Wilkinson
* Dust
* Poolside.ai
* SB1047
* Gathertown
* Sid Sijbrandij
* Matt Mullenweg
* Factorio
* Seeing Like a State

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:12]: Hey, and today we're joined in the studio by Florent Crivello. Welcome.Flo [00:00:15]: Hey, yeah, thanks for having me.Swyx [00:00:17]: Also known as Altimore. I always wanted to ask, what is Altimore?Flo [00:00:21]: It was the name of my character when I was playing Dungeons & Dragons. Always. I was like 11 years old.Swyx [00:00:26]: What was your classes?Flo [00:00:27]: I was an elf. I was a magician elf.Swyx [00:00:30]: Well, you're still spinning magic. Right now, you're a solo founder and CEO of Lindy.ai. What is Lindy?Flo [00:00:36]: Yeah, we are a no-code platform letting you build your own AI agents easily. So you can think of we are to LangChain as Airtable is to MySQL. Like you can just pin up AI agents super easily by clicking around and no code required. You don't have to be an engineer and you can automate business workflows that you simply could not automate before in a few minutes.Swyx [00:00:55]: You've been in our orbit a few times. I think you spoke at our Latent Space anniversary. You spoke at my summit, the first summit, which was a really good keynote. And most recently, like we actually already scheduled this podcast before this happened. But Andrew Wilkinson was like, I'm obsessed by Lindy. He's just created a whole bunch of agents.
So basically, why are you blowing up?Flo [00:01:16]: Well, thank you. I think we are having a little bit of a moment. I think it's a bit premature to say we're blowing up. But why are things going well? We revamped the product majorly. We called it Lindy 2.0. I would say we started working on that six months ago. We've actually not really announced it yet. It's just, I guess, I guess that's what we're doing now. And so we've basically been cooking for the last six months, like really rebuilding the product from scratch. I think I'll list you, actually, the last time you tried the product, it was still Lindy 1.0. Oh, yeah. If you log in now, the platform looks very different. There's like a ton more features. And I think one realization that we made, and I think a lot of folks in the agent space made the same realization, is that there is such a thing as too much of a good thing. I think many people, when they started working on agents, they were very LLM peeled and chat GPT peeled, right? They got ahead of themselves in a way, and us included, and they thought that agents were actually, and LLMs were actually more advanced than they actually were. And so the first version of Lindy was like just a giant prompt and a bunch of tools. And then the realization we had was like, hey, actually, the more you can put your agent on Rails, one, the more reliable it's going to be, obviously, but two, it's also going to be easier to use for the user, because you can really, as a user, you get, instead of just getting this big, giant, intimidating text field, and you type words in there, and you have no idea if you're typing the right word or not, here you can really click and select step by step, and tell your agent what to do, and really give as narrow or as wide a guardrail as you want for your agent. We started working on that. We called it Lindy on Rails about six months ago, and we started putting it into the hands of users over the last, I would say, two months or so, and I think things really started going pretty well at that point. The agent is way more reliable, way easier to set up, and we're already seeing a ton of new use cases pop up.Swyx [00:03:00]: Yeah, just a quick follow-up on that. You launched the first Lindy in November last year, and you were already talking about having a DSL, right? I remember having this discussion with you, and you were like, it's just much more reliable. Is this still the DSL under the hood? Is this a UI-level change, or is it a bigger rewrite?Flo [00:03:17]: No, it is a much bigger rewrite. I'll give you a concrete example. Suppose you want to have an agent that observes your Zendesk tickets, and it's like, hey, every time you receive a Zendesk ticket, I want you to check my knowledge base, so it's like a RAG module and whatnot, and then answer the ticket. The way it used to work with Lindy before was, you would type the prompt asking it to do that. You check my knowledge base, and so on and so forth. The problem with doing that is that it can always go wrong. You're praying the LLM gods that they will actually invoke your knowledge base, but I don't want to ask it. I want it to always, 100% of the time, consult the knowledge base after it receives a Zendesk ticket. And so with Lindy, you can actually have the trigger, which is Zendesk ticket received, have the knowledge base consult, which is always there, and then have the agent. 
So you can really set up your agent any way you want like that.Swyx [00:04:05]: This is something I think about for AI engineering as well, which is the big labs want you to hand over everything in the prompts, and only code of English, and then the smaller brains, the GPU pours, always want to write more code to make things more deterministic and reliable and controllable. One way I put it is put Shoggoth in a box and make it a very small, the minimal viable box. Everything else should be traditional, if this, then that software.Flo [00:04:29]: I love that characterization, put the Shoggoth in the box. Yeah, we talk about using as much AI as necessary and as little as possible.Alessio [00:04:37]: And what was the choosing between kind of like this drag and drop, low code, whatever, super code-driven, maybe like the Lang chains, auto-GPT of the world, and maybe the flip side of it, which you don't really do, it's like just text to agent, it's like build the workflow for me. Like what have you learned actually putting this in front of users and figuring out how much do they actually want to add it versus like how much, you know, kind of like Ruby on Rails instead of Lindy on Rails, it's kind of like, you know, defaults over configuration.Flo [00:05:06]: I actually used to dislike when people said, oh, text is not a great interface. I was like, ah, this is such a mid-take, I think text is awesome. And I've actually come around, I actually sort of agree now that text is really not great. I think for people like you and me, because we sort of have a mental model, okay, when I type a prompt into this text box, this is what it's going to do, it's going to map it to this kind of data structure under the hood and so forth. I guess it's a little bit blackmailing towards humans. You jump on these calls with humans and you're like, here's a text box, this is going to set up an agent for you, do it. And then they type words like, I want you to help me put order in my inbox. Oh, actually, this is a good one. This is actually a good one. What's a bad one? I would say 60 or 70% of the prompts that people type don't mean anything. Me as a human, as AGI, I don't understand what they mean. I don't know what they mean. It is actually, I think whenever you can have a GUI, it is better than to have just a pure text interface.Alessio [00:05:58]: And then how do you decide how much to expose? So even with the tools, you have Slack, you have Google Calendar, you have Gmail. Should people by default just turn over access to everything and then you help them figure out what to use? I think that's the question. When I tried to set up Slack, it was like, hey, give me access to all channels and everything, which for the average person probably makes sense because you don't want to re-prompt them every time you add new channels. But at the same time, for maybe the more sophisticated enterprise use cases, people are like, hey, I want to really limit what you have access to. How do you kind of thread that balance?Flo [00:06:35]: The general philosophy is we ask for the least amount of permissions needed at any given moment. I don't think Slack, I could be mistaken, but I don't think Slack lets you request permissions for just one channel. But for example, for Google, obviously there are hundreds of scopes that you could require for Google. There's a lot of scopes. And sometimes it's actually painful to set up your Lindy because you're going to have to ask Google and add scopes five or six times. 
We've had sessions like this. But that's what we do because, for example, the Lindy email drafter, she's going to ask you for your authorization once for, I need to be able to read your email so I can draft a reply, and then another time for I need to be able to write a draft for them. We just try to do it very incrementally like that.Alessio [00:07:15]: Do you think OAuth is just overall going to change? I think maybe before it was like, hey, we need to set up OAuth that humans only want to kind of do once. So we try to jam-pack things all at once versus what if you could on-demand get different permissions every time from different parts? Do you ever think about designing things knowing that maybe AI will use it instead of humans will use it? Yeah, for sure.Flo [00:07:37]: One pattern we've started to see is people provisioning accounts for their AI agents. And so, in particular, Google Workspace accounts. So, for example, Lindy can be used as a scheduling assistant. So you can just CC her to your emails when you're trying to find time with someone. And just like a human assistant, she's going to go back and forth and offer other abilities and so forth. Very often, people don't want the other party to know that it's an AI. So it's actually funny. They introduce delays. They ask the agent to wait before replying, so it's not too obvious that it's an AI. And they provision an account on Google Suite, which costs them like $10 a month or something like that. So we're seeing that pattern more and more. I think that does the job for now. I'm not optimistic on us actually patching OAuth. Because I agree with you, ultimately, we would want to patch OAuth because the new account thing is kind of a clutch. It's really a hack. You would want to patch OAuth to have more granular access control and really be able to put your sugar in the box. I'm not optimistic on us doing that before AGI, I think. That's a very close timeline.Swyx [00:08:36]: I'm mindful of talking about a thing without showing it. And we already have the setup to show it. Why don't we jump into a screen share? For listeners, you can jump on the YouTube and like and subscribe. But also, let's have a look at how you show off Lindy. Yeah, absolutely.Flo [00:08:51]: I'll give an example of a very simple Lindy and then I'll graduate to a much more complicated one. A super simple Lindy that I have is, I unfortunately bought some investment properties in the south of France. It was a really, really bad idea. And I put them on a Holydew, which is like the French Airbnb, if you will. And so I received these emails from time to time telling me like, oh, hey, you made 200 bucks. Someone booked your place. When I receive these emails, I want to log this reservation in a spreadsheet. Doing this without an AI agent or without AI in general is a pain in the butt because you must write an HTML parser for this email. And so it's just hard. You may not be able to do it and it's going to break the moment the email changes. By contrast, the way it works with Lindy, it's really simple. It's two steps. It's like, okay, I receive an email. If it is a reservation confirmation, I have this filter here. Then I append a row to this spreadsheet. And so this is where you can see the AI part where the way this action is configured here, you see these purple fields on the right. Each of these fields is a prompt. And so I can say, okay, you extract from the email the day the reservation begins on. You extract the amount of the reservation. 
You extract the number of travelers of the reservation. And now you can see when I look at the task history of this Lindy, it's really simple. It's like, okay, you do this and boom, appending this row to this spreadsheet. And this is the information extracted. So effectively, this node here, this append row node is a mini agent. It can see everything that just happened. It has context over the task and it's appending the row. And then it's going to send a reply to the thread. That's a very simple example of an agent.Swyx [00:10:34]: A quick follow-up question on this one while we're still on this page. Is that one call? Is that a structured output call? Yeah. Okay, nice. Yeah.Flo [00:10:41]: And you can see here for every node, you can configure which model you want to power the node. Here I use cloud. For this, I use GPT-4 Turbo. Much more complex example, my meeting recorder. It looks very complex because I've added to it over time, but at a high level, it's really simple. It's like when a meeting begins, you record the meeting. And after the meeting, you send me a summary and you send me coaching notes. So I receive, like my Lindy is constantly coaching me. And so you can see here in the prompt of the coaching notes, I've told it, hey, you know, was I unnecessarily confrontational at any point? I'm French, so I have to watch out for that. Or not confrontational enough. Should I have double-clicked on any issue, right? So I can really give it exactly the kind of coaching that I'm expecting. And then the interesting thing here is, like, you can see the agent here, after it sent me these coaching notes, moves on. And it does a bunch of other stuff. So it goes on Slack. It disseminates the notes on Slack. It does a bunch of other stuff. But it's actually able to backtrack and resume the automation at the coaching notes email if I responded to that email. So I'll give a super concrete example. This is an actual coaching feedback that I received from Lindy. She was like, hey, this was a sales call I had with a customer. And she was like, I found your explanation of Lindy too technical. And I was able to follow up and just ask a follow-up question in the thread here. And I was like, why did you find too technical about my explanation? And Lindy restored the context. And so she basically picked up the automation back up here in the tree. And she has all of the context of everything that happened, including the meeting in which I was. So she was like, oh, you used the words deterministic and context window and agent state. And that concept exists at every level for every channel and every action that Lindy takes. So another example here is, I mentioned she also disseminates the notes on Slack. So this was a meeting where I was not, right? So this was a teammate. He's an indie meeting recorder, posts the meeting notes in this customer discovery channel on Slack. So you can see, okay, this is the onboarding call we had. This was the use case. Look at the questions. How do I make Lindy slower? How do I add delays to make Lindy slower? And I was able, in the Slack thread, to ask follow-up questions like, oh, what did we answer to these questions? And it's really handy because I know I can have this sort of interactive Q&A with these meetings. It means that very often now, I don't go to meetings anymore. I just send my Lindy. And instead of going to like a 60-minute meeting, I have like a five-minute chat with my Lindy afterwards. And she just replied. 
She was like, well, this is what we replied to this customer. And I can just be like, okay, good job, Jack. Like, no notes about your answers. So that's the kind of use cases people have with Lindy. It's a lot of like, there's a lot of sales automations, customer support automations, and a lot of this, which is basically personal assistance automations, like meeting scheduling and so forth.Alessio [00:13:21]: Yeah, and I think the question that people might have is memory. So as you get coaching, how does it track whether or not you're improving? You know, if these are like mistakes you made in the past, like, how do you think about that?Flo [00:13:31]: Yeah, we have a memory module. So I'll show you my meeting scheduler, Lindy, which has a lot of memories because by now I've used her for so long. And so every time I talk to her, she saves a memory. If I tell her, you screwed up, please don't do this. So you can see here, oh, it's got a double memory here. This is the meeting link I have, or this is the address of the office. If I tell someone to meet me at home, this is the address of my place. This is the code. I guess we'll have to edit that out. This is not the code of my place. No dogs. Yeah, so Lindy can just manage her own memory and decide when she's remembering things between executions. Okay.Swyx [00:14:11]: I mean, I'm just going to take the opportunity to ask you, since you are the creator of this thing, how come there's so few memories, right? Like, if you've been using this for two years, there should be thousands of thousands of things. That is a good question.Flo [00:14:22]: Agents still get confused if they have too many memories, to my point earlier about that. So I just am out of a call with a member of the Lama team at Meta, and we were chatting about Lindy, and we were going into the system prompt that we sent to Lindy, and all of that stuff. And he was amazed, and he was like, it's a miracle that it's working, guys. He was like, this kind of system prompt, this does not exist, either pre-training or post-training. These models were never trained to do this kind of stuff. It's a miracle that they can be agents at all. And so what I do, I actually prune the memories. You know, it's actually something I've gotten into the habit of doing from back when we had GPT 3.5, being Lindy agents. I suspect it's probably not as necessary in the Cloud 3.5 Sunette days, but I prune the memories. Yeah, okay.Swyx [00:15:05]: The reason is because I have another assistant that also is recording and trying to come up with facts about me. It comes up with a lot of trivial, useless facts that I... So I spend most of my time pruning. Actually, it's not super useful. I'd much rather have high-quality facts that it accepts. Or maybe I was even thinking, were you ever tempted to add a wake word to only memorize this when I say memorize this? And otherwise, don't even bother.Flo [00:15:30]: I have a Lindy that does this. So this is my inbox processor, Lindy. It's kind of beefy because there's a lot of different emails. But somewhere in here,Swyx [00:15:38]: there is a rule where I'm like,Flo [00:15:39]: aha, I can email my inbox processor, Lindy. It's really handy. So she has her own email address. And so when I process my email inbox, I sometimes forward an email to her. And it's a newsletter, or it's like a cold outreach from a recruiter that I don't care about, or anything like that. And I can give her a rule. And I can be like, hey, this email I want you to archive, moving forward. 
Or I want you to alert me on Slack when I have this kind of email. It's really important. And so you can see here, the prompt is, if I give you a rule about a kind of email, like archive emails from X, save it as a new memory. And I give it to the memory saving skill. And yeah.Swyx [00:16:13]: One thing that just occurred to me, so I'm a big fan of virtual mailboxes. I recommend that everybody have a virtual mailbox. You could set up a physical mail receive thing for Lindy. And so then Lindy can process your physical mail.Flo [00:16:26]: That's actually a good idea. I actually already have something like that. I use like health class mail. Yeah. So yeah, most likely, I can process my physical mail. Yeah.Swyx [00:16:35]: And then the other product's idea I have, looking at this thing, is people want to brag about the complexity of their Lindys. So this would be like a 65 point Lindy, right?Flo [00:16:43]: What's a 65 point?Swyx [00:16:44]: Complexity counting. Like how many nodes, how many things, how many conditions, right? Yeah.Flo [00:16:49]: This is not the most complex one. I have another one. This designer recruiter here is kind of beefy as well. Right, right, right. So I'm just saying,Swyx [00:16:56]: let people brag. Let people be super users. Oh, right.Flo [00:16:59]: Give them a score. Give them a score.Swyx [00:17:01]: Then they'll just be like, okay, how high can you make this score?Flo [00:17:04]: Yeah, that's a good point. And I think that's, again, the beauty of this on-rails phenomenon. It's like, think of the equivalent, the prompt equivalent of this Lindy here, for example, that we're looking at. It'd be monstrous. And the odds that it gets it right are so low. But here, because we're really holding the agent's hand step by step by step, it's actually super reliable. Yeah.Swyx [00:17:22]: And is it all structured output-based? Yeah. As far as possible? Basically. Like, there's no non-structured output?Flo [00:17:27]: There is. So, for example, here, this AI agent step, right, or this send message step, sometimes it gets to... That's just plain text.Swyx [00:17:35]: That's right.Flo [00:17:36]: Yeah. So I'll give you an example. Maybe it's TMI. I'm having blood pressure issues these days. And so this Lindy here, I give it my blood pressure readings, and it updates a log that I have of my blood pressure that it sends to my doctor.Swyx [00:17:49]: Oh, so every Lindy comes with a to-do list?Flo [00:17:52]: Yeah. Every Lindy has its own task history. Huh. Yeah. And so you can see here, this is my main Lindy, my personal assistant, and I've told it, where is this? There is a point where I'm like, if I am giving you a health-related fact, right here, I'm giving you health information, so then you update this log that I have in this Google Doc, and then you send me a message. And you can see, I've actually not configured this send message node. I haven't told it what to send me a message for. Right? And you can see, it's actually lecturing me. It's like, I'm giving it my blood pressure ratings. It's like, hey, it's a bit high. Here are some lifestyle changes you may want to consider.Alessio [00:18:27]: I think maybe this is the most confusing or new thing for people. So even I use Lindy and I didn't even know you could have multiple workflows in one Lindy. I think the mental model is kind of like the Zapier workflows. It starts and it ends. It doesn't choose between. How do you think about what's a Lindy versus what's a sub-function of a Lindy? 
Like, what's the hierarchy?Flo [00:18:48]: Yeah. Frankly, I think the line is a little arbitrary. It's kind of like when you code, like when do you start to create a new class versus when do you overload your current class. I think of it in terms of like jobs to be done and I think of it in terms of who is the Lindy serving. This Lindy is serving me personally. It's really my day-to-day Lindy. I give it a bunch of stuff, like very easy tasks. And so this is just the Lindy I go to. Sometimes when a task is really more specialized, so for example, I have this like summarizer Lindy or this designer recruiter Lindy. These tasks are really beefy. I wouldn't want to add this to my main Lindy, so I just created a separate Lindy for it. Or when it's a Lindy that serves another constituency, like our customer support Lindy, I don't want to add that to my personal assistant Lindy. These are two very different Lindys.Alessio [00:19:31]: And you can call a Lindy from within another Lindy. That's right. You can kind of chain them together.Flo [00:19:36]: Lindys can work together, absolutely.Swyx [00:19:38]: A couple more things for the video portion. I noticed you have a podcast follower. We have to ask about that. What is that?Flo [00:19:46]: So this one wakes me up every... So wakes herself up every week. And she sends me... So she woke up yesterday, actually. And she searches for Lenny's podcast. And she looks for like the latest episode on YouTube. And once she finds it, she transcribes the video and then she sends me the summary by email. I don't listen to podcasts as much anymore. I just like read these summaries. Yeah.Alessio [00:20:09]: We should make a latent space Lindy. Marketplace.Swyx [00:20:12]: Yeah. And then you have a whole bunch of connectors. I saw the list briefly. Any interesting one? Complicated one that you're proud of? Anything that you want to just share? Connector stories.Flo [00:20:23]: So many of our workflows are about meeting scheduling. So we had to build some very open unity tools around meeting scheduling. So for example, one that is surprisingly hard is this find available times action. You would not believe... This is like a thousand lines of code or something. It's just a very beefy action. And you can pass it a bunch of parameters about how long is the meeting? When does it start? When does it end? What are the meetings? The weekdays in which I meet? How many time slots do you return? What's the buffer between my meetings? It's just a very, very, very complex action. I really like our GitHub action. So we have a Lindy PR reviewer. And it's really handy because anytime any bug happens... So the Lindy reads our guidelines on Google Docs. By now, the guidelines are like 40 pages long or something. And so every time any new kind of bug happens, we just go to the guideline and we add the lines. Like, hey, this has happened before. Please watch out for this category of bugs. And it's saving us so much time every day.Alessio [00:21:19]: There's companies doing PR reviews. Where does a Lindy start? When does a company start? Or maybe how do you think about the complexity of these tasks when it's going to be worth having kind of like a vertical standalone company versus just like, hey, a Lindy is going to do a good job 99% of the time?Flo [00:21:34]: That's a good question. We think about this one all the time. I can't say that we've really come up with a very crisp articulation of when do you want to use a vertical tool versus when do you want to use a horizontal tool. 
I think of it as very similar to the internet. I find it surprising the extent to which a horizontal search engine has won. But I think that Google, right? But I think the even more surprising fact is that the horizontal search engine has won in almost every vertical, right? You go through Google to search Reddit. You go through Google to search Wikipedia. I think maybe the biggest exception is e-commerce. Like you go to Amazon to search e-commerce, but otherwise you go through Google. And I think that the reason for that is because search in each vertical has more in common with search than it does with each vertical. And search is so expensive to get right. Like Google is a big company that it makes a lot of sense to aggregate all of these different use cases and to spread your R&D budget across all of these different use cases. I have a thesis, which is, it's a really cool thesis for Lindy, is that the same thing is true for agents. I think that by and large, in a lot of verticals, agents in each vertical have more in common with agents than they do with each vertical. I also think there are benefits in having a single agent platform because that way your agents can work together. They're all like under one roof. That way you only learn one platform and so you can create agents for everything that you want. And you don't have to like pay for like a bunch of different platforms and so forth. So I think ultimately, it is actually going to shake out in a way that is similar to search in that search is everywhere on the internet. Every website has a search box, right? So there's going to be a lot of vertical agents for everything. I think AI is going to completely penetrate every category of software. But then I also think there are going to be a few very, very, very big horizontal agents that serve a lot of functions for people.Swyx [00:23:14]: That is actually one of the questions that we had about the agent stuff. So I guess we can transition away from the screen and I'll just ask the follow-up, which is, that is a hot topic. You're basically saying that the current VC obsession of the day, which is vertical AI enabled SaaS, is mostly not going to work out. And then there are going to be some super giant horizontal SaaS.Flo [00:23:34]: Oh, no, I'm not saying it's either or. Like SaaS today, vertical SaaS is huge and there's also a lot of horizontal platforms. If you look at like Airtable or Notion, basically the entire no-code space is very horizontal. I mean, Loom and Zoom and Slack, there's a lot of very horizontal tools out there. Okay.Swyx [00:23:49]: I was just trying to get a reaction out of you for hot takes. Trying to get a hot take.Flo [00:23:54]: No, I also think it is natural for the vertical solutions to emerge first because it's just easier to build. It's just much, much, much harder to build something horizontal. Cool.Swyx [00:24:03]: Some more Lindy-specific questions. So we covered most of the top use cases and you have an academy. That was nice to see. I also see some other people doing it for you for free. So like Ben Spites is doing it and then there's some other guy who's also doing like lessons. Yeah. Which is kind of nice, right? Yeah, absolutely. You don't have to do any of that.Flo [00:24:20]: Oh, we've been seeing it more and more on like LinkedIn and Twitter, like people posting their Lindys and so forth.Swyx [00:24:24]: I think that's the flywheel that you built the platform where creators see value in allying themselves to you. 
And so then, you know, your incentive is to make them successful so that they can make other people successful and then it just drives more and more engagement. Like it's earned media. Like you don't have to do anything.Flo [00:24:39]: Yeah, yeah. I mean, community is everything.Swyx [00:24:41]: Are you doing anything special there? Any big wins?Flo [00:24:44]: We have a Slack community that's pretty active. I can't say we've invested much more than that so far.Swyx [00:24:49]: I would say from having, so I have some involvement in the no-code community. I would say that Webflow going very hard after no-code as a category got them a lot more allies than just the people using Webflow. So it helps you to grow the community beyond just Lindy. And I don't know what this is called. Maybe it's just no-code again. Maybe you want to call it something different. But there's definitely an appetite for this and you are one of a broad category, right? Like just before you, we had Dust and, you know, they're also kind of going after a similar market. Zapier obviously is not going to try to also compete with you. Yeah. There's no question there. It's just like a reaction about community. Like I think a lot about community. Lanespace is growing the community of AI engineers. And I think you have a slightly different audience of, I don't know what.Flo [00:25:33]: Yeah. I think the no-code tinkerers is the community. Yeah. It is going to be the same sort of community as what Webflow, Zapier, Airtable, Notion to some extent.Swyx [00:25:43]: Yeah. The framing can be different if you were, so I think tinkerers has this connotation of not serious or like small. And if you framed it to like no-code EA, we're exclusively only for CEOs with a certain budget, then you just have, you tap into a different budget.Flo [00:25:58]: That's true. The problem with EA is like, the CEO has no willingness to actually tinker and play with the platform.Swyx [00:26:05]: Maybe Andrew's doing that. Like a lot of your biggest advocates are CEOs, right?Flo [00:26:09]: A solopreneur, you know, small business owners, I think Andrew is an exception. Yeah. Yeah, yeah, he is.Swyx [00:26:14]: He's an exception in many ways. Yep.Alessio [00:26:16]: Just before we wrap on the use cases, is Rick rolling your customers? Like a officially supported use case or maybe tell that story?Flo [00:26:24]: It's one of the main jobs to be done, really. Yeah, we woke up recently, so we have a Lindy obviously doing our customer support and we do check after the Lindy. And so we caught this email exchange where someone was asking Lindy for video tutorials. And at the time, actually, we did not have video tutorials. We do now on the Lindy Academy. And Lindy responded to the email. It's like, oh, absolutely, here's a link. And we were like, what? Like, what kind of link did you send? And so we clicked on the link and it was a recall. We actually reacted fast enough that the customer had not yet opened the email. And so we reacted immediately. Like, oh, hey, actually, sorry, this is the right link. And so the customer never reacted to the first link. And so, yeah, I tweeted about that. It went surprisingly viral. And I checked afterwards in the logs. We did like a database query and we found, I think, like three or four other instances of it having happened before.Swyx [00:27:12]: That's surprisingly low.Flo [00:27:13]: It is low. 
And we fixed it across the board by just adding a line to the system prompt that's like, hey, don't recall people, please don't recall.Swyx [00:27:21]: Yeah, yeah, yeah. I mean, so, you know, you can explain it retroactively, right? Like, that YouTube slug has been pasted in so many different corpuses that obviously it learned to hallucinate that.Alessio [00:27:31]: And it pretended to be so many things. That's the thing.Swyx [00:27:34]: I wouldn't be surprised if that takes one token. Like, there's this one slug in the tokenizer and it's just one token.Flo [00:27:41]: That's the idea of a YouTube video.Swyx [00:27:43]: Because it's used so much, right? And you have to basically get it exactly correct. It's probably not. That's a long speech.Flo [00:27:52]: It would have been so good.Alessio [00:27:55]: So this is just a jump maybe into evals from here. How could you possibly come up for an eval that says, make sure my AI does not recall my customer? I feel like when people are writing evals, that's not something that they come up with. So how do you think about evals when it's such like an open-ended problem space?Flo [00:28:12]: Yeah, it is tough. We built quite a bit of infrastructure for us to create evals in one click from any conversation history. So we can point to a conversation and we can be like, in one click we can turn it into effectively a unit test. It's like, this is a good conversation. This is how you're supposed to handle things like this. Or if it's a negative example, then we modify a little bit the conversation after generating the eval. So it's very easy for us to spin up this kind of eval.Alessio [00:28:36]: Do you use an off-the-shelf tool which is like Brain Trust on the podcast? Or did you just build your own?Flo [00:28:41]: We unfortunately built our own. We're most likely going to switch to Brain Trust. Well, when we built it, there was nothing. Like there was no eval tool, frankly. I mean, we started this project at the end of 2022. It was like, it was very, very, very early. I wouldn't recommend it to build your own eval tool. There's better solutions out there and our eval tool breaks all the time and it's a nightmare to maintain. And that's not something we want to be spending our time on.Swyx [00:29:04]: I was going to ask that basically because I think my first conversations with you about Lindy was that you had a strong opinion that everyone should build their own tools. And you were very proud of your evals. You're kind of showing off to me like how many evals you were running, right?Flo [00:29:16]: Yeah, I think that was before all of these tools came around. I think the ecosystem has matured a fair bit.Swyx [00:29:21]: What is one thing that Brain Trust has nailed that you always struggled to do?Flo [00:29:25]: We're not using them yet, so I couldn't tell. But from what I've gathered from the conversations I've had, like they're doing what we do with our eval tool, but better.Swyx [00:29:33]: And like they do it, but also like 60 other companies do it, right? So I don't know how to shop apart from brand. Word of mouth.Flo [00:29:41]: Same here.Swyx [00:29:42]: Yeah, like evals or Lindys, there's two kinds of evals, right? Like in some way, you don't have to eval your system as much because you've constrained the language model so much. And you can rely on open AI to guarantee that the structured outputs are going to be good, right? 
We had Michelle sit where you sit and she explained exactly how they do constraint grammar sampling and all that good stuff. So actually, I think it's more important for your customers to eval their Lindys than you evaling your Lindy platform because you just built the platform. You don't actually need to eval that much.Flo [00:30:14]: Yeah. In an ideal world, our customers don't need to care about this. And I think the bar is not like, look, it needs to be at 100%. I think the bar is it needs to be better than a human. And for most use cases we serve today, it is better than a human, especially if you put it on Rails.Swyx [00:30:30]: Is there a limiting factor of Lindy at the business? Like, is it adding new connectors? Is it adding new node types? Like how do you prioritize what is the most impactful to your company?Flo [00:30:41]: Yeah. The raw capabilities for sure are a big limit. It is actually shocking the extent to which the model is no longer the limit. It was the limit a year ago. It was too expensive. The context window was too small. It's kind of insane that we started building this when the context windows were like 4,000 tokens. Like today, our system prompt is more than 4,000 tokens. So yeah, the model is actually very much not a limit anymore. It almost gives me pause because I'm like, I want the model to be a limit. And so no, the integrations are ones, the core capabilities are ones. So for example, we are investing in a system that's basically, I call it like the, it's a J hack. Give me these names, like the poor man's RLHF. So you can turn on a toggle on any step of your Lindy workflow to be like, ask me for confirmation before you actually execute this step. So it's like, hey, I receive an email, you send a reply, ask me for confirmation before actually sending it. And so today you see the email that's about to get sent and you can either approve, deny, or change it and then approve. And we are making it so that when you make a change, we are then saving this change that you're making or embedding it in the vector database. And then we are retrieving these examples for future tasks and injecting them into the context window. So that's the kind of capability that makes a huge difference for users. That's the bottleneck today. It's really like good old engineering and product work.Swyx [00:31:52]: I assume you're hiring. We'll do a call for hiring at the end.Alessio [00:31:54]: Any other comments on the model side? When did you start feeling like the model was not a bottleneck anymore? Was it 4.0? Was it 3.5? 3.5.Flo [00:32:04]: 3.5 Sonnet, definitely. I think 4.0 is overhyped, frankly. We don't use 4.0. I don't think it's good for agentic behavior. Yeah, 3.5 Sonnet is when I started feeling that. And then with prompt caching with 3.5 Sonnet, like that fills the cost, cut the cost again. Just cut it in half. Yeah.Swyx [00:32:21]: Your prompts are... Some of the problems with agentic uses is that your prompts are kind of dynamic, right? Like from caching to work, you need the front prefix portion to be stable.Flo [00:32:32]: Yes, but we have this append-only ledger paradigm. So every node keeps appending to that ledger and every filled node inherits all the context built up by all the previous nodes. And so we can just decide, like, hey, every X thousand nodes, we trigger prompt caching again.Swyx [00:32:47]: Oh, so you do it like programmatically, not all the time.Flo [00:32:50]: No, sorry. Anthropic manages that for us. 
But basically, it's like, because we keep appending to the prompt, the prompt caching works pretty well.Alessio [00:32:57]: We have this small podcaster tool that I built for the podcast and I rewrote all of our prompts because I noticed, you know, I was inputting stuff early on. I wonder how much more money OpenAI and Anthropic are making just because people don't rewrite their prompts to be like static at the top and like dynamic at the bottom.Flo [00:33:13]: I think that's the remarkable thing about the moment we're in right now. It's insane that these companies are routinely cutting their costs by two, four, five. Like, they basically just apply constraints. They want people to take advantage of these innovations. Very good.Swyx [00:33:25]: Do you have any other competitive commentary? Commentary? Dust, WordWare, Gumloop, Zapier? If not, we can move on.Flo [00:33:31]: No comment.Alessio [00:33:32]: I think the market is,Flo [00:33:33]: look, I mean, AGI is coming. All right, that's what I'm talking about.Swyx [00:33:38]: I think you're helping. Like, you're paving the road to AGI.Flo [00:33:41]: I'm playing my small role. I'm adding my small brick to this giant, giant, giant castle. Yeah, look, when it's here, we are going to, this entire category of software is going to create, it's going to sound like an exaggeration, but it is a fact it is going to create trillions of dollars of value in a few years, right? It's going to, for the first time, we're actually having software directly replace human labor. I see it every day in sales calls. It's like, Lindy is today replacing, like, we talk to even small teams. It's like, oh, like, stop, this is a 12-people team here. I guess we'll set up this Lindy for one or two days, and then we'll have to decide what to do with this 12-people team. And so, yeah. To me, there's this immense uncapped market opportunity. It's just such a huge ocean, and there's like three sharks in the ocean. I'm focused on the ocean more than on the sharks.Swyx [00:34:25]: So we're moving on to hot topics, like, kind of broadening out from Lindy, but obviously informed by Lindy. What are the high-order bits of good agent design?Flo [00:34:31]: The model, the model, the model, the model. I think people fail to truly, and me included, they fail to truly internalize the bitter lesson. So for the listeners out there who don't know about it, it's basically like, you just scale the model. Like, GPUs go brr, it's all that matters. I think it also holds for the cognitive architecture. I used to be very cognitive architecture-pilled, and I was like, ah, you need a critic, and you need a generator, and all this, and then it's just like, GPUs go brr, like, just like let the model do its job. I think we're seeing it a little bit right now with O1. I'm seeing some tweets that say that the new 3.5 Sonnet is as good as O1, but with none of all the crazy...Swyx [00:35:09]: It beats O1 on some measures. On some reasoning tasks. On AIME, it's still a lot lower. Like, it's like 14 on AIME versus O1, it's like 83.Flo [00:35:17]: Got it. Right. But even O1 is still the model. Yeah.Swyx [00:35:22]: Like, there's no cognitive architecture on top of it.Flo [00:35:23]: You can just wait for O1 to get better.Alessio [00:35:25]: And so, as a founder, how do you think about that, right? Because now, knowing this, wouldn't you just wait to start Lindy? You know, you start Lindy, it's like 4K context, the models are not that good.
It's like, but you're still kind of like going along and building and just like waiting for the models to get better. How do you today decide, again, what to build next, knowing that, hey, the models are going to get better, so maybe we just shouldn't focus on improving our prompt design and all that stuff and just build the connectors instead or whatever? Yeah.Flo [00:35:51]: I mean, that's exactly what we do. Like, all day, we always ask ourselves, oh, when we have a feature idea or a feature request, we ask ourselves, like, is this the kind of thing that just gets better while we sleep because models get better? I'm reminded, again, when we started this in 2022, we spent a lot of time, because we had to, on context pruning because 4,000 tokens is really nothing. You really can't do anything with 4,000 tokens. All that work was throwaway work. Like, now it's like it was for nothing, right? Now we just assume that infinite context windows are going to be here in a year or something, a year and a half, and infinitely cheap as well, and dynamic compute is going to be here. Like, we just assume all of these things are going to happen, and so we really focus, our job to be done in the industry is to provide the input and output to the model. I really compare it all the time to the PC and the CPU, right? Apple is busy all day. They're not like a CPU wrapper. They have a lot to build, but they don't, well, now actually they do build the CPU as well, but leaving that aside, they're busy building a laptop. It's just a lot of work to build these things. It's interesting because, like,Swyx [00:36:45]: for example, another person that we're close to, Mihaly from Repl.it, he often says that the biggest jump for him was having a multi-agent approach, like the critique thing that you just said that you don't need, and I wonder when, in what situations you do need that and what situations you don't. Obviously, the simple answer is for coding, it helps, and you're not coding, except for, are you still generating code? In Lindy? Yeah.Flo [00:37:09]: No, we do. Oh, right. No, no, no, the cognitive architecture changed. We don't, yeah.Swyx [00:37:13]: Yeah, okay. For you, you're one shot, and you chain tools together, and that's it. And if the user really wantsFlo [00:37:18]: to have this kind of critique thing, you can also edit the prompt, you're welcome to. I have some of my Lindys, I've told them, like, hey, be careful, think step by step about what you're about to do, but that gives you a little bump for some use cases, but, yeah.Alessio [00:37:30]: What about unexpected model releases? So, Anthropic released computer use today. Yeah. I don't know if many people were expecting computer use to come out today. Do these things make you rethink how to design, like, your roadmap and things like that, or are you just like, hey, look, whatever, that's just, like, a small thing in their, like, AGI pursuit, that, like, maybe they're not even going to support, and, like, it's still better for us to build our own integrations into systems and things like that? Because maybe people will say, hey, look, why am I building all these API integrationsFlo [00:38:02]: when I can just do computer use and never go to the product? Yeah. No, I mean, we did take into account computer use. We were talking about this a year ago or something, like, we've been talking about it as part of our roadmap.
It's been clear to us that it was coming. My philosophy about it is anything that can be done with an API must be done by an API or should be done by an API for a very long time. I think it is dangerous to be overly cavalier about improvements of model capabilities. I'm reminded of iOS versus Android. Android was built on the JVM. There was a garbage collector, and I can only assume that the conversation that went down in the engineering meeting room was, oh, who cares about the garbage collector? Anyway, Moore's law is here, and so that's all going to go to zero eventually. Sure, but in the meantime, you are operating on a 400 MHz CPU. It was like the first CPU on the iPhone 1, and it's really slow, and the garbage collector is introducing a tremendous overhead on top of that, especially a memory overhead. For the longest time, and it's really only been recently that Android caught up to iOS in terms of how smooth the interactions were, but for the longest time, Android phones were significantly slowerSwyx [00:39:07]: and laggierFlo [00:39:08]: and just not feeling as good as iOS devices. Look, when you're talking about orders of magnitude of differences in terms of performance and reliability, which is what we are talking about when we're talking about API use versus computer use, then you can't ignore that, right? And so I think we're going to be in an API use world for a while.Swyx [00:39:27]: O1 doesn't have API use today. It will have it at some point, and it's on the roadmap. There is a future in which OpenAI goes much harder after your business, your market, than it is today. Like, ChatGPT, it's its own business. All they need to do is add tools to ChatGPT, and now they're suddenly competing with you. And by the way, they have a GPT store where a bunch of people have already configured their tools to fit with them. Is that a concern?Flo [00:39:56]: I think even the GPT store, in a way, like the way they architect it, for example, their plug-in system, is actually great for us because we can also use the plug-ins. It's very open. Now, again, I think it's going to be such a huge market. I think there's going to be a lot of different jobs to be done. I know they have a huge enterprise offering and stuff, but today, ChatGPT is a consumer app. And so, the sort of flow detail I showed you, this sort of workflow, this sort of use cases that we're going after, which is like, we're doing a lot of lead generation and lead outreach and all of that stuff. That's not something like meeting recording, like, Lindy today right now joins your Zoom meetings and takes notes, all of that stuff.Swyx [00:40:34]: I don't see that so farFlo [00:40:35]: on the OpenAI roadmap.Swyx [00:40:36]: Yeah, but they do have an enterprise team that we talk to. You're hiring GMs?Flo [00:40:42]: We did.Swyx [00:40:43]: It's a fascinating way to build a business, right? Like, what should you, as CEO, be in charge of? And what should you basically hireFlo [00:40:52]: a mini CEO to do? Yeah, that's a good question. I think that's also something we're figuring out. The GM thing was inspired from my days at Uber, where we hired one GM per city or per major geo area. We had like all GMs, regional GMs and so forth. And yeah, Lindy is so horizontal that we thought it made sense to hire GMs to own each vertical and the go-to-market of the vertical and the customization of the Lindy templates for these verticals and so forth. What should I own as a CEO?
I mean, the canonical reply here is always going to be, you know, you own the fundraising, you own the culture, you own the... What's the rest of the canonical reply? The culture, the fundraising.Swyx [00:41:29]: I don't know,Flo [00:41:30]: products. Even that, eventually, you do have to hand out. Yes, the vision, the culture, and the foundation. Well, then you've done your job as a CEO. In practice, obviously, yeah, I mean, all day, I do a lot of product work still and I want to keep doing product work for as long as possible.Swyx [00:41:48]: Obviously, like, you're recruiting and managing the team. Yeah.Flo [00:41:52]: That one feels like the most automatable part of the job, the recruiting stuff.Swyx [00:41:56]: Well, yeah. You saw myFlo [00:41:59]: design your recruiter here. The relationship between Factorio and building Lindy? We actually very often talk about how the business of the future is like a game of Factorio. Yeah. So, in the end, it's like Slack and you've got like 5,000 Lindys in the sidebar and your job is to somehow manage your 5,000 Lindys. And it's going to be very similar to company building because you're going to look for like the highest leverage way to understand what's going on in your AI company and understand what levers do you have to make impact in that company. So, I think it's going to be very similar to like a human company except it's going to go infinitely faster. Today, in a human company, you could have a meeting with your team and you're like, oh, I'm going to build a facility and, you know, now it's like, okay,Swyx [00:42:40]: boom, I'm going to spin up 50 designers. Yeah. Like, actually, it's more important that you can clone an existing designer that you know works because the hiring process, you cannot clone someone because every new person you bring in is going to have their own tweaksFlo [00:42:54]: and you don't want that. Yeah.Swyx [00:42:56]: That's true. You want an army of mindless dronesFlo [00:42:59]: that all work the same way.Swyx [00:43:00]: The reason I bring this, bring Factorio up as well is one, Factorio: Space Age just came out. Apparently, a whole bunch of people stopped working. I tried out Factorio. I never really got that much into it. But the other thing was, you had a tweet recently about how the sort of intentional top-down design was not as effective as just build. Yeah. Just ship.Flo [00:43:21]: I think people read a little bit too much into that tweet. It went weirdly viral. I was like, I did not intend it as a giant statement online.Swyx [00:43:28]: I mean, you notice you have a pattern with this, right? Like, you've done this for eight years now.Flo [00:43:33]: You should know. I legit was just sharing an interesting story about the Factorio game I had. And everybody was like, oh my God, so deep. I guess this explains everything about life and companies. There is something to be said, certainly, about focusing on the constraint. And I think it is Patrick Collison who said, people underestimate the extent to which moonshots are just one pragmatic step taken after the other. And I think as long as you have some inductive bias about, like, some loose idea about where you want to go, I think it makes sense to follow a sort of greedy search along that path. I think planning and organizing is important. And having order is important.Swyx [00:44:05]: I'm wrestling with that. There's two ways I encountered it recently. One with Lindy.
When I tried out one of your automation templates and one of them was quite big and I just didn't understand it, right? So, like, it was not as useful to me as a small one that I can just plug in and see all of. And then the other one was me using Cursor. I was very excited about O1 and I just up frontFlo [00:44:27]: stuffed everythingSwyx [00:44:28]: I wanted to do into my prompt and expected O1 to do everything. And it got itself into a huge jumbled mess and it was stuck. It was really... There was no amount... I wasted, like, two hours on just, like, trying to get out of that hole. So I threw away the code base, started small, switched to Claude Sonnet, and built up something working and just added to it over time and it just worked. And to me, that was the Factorio sentiment, right? Maybe I'm one of those fanboys that's just, like, obsessing over the depth of something that you just randomly tweeted out. But I think it's true for company building, for Lindy building, for coding.Flo [00:45:02]: I don't know. I think it's fair and I think, like, you and I talked about there's the Tuft & Metal principle and there's this other... Yes, I love that. There's the... I forgot the name of this other blog post but it's basically about this book Seeing Like a State that talks about the need for legibility and people who optimize the system for its legibility and anytime you make a system... So legible is basically more understandable. Anytime you make a system more understandable from the top down, it performs less well from the bottom up. And it's fine but you should at least make this trade-off with your eyes wide open. You should know, I am sacrificing performance for understandability, for legibility. And in this case, for you, it makes sense. It's like you are actually optimizing for legibility. You do want to understand your code base but in some other cases it may not make sense. Sometimes it's better to leave the system alone and let it be its glorious, chaotic, organic self and just trust that it's going to perform well even though you don't understand it completely.Swyx [00:45:55]: It does remind me of a common managerial issue or dilemma which you experienced in the small scale of Lindy where, you know, do you want to organize your company by functional sections or by products or, you know, whatever the opposite of functional is. And you tried it one way and it was more legible to you as CEO but actually it stopped working at the small level. Yeah.Flo [00:46:17]: I mean, one very small example, again, at a small scale is we used to have everything on Notion. And for me, as founder, it was awesome because everything was there. The roadmap was there. The tasks were there. The postmortems were there. And so, the postmortem was linkedSwyx [00:46:31]: to its task.Flo [00:46:32]: It was optimized for you. Exactly. And so, I had this, like, one pane of glass and everything was on Notion. And then the team, one day,Swyx [00:46:39]: came to me with pitchforksFlo [00:46:40]: and they really wanted to implement Linear. And I had to bite my fist so hard. I was like, fine, do it. Implement Linear. Because I was like, at the end of the day, the team needs to be able to self-organize and pick their own tools.Alessio [00:46:51]: Yeah. But it did make the company slightly less legible for me. Another big change you had was going away from remote work. Every other month, the discussion comes up again. What was that discussion like? How did your feelings change?
Was there kind of like a threshold of employees and team size where you felt like, okay, maybe that worked. Now it doesn't work anymore. And how are you thinking about the futureFlo [00:47:12]: as you scale the team? Yeah. So, for context, I used to have a business called TeamFlow. The business was about building a virtual office for remote teams. And so, being remote was not merely something we did. It was, I was banging the remote drum super hard and helping companies to go remote. And so, frankly, in a way, it's a bit embarrassing for me to do a 180 like that. But I guess, when the facts changed, I changed my mind. What happened? Well, I think at first, like everyone else, we went remote by necessity. It was like COVID and you've got to go remote. And on paper, the gains of remote are enormous. In particular, from a founder's standpoint, being able to hire from anywhere is huge. Saving on rent is huge. Saving on commute is huge for everyone and so forth. But then, look, we're all here. It's like, it is really making it much harder to work together. And I spent three years of my youth trying to build a solution for this. And my conclusion is, at least we couldn't figure it out and no one else could. Zoom didn't figure it out. We had like a bunch of competitors. Like, Gather Town was one of the bigger ones. We had dozens and dozens of competitors. No one figured it out. I don't know that software can actually solve this problem. The reality of it is, everyone just wants to get off the darn Zoom call. And it's not a good feeling to be in your home office if you're even going to have a home office all day. It's harder to build culture. It's harder to get in sync. I think software is peculiar because it's like an iceberg. It's like the vast majority of it is submerged underwater. And so, the quality of the software that you ship is a function of the alignment of your mental models about what is below that waterline. Can you actually get in sync about what it is exactly fundamentally that we're building? What is the soul of our product? And it is so much harder to get in sync about that when you're remote. And then you waste time in a thousand ways because people are offline and you can't get a hold of them or you can't share your screen. It's just like you feel like you're walking in molasses all day. And eventually, I was like, okay, this is it. We're not going to do this anymore.Swyx [00:49:03]: Yeah. I think that is the current builder San Francisco consensus here. Yeah. But I still have a big... One of my big heroes as a CEO is Sid Sijbrandij from GitLab.Flo [00:49:14]: Mm-hmm.Swyx [00:49:15]: Matt MullenwegFlo [00:49:16]: used to be a hero.Swyx [00:49:17]: But these people run thousand-person remote businesses. The main idea is that at some company

Thinking Elixir Podcast
228: From Surveys to Cheat Sheets

Thinking Elixir Podcast

Play Episode Listen Later Nov 12, 2024 21:07


News includes Theo releasing his ElixirConf presentation video on his channel, the launch of the Gleam and Curiosum Elixir developer surveys, a great tip of the Phoenix phx- cheatsheet documentation, and the Elixir LangChain project's improvements over the OG Python counterpart. Plus, details on upcoming events like CodeBeam Lite in New York and AlchemyConf in Portugal, a handy communication tip for teams—and more! Show Notes online - http://podcast.thinkingelixir.com/228 (http://podcast.thinkingelixir.com/228) Elixir Community News https://www.youtube.com/watch?v=7j2SaHAWTDM (https://www.youtube.com/watch?v=7j2SaHAWTDM?utm_source=thinkingelixir&utm_medium=shownotes) – Theo, known for his TypeScript and JS content, released his ElixirConf presentation video to his own channel. https://x.com/theo/status/1852521365880156323 (https://x.com/theo/status/1852521365880156323?utm_source=thinkingelixir&utm_medium=shownotes) – Theo announced the release of his ElixirConf presentation video on his social media. https://developer-survey.gleam.run/ (https://developer-survey.gleam.run/?utm_source=thinkingelixir&utm_medium=shownotes) – The Gleam developer survey is available for participation. https://curiosum.com/sl/b87vc3qn (https://curiosum.com/sl/b87vc3qn?utm_source=thinkingelixir&utm_medium=shownotes) – Participate in Curiosum's Elixir survey to contribute your feedback. https://curiosum.com/surveys/elixir-2023 (https://curiosum.com/surveys/elixir-2023?utm_source=thinkingelixir&utm_medium=shownotes) – View the results from Curiosum's 2023 Elixir survey with ~400 respondents. https://x.com/davydog187/status/1852408187816947982 (https://x.com/davydog187/status/1852408187816947982?utm_source=thinkingelixir&utm_medium=shownotes) – The value of the potentially little known phx- cheatsheet in the Phoenix docs. https://hexdocs.pm/phoenixliveview/html-attrs.html (https://hexdocs.pm/phoenix_live_view/html-attrs.html?utm_source=thinkingelixir&utm_medium=shownotes) – Detailed documentation on special HTML attributes used in Phoenix LiveView templates. https://x.com/davydog187/status/1853762123647582697 (https://x.com/davydog187/status/1853762123647582697?utm_source=thinkingelixir&utm_medium=shownotes) – Watch the recording of Dave Lucia's live-coding session on building an Igniter installer for OpenTelemetry. https://www.youtube.com/live/SOO4RcolYSI (https://www.youtube.com/live/SOO4RcolYSI?utm_source=thinkingelixir&utm_medium=shownotes) – Part 1 of Dave Lucia's live-coding session on YouTube; part 2 is upcoming. https://github.com/brainlid/langchain (https://github.com/brainlid/langchain?utm_source=thinkingelixir&utm_medium=shownotes) – Introduction to the Elixir LangChain project for integrating Elixir projects with LLMs. https://www.reddit.com/r/elixir/comments/1ggfcbt/iswitchedfromthepythonversionof_langchain/ (https://www.reddit.com/r/elixir/comments/1ggfcbt/i_switched_from_the_python_version_of_langchain/?utm_source=thinkingelixir&utm_medium=shownotes) – A Reddit user shares their positive experience with the Elixir LangChain library compared to the Python version. Mark Ericksen shares his availability and interest in helping companies leveraging Elixir and LLMs. https://codebeamnyc.com/ (https://codebeamnyc.com/?utm_source=thinkingelixir&utm_medium=shownotes) – CodeBeam Lite conference in NY on Nov 15th, featuring a single-track agenda and notable speakers. 
https://podcast.thinkingelixir.com/106 (https://podcast.thinkingelixir.com/106?utm_source=thinkingelixir&utm_medium=shownotes) – Previous podcast episode discussing Sam Aaron's work with Elixir and the BEAM. https://alchemyconf.com/ (https://alchemyconf.com/?utm_source=thinkingelixir&utm_medium=shownotes) – AlchemyConf website is live, with details for the event from March 31 - April 3 in Braga, Portugal. https://dontasktoask.com/ (https://dontasktoask.com/?utm_source=thinkingelixir&utm_medium=shownotes) – A communication tip for teams: don't ask to ask, just ask. Do you have some Elixir news to share? Tell us at @ThinkingElixir (https://twitter.com/ThinkingElixir) or email at show@thinkingelixir.com (mailto:show@thinkingelixir.com) Find us online - Message the show - @ThinkingElixir (https://x.com/ThinkingElixir) - Message the show on Fediverse - @ThinkingElixir@genserver.social (https://genserver.social/ThinkingElixir) - Email the show - show@thinkingelixir.com (mailto:show@thinkingelixir.com) - Mark Ericksen - @brainlid (https://x.com/brainlid) - Mark Ericksen on Fediverse - @brainlid@genserver.social (https://genserver.social/brainlid) - David Bernheisel - @bernheisel (https://x.com/bernheisel) - David Bernheisel on Fediverse - @dbern@genserver.social (https://genserver.social/dbern)

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

We are recording our next big recap episode and taking questions! Submit questions and messages on Speakpipe here for a chance to appear on the show!Also subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups!In our first ever episode with Logan Kilpatrick we called out the two hottest LLM frameworks at the time: LangChain and Dust. We've had Harrison from LangChain on twice (as a guest and as a co-host), and we've now finally come full circle as Stanislas from Dust joined us in the studio.After stints at Oracle and Stripe, Stan had joined OpenAI to work on mathematical reasoning capabilities. He describes his time at OpenAI as "the PhD I always wanted to do" while acknowledging the challenges of research work: "You're digging into a field all day long for weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13 seconds, you're like, 'oh, yeah, that was obvious.' And you go back to digging." This experience, combined with early access to GPT-4's capabilities, shaped his decision to start Dust: "If we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down."The History of DustDust's journey can be broken down into three phases:* Developer Framework (2022): Initially positioned as a competitor to LangChain, Dust started as a developer tooling platform. While both were open source, their approaches differed – LangChain focused on broad community adoption and integration as a pure developer experience, while Dust emphasized UI-driven development and better observability that wasn't just `print` statements.* Browser Extension (Early 2023): The company pivoted to building XP1, a browser extension that could interact with web content. This experiment helped validate user interaction patterns with AI, even while using less capable models than GPT-4.* Enterprise Platform (Current): Today, Dust has evolved into an infrastructure platform for deploying AI agents within companies, with impressive metrics like 88% daily active users in some deployments.The Case for Being HorizontalThe big discussion for early stage companies today is whether or not to be horizontal or vertical. Since models are so good at general tasks, a lot of companies are building vertical products that take care of a workflow end-to-end in order to offer more value and becoming more of “Services as Software”. Dust on the other hand is a platform for the users to build their own experiences, which has had a few advantages:* Maximum Penetration: Dust reports 60-70% weekly active users across entire companies, demonstrating the potential reach of horizontal solutions rather than selling into a single team.* Emergent Use Cases: By allowing non-technical users to create agents, Dust enables use cases to emerge organically from actual business needs rather than prescribed solutions.* Infrastructure Value: The platform approach creates lasting value through maintained integrations and connections, similar to how Stripe's value lies in maintaining payment infrastructure. Rather than relying on third-party integration providers, Dust maintains its own connections to ensure proper handling of different data types and structures.The Vertical ChallengeHowever, this approach comes with trade-offs:* Harder Go-to-Market: As Stan talked about: "We spike at penetration... but it makes our go-to-market much harder. 
Vertical solutions have a go-to-market that is much easier because they're like, 'oh, I'm going to solve the lawyer stuff.'"* Complex Infrastructure: Building a horizontal platform requires maintaining numerous integrations and handling diverse data types appropriately – from structured Salesforce data to unstructured Notion pages. As you scale integrations, the cost of maintaining them also scales. * Product Surface Complexity: Creating an interface that's both powerful and accessible to non-technical users requires careful design decisions, down to avoiding technical terms like "system prompt" in favor of "instructions." The Future of AI PlatformsStan initially predicted we'd see the first billion-dollar single-person company in 2023 (a prediction later echoed by Sam Altman), but he's now more focused on a different milestone: billion-dollar companies with engineering teams of just 20 people, enabled by AI assistance.This vision aligns with Dust's horizontal platform approach – building the infrastructure that allows small teams to achieve outsized impact through AI augmentation. Rather than replacing entire job functions (the vertical approach), they're betting on augmenting existing workflows across organizations.Full YouTube EpisodeChapters* 00:00:00 Introductions* 00:04:33 Joining OpenAI from Paris* 00:09:54 Research evolution and compute allocation at OpenAI* 00:13:12 Working with Ilya Sutskever and OpenAI's vision* 00:15:51 Leaving OpenAI to start Dust* 00:18:15 Early focus on browser extension and WebGPT-like functionality* 00:20:20 Dust as the infrastructure for agents* 00:24:03 Challenges of building with early AI models* 00:28:17 LLMs and Workflow Automation* 00:35:28 Building dependency graphs of agents* 00:37:34 Simulating API endpoints* 00:40:41 State of AI models* 00:43:19 Running evals* 00:46:36 Challenges in building AI agents infra* 00:49:21 Buy vs. build decisions for infrastructure components* 00:51:02 Future of SaaS and AI's Impact on Software* 00:53:07 The single employee $1B company race* 00:56:32 Horizontal vs. vertical approaches to AI agentsTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:11]: Hey, and today we're in a studio with Stanislas, welcome.Stan [00:00:14]: Thank you very much for having me.Swyx [00:00:16]: Visiting from Paris.Stan [00:00:17]: Paris.Swyx [00:00:18]: And you have had a very distinguished career. It's very hard to summarize, but you went to college in both Ecopolytechnique and Stanford, and then you worked in a number of places, Oracle, Totems, Stripe, and then OpenAI pre-ChatGPT. We'll talk, we'll spend a little bit of time about that. About two years ago, you left OpenAI to start Dust. I think you were one of the first OpenAI alum founders.Stan [00:00:40]: Yeah, I think it was about at the same time as the Adept guys, so that first wave.Swyx [00:00:46]: Yeah, and people really loved our David episode. We love a few sort of OpenAI stories, you know, for back in the day, like we're talking about pre-recording. Probably the statute of limitations on some of those stories has expired, so you can talk a little bit more freely without them coming after you. But maybe we'll just talk about, like, what was your journey into AI? You know, you were at Stripe for almost five years, there are a lot of Stripe alums going into OpenAI. 
I think the Stripe culture has come into OpenAI quite a bit.Stan [00:01:11]: Yeah, so I think the buses of Stripe people really started flowing in, I guess, after ChatGPT. But, yeah, my journey into AI is a... I mean, Greg Brockman. Yeah, yeah. From Greg, of course. And Daniela, actually, back in the days, Daniela Amodei.Swyx [00:01:27]: Yes, she was COO, I mean, she is COO, yeah. She had a pretty high job at OpenAI at the time, yeah, for sure.Stan [00:01:34]: My journey started like anybody else's: you're fascinated with computer science and you want to make computers think, it's awesome, but it doesn't work. I mean, it was a long time ago, I was like maybe 16, so it was 25 years ago. Then the first big exposure to AI would be at Stanford, and I'm going to, like, disclose how old I am, because at the time it was a class taught by Andrew Ng, and there was no deep learning. It was hand-crafted features for vision and the A* algorithm. So it was fun. But it was the early days of deep learning. A few years after, I think, there was that first project at Google, you know, the cat face or the human face trained from many images. I hesitated doing a PhD, more in systems, and eventually decided to get a job. Went to Oracle, started a company, made a gazillion mistakes, got acquired by Stripe, worked with Greg Brockman there. And toward the end of Stripe, I started getting interested in AI again, felt like it was the time, you had the Atari games, you had the self-driving craziness at the time. And I started exploring projects, it felt like the Atari games were incredible, but they were still games. And I was looking into exploring projects that would have an impact on the world. And so I decided to explore three things, self-driving cars, cybersecurity and AI, and math and AI. That's me listing them in decreasing order of impact on the world, I guess.Swyx [00:03:01]: Discovering new math would be very foundational.Stan [00:03:03]: It is extremely foundational, but it's not as direct as driving people around.Swyx [00:03:07]: Sorry, you're doing this at Stripe, you're like thinking about your next move.Stan [00:03:09]: No, it was at Stripe, kind of a bit of time where I started exploring. I did a bunch of work with friends on trying to get RC cars to drive autonomously. Almost started a company in France or Europe about self-driving trucks. We decided to not go for it because it was probably very operational. And I think the idea of the company, of the team wasn't there. And also I realized that if I wake up a day and because of a bug I wrote, I killed a family, it would be a bad experience. And so I just decided like, no, that's just too crazy. And then I explored cybersecurity with a friend. We were trying to apply transformers to fuzzing. So in fuzzing, you have kind of an algorithm that goes really fast and tries to mutate the inputs of a library to find bugs. And we tried to apply a transformer to that and do reinforcement learning with the signal of how much you propagate within the binary. Didn't work at all because the transformers are so slow compared to evolutionary algorithms that it kind of didn't work. Then I got interested in math and AI and started working on SAT solving with AI. And at the same time, OpenAI was kind of starting the reasoning team that were tackling that project as well. I was in touch with Greg and eventually got in touch with Ilya and finally found my way to OpenAI. I don't know how much you want to dig into that.
The way to find your way to OpenAI when you're in Paris was kind of an interesting adventure as well.Swyx [00:04:33]: Please. And I want to note, this was a two-month journey. You did all this in two months.Stan [00:04:38]: The search.Swyx [00:04:40]: Your search for your next thing, because you left in July 2019 and then you joined OpenAI in September.Stan [00:04:45]: I'm going to be ashamed to say that.Swyx [00:04:47]: You were searching before. I was searching before.Stan [00:04:49]: I mean, it's normal. No, the truth is that I moved back to Paris through Stripe and I just felt the hardship of being remote from your team nine hours away. And so it kind of freed a bit of time for me to start the exploration before. Sorry, Patrick. Sorry, John.Swyx [00:05:05]: Hopefully they're listening. So you joined OpenAI from Paris and from like, obviously you had worked with Greg, but notStan [00:05:13]: anyone else. No. Yeah. So I had worked with Greg, but not Ilya, but I had started chatting with Ilya and Ilya was kind of excited because he knew that I was a good engineer through Greg, I presume, but I was not a trained researcher, didn't do a PhD, never did research. And I started chatting and he was excited all the way to the point where he was like, hey, come pass interviews, it's going to be fun. I think he didn't care where I was, he just wanted to try working together. So I go to SF, go through the interview process, get an offer. And so I get Bob McGrew on the phone for the first time, he's like, hey, Stan, it's awesome. You've got an offer. When are you coming to SF? I'm like, hey, it's awesome. I'm not coming to SF. I'm based in Paris and we just moved. He was like, hey, it's awesome. Well, you don't have an offer anymore. Oh, my God. No, it wasn't as hard as that. But that's basically the idea. And it took like maybe a couple more rounds of chatting and they eventually decided to try a contractor setup. And that's how I kind of started working at OpenAI, officially as a contractor, but in practice really felt like being an employee.Swyx [00:06:14]: What did you work on?Stan [00:06:15]: So it was solely focused on math and AI. And in particular in the application, so the study of the large language models' mathematical reasoning capabilities, and in particular in the context of formal mathematics. The motivation was simple, transformers are very creative, but yet they make mistakes. Formal math systems have the ability to verify a proof, but the tactics they can use to solve problems are very mechanical, so you miss the creativity. And so the idea was to try to explore both together. You would get the creativity of the LLMs and the kind of verification capabilities of the formal system. A formal system, just to give a little bit of context, is a system in which a proof is a program and the formal system is a type system, a type system that is so evolved that you can verify the program. If it type checks, it means that the program is correct.Swyx [00:07:06]: Is the verification much faster than actually executing the program?Stan [00:07:12]: Verification is instantaneous, basically. So the truth is that what you write involves tactics that may involve computation to search for solutions. So it's not instantaneous. You do have to do the computation to expand the tactics into the actual proof.
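To make the "a proof is a program" idea concrete, here is a tiny illustrative example in Lean 4 (ours, not from the episode): the statement is the type, the proof is the term, and acceptance by the type checker is the verification.

```lean
-- A proof is a program: the type of each term below is the statement it proves.
-- If this file type-checks, the kernel has verified both proofs.
theorem two_plus_two : 2 + 2 = 4 := rfl

theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The tactics Stan mentions are just programs that help construct such terms; the kernel re-checks whatever they produce.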
The verification of the proof at the very low level is instantaneous.Swyx [00:07:32]: How quickly do you run into like, you know, halting problem, P versus NP type things, like impossibilities where you're just stuck?Stan [00:07:39]: I mean, you don't run into them. At the time, it was really trying to solve very easy problems. So I think the... Can you give an example of easy? Yeah, so that's the MATH benchmark that everybody knows today. The Dan Hendrycks one. The Dan Hendrycks one, yeah. And I think it was the low end part of the MATH benchmark at the time, because that MATH benchmark includes AMC problems, AMC 8, AMC 10, 12. So these are the easy ones. Then AIME problems, somewhat harder, and some IMO problems, which are crazy hard.Swyx [00:08:07]: For our listeners, we covered this in our Benchmarks 101 episode. AMC is literally the grade of like high school, grade 8, grade 10, grade 12. So you can solve this. Just briefly to mention this, because I don't think we'll touch on this again. There's a bit of work with like Lean, and then with, you know, more recently with DeepMind doing like scoring like silver on the IMO. Any commentary on like how math has evolved from your early work to today?Stan [00:08:34]: I mean, that result is mind blowing. I mean, from my perspective, I spent three years on that. At the same time, Guillaume Lample in Paris, we were both in Paris, actually. He was at FAIR, working on some problems. We were pushing the boundaries, and the goal was the IMO. And we cracked a few problems here and there. But the idea of getting a medal at an IMO was like just remote. So this is an impressive result. And we can, I think the DeepMind team just did a good job of scaling. I think there's nothing too magical in their approach, even if it hasn't been published. There's a David Silver talk from seven days ago that goes a little bit more into the details. It feels like there's nothing magical there. It's really applying reinforcement learning and scaling up the amount of data they can generate through autoformalization. So we can dig into what autoformalization means if you want.Alessio [00:09:26]: Let's talk about the tail end, maybe, of the OpenAI period. So you joined, and you're like, I'm going to work on math and do all of these things. I saw on one of your blog posts, you mentioned you fine-tuned over 10,000 models at OpenAI using 10 million A100 hours. How did the research evolve from GPT-2, and then getting closer to DaVinci 003? And then you left just before ChatGPT was released, but tell people a bit more about the research path that took you there.Stan [00:09:54]: I can give you my perspective of it. I think at OpenAI, there's always been a large chunk of the compute that was reserved to train the GPTs, which makes sense. So it was pre-Anthropic split. Most of the compute was going to a product called Nest, which was basically GPT-3. And then you had a bunch of, let's say, remote, not core research teams that were trying to explore maybe more specific problems or maybe the algorithm part of it. The interesting part, I don't know if it was where your question was going, is that in those labs, you're managing researchers. So by definition, you shouldn't be managing them. But in that space, there's a managing tool that is great, which is compute allocation. Basically by managing the compute allocation, you can message to the teams where you think the priority should go. And so it was really a question of, you were free as a researcher to work on whatever you wanted.
But if it was not aligned with the OpenAI mission, and that's fair, you wouldn't get the compute allocation. As it happens, solving math was very much aligned with the direction of OpenAI. And so I was lucky to generally get the compute I needed to make good progress.Swyx [00:11:06]: What do you need to show as incremental results to get funded for further results?Stan [00:11:12]: It's an imperfect process because there's a bit of a... If you're working on math and AI, obviously there's kind of a prior that it's going to be aligned with the company. So it's much easier than going into something much riskier, I guess. You have to show incremental progress, I guess. It's like you ask for a certain amount of compute and you deliver a few weeks after and you demonstrate that you have progress. Progress might be a positive result. Progress might be a strong negative result. And a strong negative result is actually often much harder to get or much more interesting than a positive result. And then it generally goes into, as any organization, you would have people finding your project or any other project cool and fancy. And so you would have that kind of phase of growing compute allocation for it all the way to a point. And then maybe you reach an apex and then maybe you go back mostly to zero and restart the process because you're going in a different direction or something else. That's how I felt. Explore, exploit. Yeah, exactly. Exactly. Exactly. It's a reinforcement learning approach.Swyx [00:12:14]: Classic PhD student search process.Alessio [00:12:17]: And you were reporting to Ilya, like the results you were kind of bringing back to him or like what's the structure? It's almost like when you're doing such cutting edge research, you need to report to somebody who is actually really smart to understand that the direction is right.Stan [00:12:29]: So we had a reasoning team, which was working on reasoning, obviously, and so math in general. And that team had a manager, but Ilya was extremely involved in the team as an advisor, I guess. Since he brought me into OpenAI, I was lucky, mostly during the first years, to have kind of direct access to him. He would really coach me as a trainee researcher, I guess, with good engineering skills. And Ilya, I think at OpenAI, he was the one showing the North Star, right? It was his job and I think he really enjoyed it and he did it super well, going through the teams and saying, this is where we should be going and trying to, you know, flock the different teams together towards an objective.Swyx [00:13:12]: I would say like the public perception of him is that he was the strongest believer in scaling. Oh, yeah. Obviously, he has always pursued the compression thesis. You have worked with him personally, what does the public not know about how he works?Stan [00:13:26]: I think he's really focused on building the vision and communicating the vision within the company, which was extremely useful. I was personally surprised that he spent so much time, you know, working on communicating that vision and getting the teams to work together versus...Swyx [00:13:40]: To be specific, vision is AGI? Oh, yeah.Stan [00:13:42]: Vision is like, yeah, it's the belief in compression and scaling compute. I remember when I started working on the Reasoning team, the excitement was really about scaling the compute around Reasoning and that was really the belief we wanted to ingrain in the team.
And that's what has been useful to the team. And the DeepMind results show that it was the right approach, and the success of GPT-4 and stuff shows that it was the right approach.Swyx [00:14:06]: Was it according to the neural scaling laws, the Kaplan paper that was published?Stan [00:14:12]: I think it was before that, because those ones came with GPT-3, basically at the time of GPT-3 being released or being ready internally. But before that, there really was a strong belief in scale. I think it was just the belief that the transformer was a generic enough architecture that you could learn anything. And that was just a question of scaling.Alessio [00:14:33]: Any other fun stories you want to tell? Sam Altman, Greg, you know, anything.Stan [00:14:37]: Weirdly, I didn't work that much with Greg when I was at OpenAI. He had always been mostly focused on training the GPTs and rightfully so. One thing about Sam Altman, he really impressed me because when I joined, he had joined not that long ago and it felt like he was kind of a very high level CEO. And I was mind blown by how deep he was able to go into the subjects within a year or something, all the way to a situation where, by year two at OpenAI, when I was having lunch with him, he would just know quite deeply what I was doing. With no ML background. Yeah, with no ML background, but I didn't have any either, so I guess that explains why. But I think it's a question about, you don't necessarily need to understand the very technicalities of how things are done, but you need to understand what's the goal and what's being done and what are the recent results and all of that, you know. And we could have kind of a very productive discussion. And that really impressed me, given the size at the time of OpenAI, which was not negligible.Swyx [00:15:44]: Yeah. I mean, you've been a, you were a founder before, you're a founder now, and you've seen Sam as a founder. How has he affected you as a founder?Stan [00:15:51]: I think having that capability of changing the scale of your attention in the company, because most of the time you operate at a very high level, but being able to go deep down and being in the know of what's happening on the ground is something that I feel is really enlightening. That's not a place in which I ever was as a founder, because first company, we went all the way to 10 people. Current company, there's 25 of us. So the high level, the sky and the ground are pretty much at the same place. No, you're being too humble.Swyx [00:16:21]: I mean, Stripe was also like a huge rocket ship.Stan [00:16:23]: Stripe, I was not a founder. So I was, like at OpenAI, I was really happy being on the ground, pushing the machine, making it work. Yeah.Swyx [00:16:31]: Last OpenAI question. The Anthropic split you mentioned, you were around for that. Very dramatic. David also left around that time, you left. This year, we've also had a similar management shakeup, let's just call it. Can you compare what it was like going through that split during that time? And then like, does that have any similarities now? Like, are we going to see a new Anthropic emerge from these folks that just left?Stan [00:16:54]: That I really, really don't know. At the time, the split was pretty surprising because they had been training GPT-3, it was a success. And to be completely transparent, I wasn't in the weeds of the splits. What I understood of it is that there was a disagreement of the commercialization of that technology.
I think the focal point of that disagreement was the fact that we started working on the API and wanted to make those models available through an API. Is that really the core disagreement? I don't know.Swyx [00:17:25]: Was it safety?Stan [00:17:26]: Was it commercialization?Swyx [00:17:27]: Or did they just want to start a company?Stan [00:17:28]: Exactly. Exactly. That I don't know. But I think what I was surprised by is how quickly OpenAI recovered at the time. And I think it's just because we were mostly a research org and the mission was so clear that some divergence in some teams, some people leave, the mission is still there. We have the compute. We have a site. So it just keeps going.Swyx [00:17:50]: Very deep bench. Like just a lot of talent. Yeah.Alessio [00:17:53]: So that was the OpenAI part of the history. Exactly. So then you leave OpenAI in September 2022. And I would say in Silicon Valley, the two hottest companies at the time were you and LangChain. What was that start like and why did you decide to start with a more developer focused kind of like an AI engineer tool rather than going back into some more research and something else?Stan [00:18:15]: Yeah. First, I'm not a trained researcher. So going through OpenAI was really kind of the PhD I always wanted to do. But research is hard. You're digging into a field all day long for weeks and weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13 seconds, you're like, oh, yeah, that was obvious. And you go back to digging. I'm not a trained, like formally trained researcher, and it wasn't necessarily an ambition of mine to have a research career. And I felt the hardness of it. I enjoyed it a ton. But at the time, I decided that I wanted to go back to something more productive. And the other fun motivation was like, I mean, if we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down. And so that was kind of the true motivation for like trying to go there. So that's kind of the core motivation at the beginning, personally. And the motivation for starting a company was pretty simple. I had seen GPT-4 internally at the time, it was September 2022. So it was pre-ChatGPT, but GPT-4 was ready, I mean, it had been ready for a few months internally. I was like, okay, that's obvious, the capabilities are there to create an insane amount of value to the world. And yet the deployment is not there yet. The revenue of OpenAI at the time was ridiculously small compared to what it is today. So the thesis was, there's probably a lot to be done at the product level to unlock the usage.Alessio [00:19:49]: Yeah. Let's talk a bit more about the form factor, maybe. I think one of the first successes you had was kind of like the WebGPT-like thing, like using the models to traverse the web and like summarize things. And the browser was really the interface. Why did you start with the browser? Like why was it important? And then you built XP1, which was kind of like the browser extension.Stan [00:20:09]: So the starting point at the time was, if you wanted to talk about LLMs, it was still a rather small community, a community of mostly researchers and to some extent, very early adopters, very early engineers.
It was almost inconceivable to just build a product and go sell it to the enterprise, though at the time there were a few companies doing that. The one on marketing, I don't remember its name, Jasper. But so the natural first intention, the first, first, first intention was to go to the developers and try to create tooling for them to create products on top of those models. And so that's what Dust was originally. It was quite different than LangChain, and LangChain just beat the s**t out of us, which is great. It's a choice.Swyx [00:20:53]: You were cloud, and closed source. They were open source.Stan [00:20:56]: Yeah. So technically we were open source and we still are open source, but I think that doesn't really matter. I had the strong belief from my research time that you cannot create an LLM-based workflow on just one example. Basically, if you just have one example, you overfit. So as you develop your interaction, your orchestration around the LLM, you need a dozen examples. Obviously, if you're running a dozen examples on a multi-step workflow, you start parallelizing stuff. And if you do that in the console, you just have like a messy stream of tokens going out and it's very hard to observe what's going on there. And so the idea was to go with a UI so that you could kind of introspect easily the output of each interaction with the model and dig into there through a UI, which is-Swyx [00:21:42]: Was that open source? I actually didn't come across it.Stan [00:21:44]: Oh yeah, it was. I mean, Dust is entirely open source even today. We're not going for an open source-Swyx [00:21:48]: If it matters, I didn't know that.Stan [00:21:49]: No, no, no, no, no. The reason why is because we're not open source because we're not doing an open source strategy. It's not an open source go-to-market at all. We're open source because we can and it's fun.Swyx [00:21:59]: Open source is marketing. You have all the downsides of open source, which is like people can clone you.Stan [00:22:03]: But I think that downside is a big fallacy. Okay. Yes, anybody can clone Dust today, but the value of Dust is not the current state. The value of Dust is the number of eyeballs and hands of developers that are building on it in the future. And so yes, anybody can clone it today, but that wouldn't change anything. There is some value in being open source. In a discussion with the security team, you can be extremely transparent and just show the code. When you have discussions with users and there's a bug or a feature missing, you can just point to the issue, show the pull request, show the, exactly, oh, PR welcome. That doesn't happen that much, but you can show the progress if the person that you're chatting with is a little bit technical, they really enjoy seeing the pull request advancing and seeing it all the way to deploy. And then the downsides are mostly around security. You never want to do security by obfuscation. But the truth is that your vector of attack is facilitated by you being open source. But at the same time, it's a good thing because if you're doing anything like a bug bounty or stuff like that, you just give much more tools to the bug bounty hunters so that their output is much better. So there's many, many, many trade-offs. I don't believe in the value of the code base per se. I think it's really the people that are on the code base that have the value and go to market and the product and all of those things that are around the code base. Obviously, that's not true for every code base.
If you're working on a very secret kernel to accelerate the inference of LLMs, I would buy that you don't want to be open source. But for product stuff, I really think there's very little risk. Yeah.Alessio [00:23:39]: I signed up for XP1, I was looking, in January 2023. I think at the time you were on DaVinci 003. Given that you had seen GPT-4, how did you feel having to push a product out that was using this model that was so inferior? And you're like, please, just use it today. I promise it's going to get better. Just overall, as a founder, how do you build something that maybe doesn't quite work with the model today, but you're just expecting the new model to be better?Stan [00:24:03]: Yeah, so actually, XP1 was even on a smaller one that was the post-GPT release, small version, so it was... Ada, Babbage... No, no, no, not that far away. But it was the small version of GPT, basically. I don't remember its name. Yes, you have a frustration there. But at the same time, I think XP1 was designed, was an experiment, but was designed as a way to be useful at the current capability of the model. If you just want to extract data from a LinkedIn page, that model was just fine. If you want to summarize an article in a newspaper, that model was just fine. And so it was really a question of trying to find a product that works with the current capability, knowing that you will always have tailwinds as models get better and faster and cheaper. So that was kind of a... There's a bit of a frustration because you know what's out there and you know that you don't have access to it yet. It's also interesting to try to find a product that works with the current capability.Alessio [00:24:55]: And we highlighted XP1 in our Anatomy of Autonomy post in April of last year, which was, you know, where are all the agents, right? So now we spent 30 minutes getting to what you're building now. So you basically had a developer framework, then you had a browser extension, then you had all these things, and then you kind of got to where Dust is today. So maybe just give people an overview of what Dust is today and the core thesis behind it. Yeah, of course.Stan [00:25:20]: So Dust, we really want to build the infrastructure so that companies can deploy agents within their teams. We are horizontal by nature because we strongly believe in the emergence of use cases from people having access to creating agents, people who don't need to be developers. They have to be thinkers. They have to be curious. But anybody can create an agent that will solve an operational thing that they're doing in their day-to-day job. And to make those agents useful, there are two focuses, which is interesting. The first one is an infrastructure focus. You have to build the pipes so that the agent has access to the data. You have to build the pipes such that the agents can take action, can access the web, et cetera. So that's really an infrastructure play. Maintaining connections to Notion, Slack, GitHub, all of them is a lot of work. It is boring work, boring infrastructure work, but that's something that we know is extremely valuable in the same way that Stripe is extremely valuable because it maintains the pipes. And we have that dual focus because we're also building the product for people to use it. And there it's fascinating because everything started from the conversational interface, obviously, which is a great starting point. But we're only scratching the surface, right? I think we are at the Pong level of LLM productization.
And we haven't invented the C3. We haven't invented Counter-Strike. We haven't invented Cyberpunk 2077. So this is really our mission is to really create the product that lets people equip themselves to just get away all the work that can be automated or assisted by LLMs.Alessio [00:26:57]: And can you just comment on different takes that people had? So maybe the most open is like auto-GPT. It's just kind of like just trying to do anything. It's like it's all magic. There's no way for you to do anything. Then you had the ADAPT, you know, we had David on the podcast. They're very like super hands-on with each individual customer to build super tailored. How do you decide where to draw the line between this is magic? This is exposed to you, especially in a market where most people don't know how to build with AI at all. So if you expect them to do the thing, they're probably not going to do it. Yeah, exactly.Stan [00:27:29]: So the auto-GPT approach obviously is extremely exciting, but we know that the agentic capability of models are not quite there yet. It just gets lost. So we're starting, we're starting where it works. Same with the XP one. And where it works is pretty simple. It's like simple workflows that involve a couple tools where you don't even need to have the model decide which tools it's used in the sense of you just want people to put it in the instructions. It's like take that page, do that search, pick up that document, do the work that I want in the format I want, and give me the results. There's no smartness there, right? In terms of orchestrating the tools, it's mostly using English for people to program a workflow where you don't have the constraint of having compatible API between the two.Swyx [00:28:17]: That kind of personal automation, would you say it's kind of like an LLM Zapier type ofStan [00:28:22]: thing?Swyx [00:28:22]: Like if this, then that, and then, you know, do this, then this. You're programming with English?Stan [00:28:28]: So you're programming with English. So you're just saying, oh, do this and then that. You can even create some form of APIs. You say, when I give you the command X, do this. When I give you the command Y, do this. And you describe the workflow. But you don't have to create boxes and create the workflow explicitly. It just needs to describe what are the tasks supposed to be and make the tool available to the agent. The tool can be a semantic search. The tool can be querying into a structured database. The tool can be searching on the web. And obviously, the interesting tools that we're only starting to scratch are actually creating external actions like reimbursing something on Stripe, sending an email, clicking on a button in the admin or something like that.Swyx [00:29:11]: Do you maintain all these integrations?Stan [00:29:13]: Today, we maintain most of the integrations. We do always have an escape hatch for people to kind of custom integrate. But the reality is that the reality of the market today is that people just want it to work, right? And so it's mostly us maintaining the integration. As an example, a very good source of information that is tricky to productize is Salesforce. Because Salesforce is basically a database and a UI. And they do the f**k they want with it. And so every company has different models and stuff like that. So right now, we don't support it natively. And the type of support or real native support will be slightly more complex than just osing into it, like is the case with Slack as an example. 
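As a rough illustration of the "programming with English" pattern Stan describes above (the workflow lives in plain instructions, the model is handed a couple of tools, and there is no explicit workflow graph), here is a minimal sketch using a generic tool-calling API. It assumes the openai Python SDK (v1+); the two tools, their stub bodies, and the instruction text are made up for illustration and are not Dust's actual tooling.

```python
# Sketch of an "instructions are the program" agent: the workflow is written in plain
# English and the model is given a small set of tools it may call.
# Assumes the openai Python SDK (v1+); the tools and their stub bodies are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def semantic_search(query: str) -> str:
    return "...top matching passages from the synced company docs..."   # stub connector

def fetch_page(url: str) -> str:
    return "...plain text of the page..."                               # stub connector

IMPL = {"semantic_search": semantic_search, "fetch_page": fetch_page}
TOOLS = [
    {"type": "function", "function": {
        "name": "semantic_search",
        "description": "Search the company's synced documents.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "fetch_page",
        "description": "Fetch the text of a web page.",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]}}},
]

INSTRUCTIONS = (
    "When I paste a customer question: search our docs for related material, "
    "fetch any page I link to, then answer in the format: Summary / Sources."
)

def run(user_message: str) -> str:
    messages = [{"role": "system", "content": INSTRUCTIONS},
                {"role": "user", "content": user_message}]
    for _ in range(8):  # a handful of tool round-trips is plenty for this kind of workflow
        resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:
            result = IMPL[call.function.name](**json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return "Stopped after too many tool calls."
```

Nothing here is smart orchestration: the instructions read like English, and swapping tools in or out does not require compatible APIs between them.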
Because it's probably going to be, oh, you want to connect your Salesforce to us? Give us the SQL. That's the Salesforce QL language. Give us the queries you want us to run on it and inject in the context of dust. So that's interesting how not only integrations are cool, and some of them require a bit of work on the user. And for some of them that are really valuable to our users, but we don't support yet, they can just build them internally and push the data to us.Swyx [00:30:18]: I think I understand the Salesforce thing. But let me just clarify, are you using browser automation because there's no API for something?Stan [00:30:24]: No, no, no, no. In that case, so we do have browser automation for all the use cases and apply the public web. But for most of the integration with the internal system of the company, it really runs through API.Swyx [00:30:35]: Haven't you felt the pull to RPA, browser automation, that kind of stuff?Stan [00:30:39]: I mean, what I've been saying for a long time, maybe I'm wrong, is that if the future is that you're going to stand in front of a computer and looking at an agent clicking on stuff, then I'll hit my computer. And my computer is a big Lenovo. It's black. Doesn't sound good at all compared to a Mac. And if the APIs are there, we should use them. There is going to be a long tail of stuff that don't have APIs, but as the world is moving forward, that's disappearing. So the core API value in the past has really been, oh, this old 90s product doesn't have an API. So I need to use the UI to automate. I think for most of the ICP companies, the companies that ICP for us, the scale ups that are between 500 and 5,000 people, tech companies, most of the SaaS they use have APIs. Now there's an interesting question for the open web, because there are stuff that you want to do that involve websites that don't necessarily have APIs. And the current state of web integration from, which is us and OpenAI and Anthropic, I don't even know if they have web navigation, but I don't think so. The current state of affair is really, really broken because you have what? You have basically search and headless browsing. But headless browsing, I think everybody's doing basically body.innertext and fill that into the model, right?Swyx [00:31:56]: MARK MIRCHANDANI There's parsers into Markdown and stuff.Stan [00:31:58]: FRANCESC CAMPOY I'm super excited by the companies that are exploring the capability of rendering a web page into a way that is compatible for a model, being able to maintain the selector. So that's basically the place where to click in the page through that process, expose the actions to the model, have the model select an action in a way that is compatible with model, which is not a big page of a full DOM that is very noisy, and then being able to decompress that back to the original page and take the action. And that's something that is really exciting and that will kind of change the level of things that agents can do on the web. That I feel exciting, but I also feel that the bulk of the useful stuff that you can do within the company can be done through API. The data can be retrieved by API. The actions can be taken through API.Swyx [00:32:44]: For listeners, I'll note that you're basically completely disagreeing with David Wan. FRANCESC CAMPOY Exactly, exactly. I've seen it since it's summer. ADEPT is where it is, and Dust is where it is. So Dust is still standing.Alessio [00:32:55]: Can we just quickly comment on function calling? 
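Before the conversation turns to function calling, here is a hedged sketch of the page-rendering idea Stan is excited about: compress a page into a compact list of actionable elements the model can reason over, while keeping a map from short ids back to real selectors so the chosen action can be replayed on the live page. It assumes beautifulsoup4, and the heuristics are illustrative only.

```python
# Compress a web page into a model-friendly action list while keeping a mapping from
# short element ids back to CSS selectors on the original page.
# Assumes beautifulsoup4; real systems need far better selector and labeling heuristics.
from bs4 import BeautifulSoup

ACTIONABLE = ["a", "button", "input", "select", "textarea"]

def css_path(el) -> str:
    # Very rough selector: prefer an id, otherwise walk up with tag:nth-of-type steps.
    if el.get("id"):
        return f"#{el['id']}"
    parts = []
    while el is not None and el.name not in (None, "[document]"):
        siblings = el.parent.find_all(el.name, recursive=False) if el.parent else [el]
        idx = siblings.index(el) + 1 if el in siblings else 1
        parts.append(f"{el.name}:nth-of-type({idx})")
        el = el.parent
    return " > ".join(reversed(parts))

def compress_page(html: str):
    soup = BeautifulSoup(html, "html.parser")
    lines, selector_map = [], {}
    for i, el in enumerate(soup.find_all(ACTIONABLE)):
        eid = f"e{i}"
        label = (el.get_text(" ", strip=True) or el.get("placeholder") or el.get("name") or "")[:60]
        selector_map[eid] = css_path(el)
        lines.append(f"[{eid}] <{el.name}> {label}")
    return "\n".join(lines), selector_map

# The compressed text goes into the model's context; when the model answers "click e3",
# selector_map["e3"] is what the browser automation actually clicks on the full page.
```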
You mentioned you don't need the models to be that smart to actually pick the tools. Have you seen the models not be good enough? Or is it just like, you just don't want to put the complexity in there? Like, is there any room for improvement left in function calling? Or do you feel you usually consistently get always the right response, the right parametersStan [00:33:13]: and all of that?Alessio [00:33:13]: FRANCESC CAMPOY So that's a tricky product question.Stan [00:33:15]: Because if the instructions are good and precise, then you don't have any issue, because it's scripted for you. And the model will just look at the scripts and just follow and say, oh, he's probably talking about that action, and I'm going to use it. And the parameters are kind of abused from the state of the conversation. I'll just go with it. If you provide a very high level, kind of an auto-GPT-esque level in the instructions and provide 16 different tools to your model, yes, we're seeing the models in that state making mistakes. And there is obviously some progress can be made on the capabilities. But the interesting part is that there is already so much work that can assist, augment, accelerate by just going with pretty simply scripted for actions agents. What I'm excited about by pushing our users to create rather simple agents is that once you have those working really well, you can create meta agents that use the agents as actions. And all of a sudden, you can kind of have a hierarchy of responsibility that will probably get you almost to the point of the auto-GPT value. It requires the construction of intermediary artifacts, but you're probably going to be able to achieve something great. I'll give you some example. We have our incidents are shared in Slack in a specific channel, or shipped are shared in Slack. We have a weekly meeting where we have a table about incidents and shipped stuff. We're not writing that weekly meeting table anymore. We have an assistant that just go find the right data on Slack and create the table for us. And that assistant works perfectly. It's trivially simple, right? Take one week of data from that channel and just create the table. And then we have in that weekly meeting, obviously some graphs and reporting about our financials and our progress and our ARR. And we've created assistants to generate those graphs directly. And those assistants works great. By creating those assistants that cover those small parts of that weekly meeting, slowly we're getting to in a world where we'll have a weekly meeting assistance. We'll just call it. You don't need to prompt it. You don't need to say anything. It's going to run those different assistants and get that notion page just ready. And by doing that, if you get there, and that's an objective for us to us using Dust, get there, you're saving an hour of company time every time you run it. Yeah.Alessio [00:35:28]: That's my pet topic of NPM for agents. How do you build dependency graphs of agents? And how do you share them? Because why do I have to rebuild some of the smaller levels of what you built already?Swyx [00:35:40]: I have a quick follow-up question on agents managing other agents. It's a topic of a lot of research, both from Microsoft and even in startups. What you've discovered best practice for, let's say like a manager agent controlling a bunch of small agents. It's two-way communication. 
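A structural sketch of the "agents as actions" idea Stan lays out above, with the weekly-meeting assistant as the running example. The call_llm stub and the two connector helpers are placeholders rather than Dust's API; the point is only the shape of the composition.

```python
# Sub-agents that each do one scripted thing, composed by a meta-agent that treats them
# as actions. All helpers below are stand-ins for a real model client and real connectors.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model provider here")

def fetch_slack_channel(channel: str, days: int) -> str:
    return "...messages from the last week..."           # stub connector

def fetch_metrics(name: str, weeks: int) -> str:
    return "...weekly ARR numbers..."                    # stub connector

def incident_table_agent() -> str:
    data = fetch_slack_channel("#incidents", days=7)
    return call_llm(f"Build a table of this week's incidents and ships from:\n{data}")

def arr_graph_agent() -> str:
    data = fetch_metrics("arr", weeks=12)
    return call_llm(f"Summarize the ARR trend and describe the chart to plot:\n{data}")

SUB_AGENTS = {"incidents": incident_table_agent, "arr": arr_graph_agent}

def weekly_meeting_agent() -> str:
    # Once each small agent is reliable on its own, the meta-agent is mostly orchestration.
    sections = {name: agent() for name, agent in SUB_AGENTS.items()}
    return call_llm("Assemble the weekly meeting page from these sections:\n\n"
                    + "\n\n".join(f"## {name}\n{body}" for name, body in sections.items()))
```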
I don't know if there should be a protocol format.Stan [00:35:59]: To be completely honest, the state we are at right now is creating the simple agents. So we haven't even explored yet the meta agents. We know it's there. We know it's going to be valuable. We know it's going to be awesome. But we're starting there because it's the simplest place to start. And it's also what the market understands. If you go to a company, random SaaS B2B company, not necessarily specialized in AI, and you take an operational team and you tell them, build some tooling for yourself, they'll understand the small agents. If you tell them, build AutoGP, they'll be like, Auto what?Swyx [00:36:31]: And I noticed that in your language, you're very much focused on non-technical users. You don't really mention API here. You mention instruction instead of system prompt, right? That's very conscious.Stan [00:36:41]: Yeah, it's very conscious. It's a mark of our designer, Ed, who kind of pushed us to create a friendly product. I was knee-deep into AI when I started, obviously. And my co-founder, Gabriel, was a Stripe as well. We started a company together that got acquired by Stripe 15 years ago. It was at Alain, a healthcare company in Paris. After that, it was a little bit less so knee-deep in AI, but really focused on product. And I didn't realize how important it is to make that technology not scary to end users. It didn't feel scary to me, but it was really seen by Ed, our designer, that it was feeling scary to the users. And so we were very proactive and very deliberate about creating a brand that feels not too scary and creating a wording and a language, as you say, that really tried to communicate the fact that it's going to be fine. It's going to be easy. You're going to make it.Alessio [00:37:34]: And another big point that David had about ADAPT is we need to build an environment for the agents to act. And then if you have the environment, you can simulate what they do. How's that different when you're interacting with APIs and you're kind of touching systems that you cannot really simulate? If you call it the Salesforce API, you're just calling it.Stan [00:37:52]: So I think that goes back to the DNA of the companies that are very different. ADAPT, I think, was a product company with a very strong research DNA, and they were still doing research. One of their goals was building a model. And that's why they raised a large amount of money, et cetera. We are 100% deliberately a product company. We don't do research. We don't train models. We don't even run GPUs. We're using the models that exist, and we try to push the product boundary as far as possible with the existing models. So that creates an issue. Indeed, so to answer your question, when you're interacting in the real world, well, you cannot simulate, so you cannot improve the models. Even improving your instructions is complicated for a builder. The hope is that you can use models to evaluate the conversations so that you can get at least feedback and you could get contradictive information about the performance of the assistance. But if you take actual trace of interaction of humans with those agents, it is even for us humans extremely hard to decide whether it was a productive interaction or a really bad interaction. You don't know why the person left. You don't know if they left happy or not. So being extremely, extremely, extremely pragmatic here, it becomes a product issue. 
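One concrete slice of the evaluation problem Stan describes, using a model to grade interaction traces so the builder gets at least some feedback, might look like the sketch below. It assumes the openai Python SDK; the rubric is invented for illustration and is not Dust's pipeline.

```python
# Rough sketch of model-graded feedback on an agent conversation trace.
# Assumes the openai Python SDK (v1+) and an OPENAI_API_KEY in the environment;
# the rubric is made up for illustration.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score the assistant conversation below from 1-5 on: task_completion, "
    "groundedness (did it stick to retrieved data?), and user_frustration "
    "(5 = no visible frustration). Return JSON with those three keys and a one-line rationale."
)

def grade_conversation(transcript: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# A judge like this only gives a weak signal (as noted above, even humans struggle to
# label these traces), so it complements rather than replaces end-user feedback.
```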
We have to build a product that identifies the end users to provide feedback so that as a first step, the person that is building the agent can iterate on it. As a second step, maybe later when we start training model and post-training, et cetera, we can optimize around that for each of those companies. Yeah.Alessio [00:39:17]: Do you see in the future products offering kind of like a simulation environment, the same way all SaaS now kind of offers APIs to build programmatically? Like in cybersecurity, there are a lot of companies working on building simulative environments so that then you can use agents like Red Team, but I haven't really seen that.Stan [00:39:34]: Yeah, no, me neither. That's a super interesting question. I think it's really going to depend on how much, because you need to simulate to generate data, you need to train data to train models. And the question at the end is, are we going to be training models or are we just going to be using frontier models as they are? On that question, I don't have a strong opinion. It might be the case that we'll be training models because in all of those AI first products, the model is so close to the product surface that as you get big and you want to really own your product, you're going to have to own the model as well. Owning the model doesn't mean doing the pre-training, that would be crazy. But at least having an internal post-training realignment loop, it makes a lot of sense. And so if we see many companies going towards that all the time, then there might be incentives for the SaaS's of the world to provide assistance in getting there. But at the same time, there's a tension because those SaaS, they don't want to be interacted by agents, they want the human to click on the button. Yeah, they got to sell seats. Exactly.Swyx [00:40:41]: Just a quick question on models. I'm sure you've used many, probably not just OpenAI. Would you characterize some models as better than others? Do you use any open source models? What have been the trends in models over the last two years?Stan [00:40:53]: We've seen over the past two years kind of a bit of a race in between models. And at times, it's the OpenAI model that is the best. At times, it's the Anthropic models that is the best. Our take on that is that we are agnostic and we let our users pick their model. Oh, they choose? Yeah, so when you create an assistant or an agent, you can just say, oh, I'm going to run it on GP4, GP4 Turbo, or...Swyx [00:41:16]: Don't you think for the non-technical user, that is actually an abstraction that you should take away from them?Stan [00:41:20]: We have a sane default. So we move the default to the latest model that is cool. And we have a sane default, and it's actually not very visible. In our flow to create an agent, you would have to go in advance and go pick your model. So this is something that the technical person will care about. But that's something that obviously is a bit too complicated for the...Swyx [00:41:40]: And do you care most about function calling or instruction following or something else?Stan [00:41:44]: I think we care most for function calling because you want to... There's nothing worse than a function call, including incorrect parameters or being a bit off because it just drives the whole interaction off.Swyx [00:41:56]: Yeah, so got the Berkeley function calling.Stan [00:42:00]: These days, it's funny how the comparison between GP4O and GP4 Turbo is still up in the air on function calling. 
I personally don't have proof, but I know many people, and I'm probably part of them, to think that GPT-4 Turbo is still better than GPT-4o on function calling. Wow. We'll see what comes out of the O1 class if it ever gets function calling. And Claude 3.5 Sonnet is great as well. They kind of innovated in an interesting way, which was never quite publicized. But it's that they have that kind of chain of thought step whenever you use a Claude model or Sonnet model with function calling. That chain of thought step doesn't exist when you just interact with it just for answering questions. But when you use function calling, you get that step, and it really helps getting better function calling.Swyx [00:42:43]: Yeah, we actually just recorded a podcast with the Berkeley team that runs that leaderboard this week. So they just released V3.Stan [00:42:49]: Yeah.Swyx [00:42:49]: It was V1 like two months ago, and then they V2, V3. Turbo is on top.Stan [00:42:53]: Turbo is on top. Turbo is over 4o.Swyx [00:42:54]: And then the third place is xLAM from Salesforce, which is a large action model they've been trying to popularize.Stan [00:43:01]: Yep.Swyx [00:43:01]: O1 Mini is actually on here, I think. O1 Mini is number 11.Stan [00:43:05]: But arguably, O1 Mini has been in a line for that. Yeah.Alessio [00:43:09]: Do you use leaderboards? Do you have your own evals? I mean, this is kind of intuitive, right? Like using the older model is better. I think most people just upgrade. Yeah. What's the eval process like?Stan [00:43:19]: It's funny because I've been doing research for three years, and we have bigger stuff to cook. When you're deploying in a company, one thing where we really spike is that when we manage to activate the company, we have a crazy penetration. The highest penetration we have is 88% daily active users across the entire employee base of the company. The kind of average penetration and activation we have in our current enterprise customers is more like 60% to 70% weekly active. So we basically have the entire company interacting with us. And when you're there, there is so much stuff that matters more than getting evals or getting the best model. Because there are so many places where you can create products or do stuff that will give you the 80% with the work you do. Whereas deciding if it's GPT-4 or GPT-4 Turbo or et cetera, you know, it'll just give you the 5% improvement. But the reality is that you want to focus on the places where you can really change the direction or change the interaction more drastically. But that's something that we'll have to do eventually because we still want to be serious people.Swyx [00:44:24]: It's funny because in some ways, the model labs are competing for you, right? You don't have to do any effort. You just switch model and then it'll grow. What are you really limited by? Is it additional sources?Stan [00:44:36]: It's not models, right?Swyx [00:44:37]: You're not really limited by quality of model.Stan [00:44:40]: Right now, we are limited by the infrastructure part, which is the ability to connect easily for users to all the data they need to do the job they want to do.Swyx [00:44:51]: Because you maintain all your own stuff.Stan [00:44:53]: You know, there are companies out thereSwyx [00:44:54]: that are starting to provide integrations as a service, right? I used to work in an integrations company.
Yeah, I know.Stan [00:44:59]: It's just that there is some intricacies about how you chunk stuff and how you process information from one platform to the other. If you look at the end of the spectrum, you could think of, you could say, oh, I'm going to support AirByte and AirByte has- I used to work at AirByte.Swyx [00:45:12]: Oh, really?Stan [00:45:13]: That makes sense.Swyx [00:45:14]: They're the French founders as well.Stan [00:45:15]: I know Jean very well. I'm seeing him today. And the reality is that if you look at Notion, AirByte does the job of taking Notion and putting it in a structured way. But that's the way it is not really usable to actually make it available to models in a useful way. Because you get all the blocks, details, et cetera, which is useful for many use cases.Swyx [00:45:35]: It's also for data scientists and not for AI.Stan [00:45:38]: The reality of Notion is that sometimes you have a- so when you have a page, there's a lot of structure in it and you want to capture the structure and chunk the information in a way that respects that structure. In Notion, you have databases. Sometimes those databases are real tabular data. Sometimes those databases are full of text. You want to get the distinction and understand that this database should be considered like text information, whereas this other one is actually quantitative information. And to really get a very high quality interaction with that piece of information, I haven't found a solution that will work without us owning the connection end-to-end.Swyx [00:46:15]: That's why I don't invest in, there's Composio, there's All Hands from Graham Newbig. There's all these other companies that are like, we will do the integrations for you. You just, we have the open source community. We'll do off the shelf. But then you are so specific in your needs that you want to own it.Swyx [00:46:28]: Yeah, exactly.Stan [00:46:29]: You can talk to Michel about that.Swyx [00:46:30]: You know, he wants to put the AI in there, but you know. Yeah, I will. I will.Stan [00:46:35]: Cool. What are we missing?Alessio [00:46:36]: You know, what are like the things that are like sneakily hard that you're tackling that maybe people don't even realize they're like really hard?Stan [00:46:43]: The real parts as we kind of touch base throughout the conversation is really building the infra that works for those agents because it's a tenuous walk. It's an evergreen piece of work because you always have an extra integration that will be useful to a non-negligible set of your users. I'm super excited about is that there's so many interactions that shouldn't be conversational interactions and that could be very useful. Basically, know that we have the firehose of information of those companies and there's not going to be that many companies that capture the firehose of information. When you have the firehose of information, you can do a ton of stuff with models that are just not accelerating people, but giving them superhuman capability, even with the current model capability because you can just sift through much more information. An example is documentation repair. If I have the firehose of Slack messages and new Notion pages, if somebody says, I own that page, I want to be updated when there is a piece of information that should update that page, this is not possible. You get an email saying, oh, look at that Slack message. It says the opposite of what you have in that paragraph. Maybe you want to update or just ping that person. 
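A hedged sketch of the "documentation repair" idea above: for every new Slack message in the firehose, ask a model whether it contradicts a page someone has claimed ownership of, and ping the owner. The ask_model stub and the page and owner values are hypothetical placeholders for whatever model client and connectors you already run.

```python
# Sketch of the "documentation repair" idea: for each new Slack message, ask a model
# whether it contradicts a watched page, and ping that page's owner if so.
WATCHED_PAGES = {
    "notion/runbook-oncall": "@stan",   # page id -> owner to ping (illustrative values)
}

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here; return 'yes' or 'no'")

def check_message_against_docs(slack_message: str, get_page_text) -> list[str]:
    alerts = []
    for page_id, owner in WATCHED_PAGES.items():
        page = get_page_text(page_id)   # inject your own Notion/Confluence fetcher
        verdict = ask_model(
            "Does the following Slack message contradict or outdate anything in the page? "
            "Answer yes or no.\n\n"
            f"SLACK MESSAGE:\n{slack_message}\n\nPAGE:\n{page}"
        )
        if verdict.strip().lower().startswith("yes"):
            alerts.append(f"Ping {owner}: '{page_id}' may be outdated by: {slack_message[:120]}")
    return alerts

# Running this over the firehose is cheap relative to the cost of stale documentation,
# which is the kind of "superhuman sifting" Stan is pointing at.
```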
I think there is a lot to be explored on the product layer in terms of what it means to interact productively with those models. And that's a problem that's extremely hard and extremely exciting.Swyx [00:48:00]: One thing you keep mentioning about infra work, obviously, Dust is building that infra and serving that in a very consumer-friendly way. You always talk about infra being additional sources, additional connectors. That is very important. But I'm also interested in the vertical infra. There is an orchestrator underlying all these things where you're doing asynchronous work. For example, the simplest one is a cron job. You just schedule things. But also, for if this and that, you have to wait for something to be executed and proceed to the next task. I used to work on an orchestrator as well, Temporal.Stan [00:48:31]: We used Temporal. Oh, you used Temporal? Yeah. Oh, how was the experience?Swyx [00:48:34]: I need the NPS.Stan [00:48:36]: We're doing a self-discovery call now.Swyx [00:48:39]: But you can also complain to me because I don't work there anymore.Stan [00:48:42]: No, we love Temporal. There's some edges that are a bit rough, surprisingly rough. And you would say, why is it so complicated?Swyx [00:48:49]: It's always versioning.Stan [00:48:50]: Yeah, stuff like that. But we really love it. And we use it for exactly what you said, like managing the entire set of stuff that needs to happen so that in semi-real time, we get all the updates from Slack or Notion or GitHub into the system. And whenever we see that piece of information goes through, maybe trigger workflows to run agents because they need to provide alerts to users and stuff like that. And Temporal is great. Love it.Swyx [00:49:17]: You haven't evaluated others. You don't want to build your own. You're happy with...Stan [00:49:21]: Oh, no, we're not in the business of replacing Temporal. And Temporal is so... I mean, it is or any other competitive product. They're very general. If it's there, there's an interesting theory about buy versus build. I think in that case, when you're a high-growth company, your buy-build trade-off is very much on the side of buy. Because if you have the capability, you're just going to be saving time, you can focus on your core competency, etc. And it's funny because we're seeing, we're starting to see the post-high-growth company, post-SKF company, going back on that trade-off, interestingly. So that's the cloud news about removing Zendesk and Salesforce. Do you believe that, by the way?Alessio [00:49:56]: Yeah, I did a podcast with them.Stan [00:49:58]: Oh, yeah?Alessio [00:49:58]: It's true.Swyx [00:49:59]: No, no, I know.Stan [00:50:00]: Of course they say it's true,Swyx [00:50:00]: but also how well is it going to go?Stan [00:50:02]: So I'm not talking about deflecting the customer traffic. I'm talking about building AI on top of Salesforce and Zendesk, basically, if I understand correctly. And all of a sudden, your product surface becomes much smaller because you're interacting with an AI system that will take some actions. And so all of a sudden, you don't need the product layer anymore. And you realize that, oh, those things are just databases that I pay a hundred times the price, right? Because you're a post-SKF company and you have tech capabilities, you are incentivized to reduce your costs and you have the capability to do so. And then it makes sense to just scratch the SaaS away. 
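For the orchestration layer discussed above, a minimal Temporal sketch of the sync-then-trigger loop: one activity pulls fresh updates from a connector, and if anything changed, a second one kicks off the agents watching that source. It assumes the temporalio Python SDK; connector and agent internals are stubbed.

```python
# Sketch of the kind of Temporal workflow described above: pull fresh updates from a
# connector, and if anything changed, kick off the agents that watch that source.
# Assumes the temporalio Python SDK; connector and agent internals are stubbed.
from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def sync_connector(source: str) -> int:
    # Stub: call the Slack/Notion/GitHub API, upsert documents, return a change count.
    return 0

@activity.defn
async def trigger_agents(source: str) -> None:
    # Stub: enqueue the agents (alerts, doc repair, etc.) that subscribe to this source.
    ...

@workflow.defn
class ConnectorSyncWorkflow:
    @workflow.run
    async def run(self, source: str) -> None:
        changed = await workflow.execute_activity(
            sync_connector, source, start_to_close_timeout=timedelta(minutes=10)
        )
        if changed:
            await workflow.execute_activity(
                trigger_agents, source, start_to_close_timeout=timedelta(minutes=2)
            )

# In practice you would run this on a Temporal Schedule (or cron) per connector, with
# retries on the activities; that is the "boring but valuable" pipe-maintenance work.
```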
So it's interesting that we might see kind of a bad time for SaaS in post-hyper-growth tech companies. So it's still a big market, but it's not that big because if you're not a tech company, you don't have the capabilities to reduce that cost. If you're a high-growth company, always going to be buying because you go faster with that. But that's an interesting new space, new category of companies that might remove some SaaS. Yeah, Alessio's firmSwyx [00:51:02]: has an interesting thesis on the future of SaaS in AI.Alessio [00:51:05]: Service as a software, we call it. It's basically like, well, the most extreme is like, why is there any software at all? You know, ideally, it's all a labor interface where you're asking somebody to do something for you, whether that's a person, an AI agent or whatnot.Stan [00:51:17]: Yeah, yeah, that's interesting. I have to ask.Swyx [00:51:19]: Are you paying for Temporal Cloud or are you self-hosting?Stan [00:51:22]: Oh, no, no, we're paying, we're paying. Oh, okay, interesting.Swyx [00:51:24]: We're paying way too much.Stan [00:51:26]: It's crazy expensive, but it makes us-Swyx [00:51:28]: That's why as a shareholder, I like to hear that. It makes us go faster,Stan [00:51:31]: so we're happy to pay.Swyx [00:51:33]: Other things in the infrastack, I just want a list for other founders to think about. Ops, API gateway, evals, you know, anything interesting there that you build or buy?Stan [00:51:41]: I mean, there's always an interesting question. We've been building a lot around the interface between models and because Dust, the original version, was an orchestration platform and we basically provide a unified interface to every model providers.Swyx [00:51:56]: That's what I call gateway.Stan [00:51:57]: That we add because Dust was that and so we continued building upon and we own it. But that's an interesting question was in you, you want to build that or buy it?Swyx [00:52:06]: Yeah, I always say light LLM is the current open source consensus.Stan [00:52:09]: Exactly, yeah. There's an interesting question there.Swyx [00:52:12]: Ops, Datadog, just tracking.Stan [00:52:14]: Oh yeah, so Datadog is an obvious... What are the mistakes that I regret? I started as pure JavaScript, not TypeScript, and I think you want to, if you're wondering, oh, I want to go fast, I'll do a little bit of JavaScript. No, don't, just start with TypeScript. I see, okay.Swyx [00:52:30]: So interesting, you are a research engineer that came out of OpenAI that bet on TypeScript.Stan [00:52:36]: Well, the reality is that if you're building a product, you're going to be doing a lot of JavaScript, right? And Next, we're using Next as an example. It's
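The "unified interface to every model provider" Stan mentions is essentially a thin gateway with a sane default. A minimal sketch, assuming the openai and anthropic Python SDKs; the model ids and the default choice are illustrative, not Dust's configuration.

```python
# Minimal model gateway with a sane default: builders can ignore the model choice
# entirely, and a technical user can override it per agent.
# Assumes the openai and anthropic Python SDKs with API keys in the environment.
from openai import OpenAI
from anthropic import Anthropic

DEFAULT_MODEL = "gpt-4o"
_openai = OpenAI()
_anthropic = Anthropic()

def complete(prompt: str, model: str = DEFAULT_MODEL) -> str:
    if model.startswith("claude"):
        resp = _anthropic.messages.create(
            model=model, max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    resp = _openai.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# complete("Summarize this incident thread...")                                        # sane default
# complete("Summarize this incident thread...", model="claude-3-5-sonnet-20240620")    # explicit override
```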

Project Geospatial
FOSS4G NA 2024 - AI Wrangling in the Early 21st Century - Mark Mathis

Project Geospatial

Play Episode Listen Later Oct 28, 2024 20:01


Mark Mathis discusses the integration of artificial intelligence with geospatial data, showcasing tools and techniques to enhance data accessibility and usability. Highlights

The Analytics Engineering Podcast
The current state of the AI ecosystem (w/ Julia Schottenstein)

The Analytics Engineering Podcast

Play Episode Listen Later Oct 6, 2024 45:44


Former co-host Julia Schottenstein returns to the show to go deep into the world of LLMs. Julia joined LangChain as an early employee, in Tristan's words, to “Basically solve all of the problems that aren't specifically in product and engineering.” LangChain has become one of the primary frameworks, if not the primary framework, for developing applications with large language models. There are over a million developers using LangChain today, building everything from prototypes to production AI applications.

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

OpenAI DevDay is almost here! Per tradition, we are hosting a DevDay pregame event for everyone coming to town! Join us with demos and gossip!Also sign up for related events across San Francisco: the AI DevTools Night, the xAI open house, the Replicate art show, the DevDay Watch Party (for non-attendees), Hack Night with OpenAI at Cloudflare. For everyone else, join the Latent Space Discord for our online watch party and find fellow AI Engineers in your city.OpenAI's recent o1 release (and Reflection 70b debacle) has reignited broad interest in agentic general reasoning and tree search methods.While we have covered some of the self-taught reasoning literature on the Latent Space Paper Club, it is notable that the Eric Zelikman ended up at xAI, whereas OpenAI's hiring of Noam Brown and now Shunyu suggests more interest in tool-using chain of thought/tree of thought/generator-verifier architectures for Level 3 Agents.We were more than delighted to learn that Shunyu is a fellow Latent Space enjoyer, and invited him back (after his first appearance on our NeurIPS 2023 pod) for a look through his academic career with Harrison Chase (one year after his first LS show).ReAct: Synergizing Reasoning and Acting in Language Modelspaper linkFollowing seminal Chain of Thought papers from Wei et al and Kojima et al, and reflecting on lessons from building the WebShop human ecommerce trajectory benchmark, Shunyu's first big hit, the ReAct paper showed that using LLMs to “generate both reasoning traces and task-specific actions in an interleaved manner” achieved remarkably greater performance (less hallucination/error propagation, higher ALFWorld/WebShop benchmark success) than CoT alone. In even better news, ReAct scales fabulously with finetuning:As a member of the elite Princeton NLP group, Shunyu was also a coauthor of the Reflexion paper, which we discuss in this pod.Tree of Thoughtspaper link hereShunyu's next major improvement on the CoT literature was Tree of Thoughts:Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role…ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices.The beauty of ToT is it doesnt require pretraining with exotic methods like backspace tokens or other MCTS architectures. You can listen to Shunyu explain ToT in his own words on our NeurIPS pod, but also the ineffable Yannic Kilcher:Other WorkWe don't have the space to summarize the rest of Shunyu's work, you can listen to our pod with him now, and recommend the CoALA paper and his initial hit webinar with Harrison, today's guest cohost:as well as Shunyu's PhD Defense Lecture:as well as Shunyu's latest lecture covering a Brief History of LLM Agents:As usual, we are live on YouTube! 
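For readers who want the mechanics behind the ReAct summary above, the loop reduces to interleaving free-text thoughts with tool actions and feeding each observation back into the context. A toy sketch follows; the prompt format and the single search tool are illustrative, not the paper's exact few-shot setup.

```python
# Toy ReAct loop: the model interleaves free-text thoughts with actions, and each
# action's observation is appended back into the prompt.
import re

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat/completions model here")

def search(query: str) -> str:
    return "...first paragraph of the best-matching Wikipedia page..."   # stub tool

REACT_HEADER = (
    "Answer the question. At each step output either\n"
    "Thought: <reasoning>\nAction: search[<query>]\n"
    "or, when done,\nThought: <reasoning>\nAction: finish[<answer>]\n"
)

def react(question: str, max_steps: int = 6) -> str:
    trace = f"{REACT_HEADER}\nQuestion: {question}\n"
    for _ in range(max_steps):
        step = llm(trace)                        # expected to end with an Action line
        trace += step + "\n"
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if not match:
            continue
        action, arg = match.group(1), match.group(2)
        if action == "finish":
            return arg
        observation = search(arg) if action == "search" else f"Unknown action {action}"
        trace += f"Observation: {observation}\n"  # the new "insight" goes back into context
    return "No answer within the step budget."
```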
Show Notes* Harrison Chase* LangChain, LangSmith, LangGraph* Shunyu Yao* Alec Radford* ReAct Paper* Hotpot QA* Tau Bench* WebShop* SWE-Agent* SWE-Bench* Tree of Thoughts* CoALA Paper* Related Episodes* Our Thomas Scialom (Meta) episode* Shunyu on our NeurIPS 2023 Best Papers episode* Harrison on our LangChain episode* Mentions* Sierra* Voyager* Jason Wei* Tavily* SERP API* Exa
Timestamps* [00:00:00] Opening Song by Suno* [00:03:00] Introductions* [00:06:16] The ReAct paper* [00:12:09] Early applications of ReAct in LangChain* [00:17:15] Discussion of the Reflection paper* [00:22:35] Tree of Thoughts paper and search algorithms in language models* [00:27:21] SWE-Agent and SWE-Bench for coding benchmarks* [00:39:21] CoALA: Cognitive Architectures for Language Agents* [00:45:24] Agent-Computer Interfaces (ACI) and tool design for agents* [00:49:24] Designing frameworks for agents vs humans* [00:53:52] UX design for AI applications and agents* [00:59:53] Data and model improvements for agent capabilities* [01:19:10] TauBench* [01:23:09] Promising areas for AI
Transcript
Alessio [00:00:01]: Hey, everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:12]: Hey, and today we have a super special episode. I actually always wanted to take like a selfie and go like, you know, POV, you're about to revolutionize the world of agents because we have two of the most awesome hiring agents in the house. So first, we're going to welcome back Harrison Chase. Welcome. Excited to be here. What's new with you recently in sort of like the 10, 20 second recap?Harrison [00:00:34]: LangChain, LangSmith, LangGraph, pushing on all of them. Lots of cool stuff related to a lot of the stuff that we're going to talk about today, probably.Swyx [00:00:42]: Yeah.Alessio [00:00:43]: We'll mention it in there. And the Celtics won the title.Swyx [00:00:45]: And the Celtics won the title. You got that going on for you. I don't know. Is that like floorball? Handball? Baseball? Basketball.Alessio [00:00:52]: Basketball, basketball.Harrison [00:00:53]: Patriots aren't looking good though, so that's...Swyx [00:00:56]: And then Shunyu, you've also been on the pod, but only in like a sort of oral paper presentation capacity. But welcome officially to the Latent Space pod.Shunyu [00:01:03]: Yeah, I've been a huge fan. So thanks for the invitation. Thanks.Swyx [00:01:07]: Well, it's an honor to have you on. You're one of like, you're maybe the first PhD thesis defense I've ever watched in like this AI world, because most people just publish single papers, but every paper of yours is a banger. So congrats.Shunyu [00:01:22]: Thanks.Swyx [00:01:24]: Yeah, maybe we'll just kick it off with, you know, what was your journey into using language models for agents? I like that your thesis advisor, I didn't catch his name, but he was like, you know... Karthik. Yeah. It's like, this guy just wanted to use language models and it was such a controversial pick at the time. Right.Shunyu [00:01:39]: The full story is that in undergrad, I did some computer vision research and that's how I got into AI. But at the time, I feel like, you know, you're just composing all the GAN or 3D perception or whatever together and it's not exciting anymore. And one day I just see this transformer paper and that's really cool. But I really got into language models only when I entered my PhD and met my advisor Karthik.
So he was actually the second author of GPT-1 when he was like a visiting scientist at OpenAI. With Alec Redford?Swyx [00:02:10]: Yes.Shunyu [00:02:11]: Wow. That's what he told me. It's like back in OpenAI, they did this GPT-1 together and Ilya just said, Karthik, you should stay because we just solved the language. But apparently Karthik is not fully convinced. So he went to Princeton, started his professorship and I'm really grateful. So he accepted me as a student, even though I have no prior knowledge in NLP. And you know, we just met for the first time and he's like, you know, what do you want to do? And I'm like, you know, you have done those test game scenes. That's really cool. I wonder if we can just redo them with language models. And that's how the whole journey began. Awesome.Alessio [00:02:46]: So GPT-2 was out at the time? Yes, that was 2019.Shunyu [00:02:48]: Yeah.Alessio [00:02:49]: Way too dangerous to release. And then I guess the first work of yours that I came across was React, which was a big part of your defense. But also Harrison, when you came on The Pockets last year, you said that was one of the first papers that you saw when you were getting inspired for BlankChain. So maybe give a recap of why you thought it was cool, because you were already working in AI and machine learning. And then, yeah, you can kind of like intro the paper formally. What was that interesting to you specifically?Harrison [00:03:16]: Yeah, I mean, I think the interesting part was using these language models to interact with the outside world in some form. And I think in the paper, you mostly deal with Wikipedia. And I think there's some other data sets as well. But the outside world is the outside world. And so interacting with things that weren't present in the LLM and APIs and calling into them and thinking about the React reasoning and acting and kind of like combining those together and getting better results. I'd been playing around with LLMs, been talking with people who were playing around with LLMs. People were trying to get LLMs to call into APIs, do things, and it was always, how can they do it more reliably and better? And so this paper was basically a step in that direction. And I think really interesting and also really general as well. Like I think that's part of the appeal is just how general and simple in a good way, I think the idea was. So that it was really appealing for all those reasons.Shunyu [00:04:07]: Simple is always good. Yeah.Alessio [00:04:09]: Do you have a favorite part? Because I have one favorite part from your PhD defense, which I didn't understand when I read the paper, but you said something along the lines, React doesn't change the outside or the environment, but it does change the insight through the context, putting more things in the context. You're not actually changing any of the tools around you to work for you, but you're changing how the model thinks. And I think that was like a very profound thing when I, not that I've been using these tools for like 18 months. I'm like, I understand what you meant, but like to say that at the time you did the PhD defense was not trivial. Yeah.Shunyu [00:04:41]: Another way to put it is like thinking can be an extra tool that's useful.Alessio [00:04:47]: Makes sense. Checks out.Swyx [00:04:49]: Who would have thought? I think it's also more controversial within his world because everyone was trying to use RL for agents. And this is like the first kind of zero gradient type approach. 
Yeah.Shunyu [00:05:01]: I think the bigger kind of historical context is that we have this two big branches of AI. So if you think about RL, right, that's pretty much the equivalent of agent at a time. And it's like agent is equivalent to reinforcement learning and reinforcement learning is equivalent to whatever game environment they're using, right? Atari game or go or whatever. So you have like a pretty much, you know, you have a biased kind of like set of methodologies in terms of reinforcement learning and represents agents. On the other hand, I think NLP is like a historical kind of subject. It's not really into agents, right? It's more about reasoning. It's more about solving those concrete tasks. And if you look at SEL, right, like each task has its own track, right? Summarization has a track, question answering has a track. So I think really it's about rethinking agents in terms of what could be the new environments that we came to have is not just Atari games or whatever video games, but also those text games or language games. And also thinking about, could there be like a more general kind of methodology beyond just designing specific pipelines for each NLP task? That's like the bigger kind of context, I would say.Alessio [00:06:14]: Is there an inspiration spark moment that you remember or how did you come to this? We had Trida on the podcast and he mentioned he was really inspired working with like systems people to think about Flash Attention. What was your inspiration journey?Shunyu [00:06:27]: So actually before React, I spent the first two years of my PhD focusing on text-based games, or in other words, text adventure games. It's a very kind of small kind of research area and quite ad hoc, I would say. And there are like, I don't know, like 10 people working on that at the time. And have you guys heard of Zork 1, for example? So basically the idea is you have this game and you have text observations, like you see a monster, you see a dragon.Swyx [00:06:57]: You're eaten by a grue.Shunyu [00:06:58]: Yeah, you're eaten by a grue. And you have actions like kill the grue with a sword or whatever. And that's like a very typical setup of a text game. So I think one day after I've seen all the GPT-3 stuff, I just think about, you know, how can I solve the game? Like why those AI, you know, machine learning methods are pretty stupid, but we are pretty good at solving the game relatively, right? So for the context, the predominant method to solve this text game is obviously reinforcement learning. And the idea is you just try out an arrow in those games for like millions of steps and you kind of just overfit to the game. But there's no language understanding at all. And I'm like, why can't I solve the game better? And it's kind of like, because we think about the game, right? Like when we see this very complex text observation, like you see a grue and you might see a sword, you know, in the right of the room and you have to go through the wooden door to go to that room. You will think, you know, oh, I have to kill the monster and to kill that monster, I have to get the sword, I have to get the sword, I have to go, right? And this kind of thinking actually helps us kind of throw shots off the game. And it's like, why don't we also enable the text agents to think? And that's kind of the prototype of React. And I think that's actually very interesting because the prototype, I think, was around November of 2021. So that's even before like chain of thought or whatever came up. 
So we did a bunch of experiments in the text game, but it was not really working that well. Like those text games are just too hard. I think today it's still very hard. Like if you use GPD 4 to solve it, it's still very hard. So the change came when I started the internship in Google. And apparently Google care less about text game, they care more about what's more practical. So pretty much I just reapplied the idea, but to more practical kind of environments like Wikipedia or simpler text games like Alphard, and it just worked. It's kind of like you first have the idea and then you try to find the domains and the problems to demonstrate the idea, which is, I would say, different from most of the AI research, but it kind of worked out for me in that case.Swyx [00:09:09]: For Harrison, when you were implementing React, what were people applying React to in the early days?Harrison [00:09:14]: I think the first demo we did probably had like a calculator tool and a search tool. So like general things, we tried to make it pretty easy to write your own tools and plug in your own things. And so this is one of the things that we've seen in LangChain is people who build their own applications generally write their own tools. Like there are a few common ones. I'd say like the three common ones might be like a browser, a search tool, and a code interpreter. But then other than that-Swyx [00:09:37]: The LMS. Yep.Harrison [00:09:39]: Yeah, exactly. It matches up very nice with that. And we actually just redid like our integrations docs page, and if you go to the tool section, they like highlight those three, and then there's a bunch of like other ones. And there's such a long tail of other ones. But in practice, like when people go to production, they generally have their own tools or maybe one of those three, maybe some other ones, but like very, very few other ones. So yeah, I think the first demos was a search and a calculator one. And there's- What's the data set?Shunyu [00:10:04]: Hotpot QA.Harrison [00:10:05]: Yeah. Oh, so there's that one. And then there's like the celebrity one by the same author, I think.Swyx [00:10:09]: Olivier Wilde's boyfriend squared. Yeah. 0.23. Yeah. Right, right, right.Harrison [00:10:16]: I'm forgetting the name of the author, but there's-Swyx [00:10:17]: I was like, we're going to over-optimize for Olivier Wilde's boyfriend, and it's going to change next year or something.Harrison [00:10:21]: There's a few data sets kind of like in that vein that require multi-step kind of like reasoning and thinking. So one of the questions I actually had for you in this vein, like the React paper, there's a few things in there, or at least when I think of that, there's a few things that I think of. There's kind of like the specific prompting strategy. Then there's like this general idea of kind of like thinking and then taking an action. And then there's just even more general idea of just like taking actions in a loop. Today, like obviously language models have changed a lot. We have tool calling. The specific prompting strategy probably isn't used super heavily anymore. Would you say that like the concept of React is still used though? Or like do you think that tool calling and running tool calling in a loop, is that ReactSwyx [00:11:02]: in your mind?Shunyu [00:11:03]: I would say like it's like more implicitly used than explicitly used. To be fair, I think the contribution of React is actually twofold. 
So first is this idea of, you know, we should be able to use calls in a very general way. Like there should be a single kind of general method to handle interaction with various environments. I think React is the first paper to demonstrate the idea. But then I think later there are two form or whatever, and this becomes like a trivial idea. But I think at the time, that's like a pretty non-trivial thing. And I think the second contribution is this idea of what people call like inner monologue or thinking or reasoning or whatever, to be paired with tool use. I think that's still non-trivial because if you look at the default function calling or whatever, like there's no inner monologue. And in practice, that actually is important, especially if the tool that you use is pretty different from the training distribution of the language model. I think those are the two main things that are kind of inherited.Harrison [00:12:10]: On that note, I think OpenAI even recommended when you're doing tool calling, it's sometimes helpful to put a thought field in the tool, along with all the actual acquired arguments,Swyx [00:12:19]: and then have that one first.Harrison [00:12:20]: So it fills out that first, and they've shown that that's yielded better results. The reason I ask is just like this same concept is still alive, and I don't know whether to call it a React agent or not. I don't know what to call it. I think of it as React, like it's the same ideas that were in the paper, but it's obviously a very different implementation at this point in time. And so I just don't know what to call it.Shunyu [00:12:40]: I feel like people will sometimes think more in terms of different tools, right? Because if you think about a web agent versus, you know, like a function calling agent, calling a Python API, you would think of them as very different. But in some sense, the methodology is the same. It depends on how you view them, right? I think people will tend to think more in terms of the environment and the tools rather than the methodology. Or, in other words, I think the methodology is kind of trivial and simple, so people will try to focus more on the different tools. But I think it's good to have a single underlying principle of those things.Alessio [00:13:17]: How do you see the surface of React getting molded into the model? So a function calling is a good example of like, now the model does it. What about the thinking? Now most models that you use kind of do chain of thought on their own, they kind of produce steps. Do you think that more and more of this logic will be in the model? Or do you think the context window will still be the main driver of reasoning and thinking?Shunyu [00:13:39]: I think it's already default, right? You do some chain of thought and you do some tool call, the cost of adding the chain of thought is kind of relatively low compared to other things. So it's not hurting to do that. And I think it's already kind of common practice, I would say.Swyx [00:13:56]: This is a good place to bring in either Tree of Thought or Reflection, your pick.Shunyu [00:14:01]: Maybe Reflection, to respect the time order, I would say.Swyx [00:14:05]: Any backstory as well, like the people involved with NOAA and the Princeton group. We talked about this offline, but people don't understand how these research pieces come together and this ideation.Shunyu [00:14:15]: I think Reflection is mostly NOAA's work, I'm more like advising kind of role. 
The story is, I don't remember the time, but one day we just see this pre-print that's like Reflection and Autonomous Agent with memory or whatever. And it's kind of like an extension to React, which uses this self-reflection. I'm like, oh, somehow you've become very popular. And NOAA reached out to me, it's like, do you want to collaborate on this and make this from an archive pre-print to something more solid, like a conference submission? I'm like, sure. We started collaborating and we remain good friends today. And I think another interesting backstory is NOAA was contacted by OpenAI at the time. It's like, this is pretty cool, do you want to just work at OpenAI? And I think Sierra also reached out at the same time. It's like, this is pretty cool, do you want to work at Sierra? And I think NOAA chose Sierra, but it's pretty cool because he was still like a second year undergrad and he's a very smart kid.Swyx [00:15:16]: Based on one paper. Oh my god.Shunyu [00:15:19]: He's done some other research based on programming language or chemistry or whatever, but I think that's the paper that got the attention of OpenAI and Sierra.Swyx [00:15:28]: For those who haven't gone too deep on it, the way that you present the inside of React, can you do that also for reflection? Yeah.Shunyu [00:15:35]: I think one way to think of reflection is that the traditional idea of reinforcement learning is you have a scalar reward and then you somehow back-propagate the signal of the scalar reward to the rest of your neural network through whatever algorithm, like policy grading or A2C or whatever. And if you think about the real life, most of the reward signal is not scalar. It's like your boss told you, you should have done a better job in this, but you could jump on that or whatever. It's not like a scalar reward, like 29 or something. I think in general, humans deal more with long scalar reward, or you can say language feedback. And the way that they deal with language feedback also has this back-propagation process, right? Because you start from this, you did a good job on job B, and then you reflect what could have been done differently to change to make it better. And you kind of change your prompt, right? Basically, you change your prompt on how to do job A and how to do job B, and then you do the whole thing again. So it's really like a pipeline of language where in self-graded descent, you have something like text reasoning to replace those gradient descent algorithms. I think that's one way to think of reflection.Harrison [00:16:47]: One question I have about reflection is how general do you think the algorithm there is? And so for context, I think at LangChain and at other places as well, we found it pretty easy to implement React in a standard way. You plug in any tools and it kind of works off the shelf, can get it up and running. I don't think we have an off-the-shelf kind of implementation of reflection and kind of the general sense. I think the concepts, absolutely, we see used in different kind of specific cognitive architectures, but I don't think we have one that comes off the shelf. I don't think any of the other frameworks have one that comes off the shelf. And I'm curious whether that's because it's not general enough or it's complex as well, because it also requires running it more times.Swyx [00:17:28]: Maybe that's not feasible.Harrison [00:17:30]: I'm curious how you think about the generality, complexity. 
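A compressed sketch of the reflection loop Shunyu describes: attempt, get language feedback from an evaluator, write a short self-reflection, and retry with the reflections prepended. The llm and evaluate stubs are placeholders; as he notes, this only works as well as the evaluator (unit tests, an error trace, a grader model).

```python
# Sketch of the Reflexion-style loop: language feedback in place of a scalar reward.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any model client here")

def evaluate(task: str, attempt: str) -> tuple[bool, str]:
    # e.g. run generated code against tests and return (passed, error_text)
    raise NotImplementedError

def reflexion(task: str, max_trials: int = 3) -> str:
    reflections: list[str] = []
    attempt = ""
    for _ in range(max_trials):
        memory = "\n".join(f"- {r}" for r in reflections)
        attempt = llm(f"Task: {task}\nLessons from earlier attempts:\n{memory}\nAnswer:")
        ok, feedback = evaluate(task, attempt)
        if ok:
            return attempt
        reflections.append(
            llm(f"The attempt failed with feedback: {feedback}\n"
                f"In one or two sentences, what should be done differently next time?")
        )
    return attempt  # best effort after the trial budget
```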
Should we have one that comes off the shelf?Shunyu [00:17:36]: I think the algorithm is general in the sense that it's just as general as other algorithms, if you think about policy grading or whatever, but it's not applicable to all tasks, just like other algorithms. So you can argue PPO is also general, but it works better for those set of tasks, but not on those set of tasks. I think it's the same situation for reflection. And I think a key bottleneck is the evaluator, right? Basically, you need to have a good sense of the signal. So for example, if you are trying to do a very hard reasoning task, say mathematics, for example, and you don't have any tools, you're operating in this chain of thought setup, then reflection will be pretty hard because in order to reflect upon your thoughts, you have to have a very good evaluator to judge whether your thought is good or not. But that might be as hard as solving the problem itself or even harder. The principle of self-reflection is probably more applicable if you have a good evaluator, for example, in the case of coding. If you have those arrows, then you can just reflect on that and how to solve the bug andSwyx [00:18:37]: stuff.Shunyu [00:18:38]: So I think another criteria is that it depends on the application, right? If you have this latency or whatever need for an actual application with an end-user, the end-user wouldn't let you do two hours of tree-of-thought or reflection, right? You need something as soon as possible. So in that case, maybe this is better to be used as a training time technique, right? You do those reflection or tree-of-thought or whatever, you get a lot of data, and then you try to use the data to train your model better. And then in test time, you still use something as simple as React, but that's already improved.Alessio [00:19:11]: And if you think of the Voyager paper as a way to store skills and then reuse them, how would you compare this reflective memory and at what point it's just ragging on the memory versus you want to start to fine-tune some of them or what's the next step once you get a very long reflective corpus? Yeah.Shunyu [00:19:30]: So I think there are two questions here. The first question is, what type of information or memory are you considering, right? Is it like semantic memory that stores knowledge about the word, or is it the episodic memory that stores trajectories or behaviors, or is it more of a procedural memory like in Voyager's case, like skills or code snippets that you can use to do actions, right?Swyx [00:19:54]: That's one dimension.Shunyu [00:19:55]: And the second dimension is obviously how you use the memory, either retrieving from it, using it in the context, or fine-tuning it. I think the Cognitive Architecture for Language Agents paper has a good categorization of all the different combinations. And of course, which way you use depends on the concrete application and the concrete need and the concrete task. But I think in general, it's good to think of those systematic dimensions and all the possible options there.Swyx [00:20:25]: Harrison also has in LangMEM, I think you did a presentation in my meetup, and I think you've done it at a couple other venues as well. User state, semantic memory, and append-only state, I think kind of maps to what you just said.Shunyu [00:20:38]: What is LangMEM? Can I give it like a quick...Harrison [00:20:40]: One of the modules of LangChain for a long time has been something around memory. 
And I think we're still obviously figuring out what that means, as is everyone kind of in the space. But one of the experiments that we did, and one of the proofs of concept that we did was, technically what it was is you would basically create threads, you'd push messages to those threads in the background, and we process the data in a few ways. One, we put it into some semantic store, that's the semantic memory. And then two, we do some extraction and reasoning over the memories to extract key facts or anything that's of interest to the user, and we let the user define this. Those aren't exactly trajectories, they're maybe closer to the procedural memory. Is that how you'd think about it or classify it?Shunyu [00:21:22]: Is it like about knowledge about the world, or is it more like how to do something?Swyx [00:21:27]: It's reflections, basically.Harrison [00:21:28]: So in generative worlds.Shunyu [00:21:30]: Generative agents.Swyx [00:21:31]: The Smallville. Yeah, the Smallville one.Harrison [00:21:33]: So the way that they had their memory there was they had the sequence of events, and that's kind of like the raw events that happened. But then every N events, they'd run some synthesis over those events for the LLM to insert its own memory, basically. It's that type of memory.Swyx [00:21:49]: I don't know how that would be classified.Shunyu [00:21:50]: I think of that as more of the semantic memory, but to be fair, I think it's just one way to think of that. But whether it's semantic memory or procedural memory or whatever memory, that's like an abstraction layer. But in terms of implementation, you can choose whatever implementation for whatever memory. So they're totally kind of orthogonal. I think it's more of a good way to think of the things, because from the history of cognitive science and cognitive architecture and how people study even neuroscience, that's the way people think of how the human brain organizes memory. And I think it's more useful as a way to think of things. But it's not like for semantic memory, you have to do this kind of way to retrieve or fine-tune, and for procedural memory, you have to do that. I think those are totally orthogonal kind of dimensions.Harrison [00:22:34]: How much background do you have in cognitive sciences, and how much do you model some of your thoughts on it?Shunyu [00:22:40]: That's a great question, actually. I think one of the undergrad influences for my follow-up research is I was doing an internship at MIT's Computational Cognitive Science Lab with Josh Tenenbaum, and he's a very famous cognitive scientist. And I think a lot of his ideas still influence me today, like thinking of things in computational terms and getting interested in language and a lot of stuff, or even developmental psychology kind of stuff. So I think it still influences me today.Swyx [00:23:14]: As a developer that tried out LangMem, the way I view it is just that it's a materialized view of a stream of logs. And if anything, that's just useful for context compression. I don't have to use the full context to run it over everything. But also it's kind of debuggable. If it's wrong, I can show it to the user, the user can manually fix it, and I can carry on. That's a really good analogy. I like that. I'm going to steal that. Sure. Please, please. You know I'm bullish on memory databases. I guess, Tree of Thoughts? Yeah, Tree of Thoughts.Shunyu [00:23:39]: I feel like I'm reliving the defense in like a podcast format.
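A sketch of the background-memory pattern Harrison describes, combined with the generative-agents-style "every N events, synthesize" step. This is not the LangMem API; every name here is invented for illustration and `llm` is an assumed helper.

```python
# Sketch of the thread/extraction pattern described above: push messages to a
# thread, extract key facts into a semantic store, and every N events run a
# synthesis pass over the raw log (generative-agents style).

def llm(prompt: str) -> str:  # assumed helper
    raise NotImplementedError

class ThreadMemory:
    def __init__(self, synthesize_every: int = 10):
        self.events: list[str] = []        # raw, append-only event log
        self.facts: list[str] = []         # extracted "semantic" facts
        self.reflections: list[str] = []   # periodic syntheses over raw events
        self.synthesize_every = synthesize_every

    def push(self, message: str) -> None:
        self.events.append(message)
        # Extraction pass: pull out a user-defined "key fact" from the message.
        fact = llm(f"Extract one key fact worth remembering, or reply NONE:\n{message}")
        if fact.strip().upper() != "NONE":
            self.facts.append(fact.strip())
        # Every N events, synthesize a higher-level reflection over the recent log.
        if len(self.events) % self.synthesize_every == 0:
            window = "\n".join(self.events[-self.synthesize_every:])
            self.reflections.append(llm(f"Summarize what matters from these events:\n{window}"))
```

Read this way, the memory is exactly the "materialized view of a stream of logs" mentioned in the conversation: cheap to recompute, inspectable, and fixable by a human if the synthesis goes wrong.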
Yeah, no.Alessio [00:23:45]: I mean, you had a banger. Well, this is the one where you're already successful and we just highlight the glory. It was really good. You mentioned that since thinking is kind of like taking an action, you can use action-searching algorithms to search over thoughts. So just like you would use tree search to find the next thing. And the idea behind Tree of Thoughts is that you generate all these possible outcomes and then find the best path to get to the end. Maybe back to the latency question, you can't really do that if you have to respond in real time. So what are maybe some of the most helpful use cases for things like this? Where have you seen people adopt it where the high latency is actually worth the wait?Shunyu [00:24:21]: For things that you don't care about latency, obviously. For example, if you're trying to do math, if you're just trying to come up with a proof. But I feel like one type of task is more about searching for a solution. You can try a hundred times, but if you find one solution, that's good. For example, if you're finding a math proof or if you're finding good code to solve a problem or whatever. I think another type of task is more like reacting. For example, if you're doing customer service, or you're like a web agent booking a ticket for an end user. Those are more reactive kinds of tasks, or more real-time tasks. You have to do things fast. They might be easy, but you have to do them reliably. And you care more about, can you solve it 99% of the time out of a hundred. But for the search type of tasks, then you care more about, can I find one solution out of a hundred. So it's kind of symmetric and different.Alessio [00:25:11]: Do you have any data or intuition from your user base? What's the split of these types of use cases? How many people are doing more reactive things and how many people are experimenting with deep, long search?Harrison [00:25:23]: I would say ReAct is probably the most popular. I think there are aspects of Reflexion that get used. Tree of Thoughts, probably the least so. There's a great tweet from Jason Wei, I think he's now a colleague of yours, and he was talking about prompting strategies and how he thinks about them. And I think the four things that he had were, one, how easy is it to implement? How much compute does it take? How many tasks does it solve? And how much does it improve on those tasks? And I'd add a fifth, which is how likely is it to be relevant when the next generation of models comes out? And I think if you look at those axes and then you look at ReAct, Reflexion, Tree of Thoughts, it tracks that the ones that score better are used more. ReAct is pretty easy to implement. Tree of Thoughts is pretty hard to implement. The amount of compute, yeah, a lot more for Tree of Thoughts. The tasks and how much it improves, I don't have amazing visibility there. But I think if we're comparing ReAct versus Tree of Thoughts, ReAct just dominates the first two axes so much that my question around that was going to be like, how do you think about these prompting strategies, cognitive architectures, whatever you want to call them? When you're thinking of them, what are the axes that you're judging them on in your head when you're thinking whether it's a good one or a less good one?Swyx [00:26:38]: Right.Shunyu [00:26:39]: Right. I think there is a difference between a prompting method versus research, in the sense that for research, you don't really even care about does it actually work on practical tasks or does it help?
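For reference, the Tree of Thoughts idea Alessio summarizes can be compressed into a short breadth-first search over partial "thoughts"; `propose_thoughts` and `score_thought` stand in for LLM calls and are assumptions, not the paper's code.

```python
# Compressed sketch of the Tree-of-Thoughts idea: generate candidate partial
# "thoughts", score them with an evaluator, keep the best few, and expand again
# (a pruned breadth-first search over reasoning steps).

def propose_thoughts(problem: str, partial: str, k: int) -> list[str]:
    raise NotImplementedError   # assumed: LLM proposes k candidate next steps

def score_thought(problem: str, partial: str) -> float:
    raise NotImplementedError   # assumed: LLM (or heuristic) rates a partial solution

def tree_of_thoughts(problem: str, depth: int = 3, breadth: int = 5, keep: int = 2) -> str:
    frontier = [""]                                   # partial solutions so far
    for _ in range(depth):
        candidates = [
            partial + "\n" + thought
            for partial in frontier
            for thought in propose_thoughts(problem, partial, breadth)
        ]
        # Keep only the highest-scoring partial solutions (beam-style pruning).
        candidates.sort(key=lambda p: score_thought(problem, p), reverse=True)
        frontier = candidates[:keep]
    return frontier[0]
```

The latency point in the conversation falls straight out of this structure: the number of model calls grows with depth times breadth, which is fine for offline search problems and a poor fit for real-time, reactive tasks.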
Whatever. I think it's more about the idea or the principle, right? What is the direction that you're unblocking, and whatever. And I think for an actual prompting method to solve a concrete problem, I would say simplicity is very important because the simpler it is, the fewer decisions you have to make about it. And it's easier to design. It's easier to propagate. And it's easier to do stuff. So always try to be as simple as possible. And I think latency obviously is important. If you can do things fast, you don't want to do things slow. And I think in terms of the actual prompting method to use for a particular problem, I think we should all be in the minimalist kind of camp, right? You should try the minimum thing and see if it works. And if it doesn't work and there's an absolute reason to add something, then you add something, right? If there's an absolute reason that you need some tool, then you should add the tool thing. If there's an absolute reason to add reflection or whatever, you should add that. Otherwise, if a chain of thought can already solve something, then you don't even need to use any of that.Harrison [00:27:57]: Yeah. Or if just better prompting can solve it. Like, you know, you could add a reflection step or you could make your instructions a little bit clearer.Swyx [00:28:03]: And it's a lot easier to do that.Shunyu [00:28:04]: I think another interesting thing is, I personally have never done those kinds of weird tricks. I think all the prompts that I write are kind of like just talking to a human, right? It's like, I don't know. I never say something like, your grandma is dying and you have to solve it. I mean, those are cool, but I feel like we should all try to solve things in a very intuitive way. Just like talking to your co-worker. That should work 99% of the time. That's my personal take.Swyx [00:28:29]: The problem with language models, at least in the GPT-3 era, was that they over-optimized to some sets of tokens in sequence. So like reading the Kojima et al. paper, the "let's think step by step" one, he tried a bunch of them and they had wildly different results. It should not be the case, but it is the case. And hopefully we're getting better there.Shunyu [00:28:51]: Yeah. I think it's also like a timing thing in the sense that if you think about this whole line of language models, right? Like at the time it was just like a text generator. We didn't have any idea how it was going to be used, right? And obviously at the time you would find all kinds of weird issues because it's not trained to do any of that, right? But then I think we have this loop where once we realize chain of thought is important or agents are important or tool use is important, what we see is today's language models are heavily optimized towards those things. So I think in some sense they become more reliable and robust over those use cases. And you don't need as many prompt engineering tricks anymore to solve those things. I feel like in some sense, prompt engineering is even like a slightly negative word now because it refers to all those kinds of weird tricks that you have to apply. But I think we don't have to do that anymore. Like given today's progress, you should just be able to talk to it like a coworker. And if you're clear and concrete and being reasonable, then it should do reasonable things for you.Swyx [00:29:51]: Yeah.
The way I put this is you should not be a prompt engineer because it is the goal of the big labs to put you out of a job.Shunyu [00:29:58]: You should just be a good communicator. Like if you're a good communicator to humans, you should be a good communicator to languageSwyx [00:30:02]: models.Harrison [00:30:03]: That's the key though, because oftentimes people aren't good communicators to these language models and that is a very important skill and that's still messing around with the prompt. And so it depends what you're talking about when you're saying prompt engineer.Shunyu [00:30:14]: But do you think it's like very correlated with, are they a good communicator to humans? You know, it's like.Harrison [00:30:20]: It may be, but I also think I would say on average, people are probably worse at communicating with language models than with humans right now, at least, because I think we're still figuring out how to do it. You kind of expect it to be magical and there's probably some correlation, but I'd say there's also just, like, people are worse at it right now than talking to humans.Shunyu [00:30:36]: We should make it like a, you know, like an elementary school class or whatever, how toSwyx [00:30:41]: talk to language models. Yeah. I don't know. Very pro that. Yeah. Before we leave the topic of trees and searching, not specifically about Q*, but there are a lot of questions about MCTS and this combination of tree search and language models. And I just had to get in a question there about how seriously should people take this?Shunyu [00:30:59]: Again, I think it depends on the tasks, right? So MCTS was magical for Go, but it's probably not as magical for robotics, right? So I think right now the problem is not even that we don't have good methodologies, it's more that we don't have good tasks. It's also very interesting, right? Because if you look at my citations, obviously the most cited are ReAct, Reflexion and Tree of Thoughts. Those are methodologies. But I think an equally important, if not more important, line of my work is benchmarks and environments, right? Like WebShop or SWE-bench or whatever. And I think in general, what people do in academia that I think is not good is they choose a very simple task, like ALFWorld, and then they apply overly complex methods to show they improve 2%. I think you should probably match the level of complexity of your task and your method. I feel like tasks are kind of far behind the methods in some sense, right? Because we have some good test-time approaches, like whatever, ReAct or Reflexion or Tree of Thoughts, and there are many, many more complicated test-time methods afterwards. But on the benchmark side, we have made a lot of good progress this year, last year. But I think we still need more progress towards that, like better coding benchmarks, better web agent benchmarks, better agent benchmarks, not even for web or code. I think in general, we need to catch up with tasks.Harrison [00:32:27]: What are the biggest reasons in your mind why it lags behind?Shunyu [00:32:31]: I think incentive is one big reason. Like if you see, you know, all the method papers are cited like a hundred times more than the task papers. And also making a good benchmark is actually quite hard. It's almost like a different set of skills in some sense, right? I feel like if you want to build a good benchmark, you need to have like a good kind of product manager mindset, right?
You need to think about why people should use your benchmark, why it's challenging, why it's useful. If you think about like a PhD going into a school, right? The prior skills they're expected to have are more about, you know, can they code this method and can they just run experiments and solve that? I think building a benchmark is not the typical prior skill that we have, but I think things are getting better. I think more and more people are starting to build benchmarks and people are saying that it's like a way to get more impact in some sense, right? Because if you have a really good benchmark, a lot of people are going to use it. But if you have a super complicated test-time method, it's very hard for people to use it.Harrison [00:33:35]: Are evaluation metrics also part of the reason? Like for some of these tasks that we might want to ask these agents or language models to do, is it hard to evaluate them? And so it's hard to get an automated benchmark. Obviously with SWE-bench you can, and with coding, it's easier, but.Shunyu [00:33:50]: I think that's part of the skillset thing that I mentioned, because I feel like it's like a product manager thing, because there are many dimensions and you need to strike a balance and it's really hard, right? If you want to make something very easy to autograde, like automatically gradable, easy to grade or easy to evaluate, then you might lose some of the realness or practicality. Or it might be practical, but it might not be as scalable, right? For example, if you think about a text game, humans have pre-annotated all the rewards and all the language is real. So it's pretty good on the autogradable dimension and the practical dimension, you know, it's practical, like it's actual English. But it's not scalable, right? It takes like a year for experts to build that game. So it's not really that scalable. And I think part of the reason that SWE-bench is so popular now is it kind of hits the balance between these three dimensions, right? Easy to evaluate, and being actually practical, and being scalable. Like if I were to criticize some of my prior work, I think WebShop, which was my initial attempt to get into the benchmark world, I was trying to do a good job striking the balance. But obviously we made it autogradable and it's really scalable, but then I think the practicality is not as high as actually just using GitHub issues, right? Because you're just creating those synthetic tasks.Harrison [00:35:13]: Are there other areas besides coding that jump to mind as being really good for being autogradable?Shunyu [00:35:20]: Maybe mathematics.Swyx [00:35:21]: Classic. Yeah. Do you have thoughts on AlphaProof, the new DeepMind paper? I think it's pretty cool.Shunyu [00:35:29]: I think it's more of a, you know, it's more of like a confidence boost, or like sometimes, you know, the work is not even about, you know, the technical details or the methodology that it chooses or the concrete results. I think it's more about a signal, right?Swyx [00:35:47]: Yeah. Existence proof. Yeah.Shunyu [00:35:50]: Yeah. It can be done. This direction is exciting. It kind of encourages people to work more towards that direction. I think it's more like a boost of confidence, I would say.Swyx [00:35:59]: Yeah. So we're going to focus more on agents now and, you know, all of us have a special interest in coding agents. I would consider Devin to be the sort of biggest launch of the year as far as AI startups go.
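A toy example of what "autogradable" means in the coding setting: the project's own test suite is the grader. The repo path, patch file, and test command here are placeholders, not the actual SWE-bench harness.

```python
# Toy illustration of autograding a coding task: apply a candidate patch and
# let the existing test suite decide pass/fail. Paths and commands are
# hypothetical placeholders.
import subprocess

def grade_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    applied = subprocess.run(["git", "-C", repo_dir, "apply", patch_file])
    if applied.returncode != 0:
        return False                       # the patch didn't even apply
    result = subprocess.run(test_cmd, cwd=repo_dir)
    return result.returncode == 0          # grading comes for free from the tests

# Example (hypothetical paths):
# grade_patch("/tmp/some_repo", "/tmp/candidate.patch", ["pytest", "-x"])
```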
And you guys in the Princeton group worked on SWE-agent alongside SWE-bench. Tell us the story about SWE-agent. Sure.Shunyu [00:36:21]: I think it's kind of like a trilogy, it's actually a series of three works now. So actually the first work is called InterCode, but it's not as famous, I know. And the second work is called SWE-bench and the third work is called SWE-agent. And I was just really confused why nobody was working on coding. You know, this was like a year ago, but I mean, not everybody's working on coding, obviously, but a year ago, like literally nobody was working on coding. I was really confused. And the people that were working on coding were, you know, trying to solve HumanEval in like a seq-to-seq way. There's no agent, there's no chain of thought, there's no anything, they're just, you know, fine-tuning the model and improving some points and whatever. I was really confused because obviously coding is the best application for agents because it's autogradable, it's super important, you can make everything like an API or code action, right? So I was confused and I collaborated with some of the students in Princeton and we have this work called InterCode and the idea is, first, if you care about coding, then you should solve coding in an interactive way, meaning more like a Jupyter Notebook kind of way than just writing a program and seeing if it fails or succeeds and stopping, right? You should solve it in an interactive way because that's exactly how humans solve it, right? You don't have to, you know, write a program like next token, next token, next token and stop and never do any edits and you cannot really use any terminal or whatever tool. It doesn't make sense, right? And that's the way people were solving coding at the time, basically like sampling a program from a language model without chain of thought, without tool calls, without refactoring, without anything. So the first point is we should solve coding in a very interactive way and that's a very general principle that applies for various coding benchmarks. And also, I think you can make a lot of agent tasks kind of like interactive coding. If you have Python and you can call any package, then you can literally also browse the internet or do whatever you want, like control a robot or whatever. So that seems to be a very general paradigm. But obviously I think a bottleneck is that at the time we were still doing, you know, very simple tasks like HumanEval or whatever coding benchmark people proposed. They were super hard in 2021, like 20%, but they were like 95% already in 2023. So obviously the next step is we need a better benchmark. And Carlos and John, who are the first authors of SWE-bench, I think they came up with this great idea that we should just scrape GitHub and solve whatever human engineers are solving. And I think it's actually pretty easy to come up with the idea. And I think in the first week, they already made a lot of progress. They scraped GitHub and they put it all together, but then there's a lot of painful infra work and whatever, you know. I think the idea is super easy, but the engineering is super hard. And I feel like that's a very typical signal of good work in the AI era now.Swyx [00:39:17]: I think also, I think the filtering was challenging, because if you look at open source PRs, a lot of them are just like, you know, fixing typos. I think it's challenging.Shunyu [00:39:27]: And to be honest, we didn't do a perfect job at the time.
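A minimal sketch of the interactive coding loop described here: the model writes code, the harness runs it, and stdout/stderr flow back as the next observation. `llm` is an assumed helper and this is not the InterCode framework itself.

```python
# Minimal sketch of interactive coding: instead of sampling one program and
# stopping, the agent executes code, observes stdout/stderr, and revises,
# closer to how a human works in a notebook or terminal.
import subprocess

def llm(prompt: str) -> str:
    raise NotImplementedError   # assumed helper

def interactive_coding(task: str, max_turns: int = 5) -> str:
    history = f"Task: {task}\n"
    code = ""
    for _ in range(max_turns):
        code = llm(history + "Write or revise Python code for the task. Code only:")
        run = subprocess.run(
            ["python", "-c", code], capture_output=True, text=True, timeout=30
        )
        observation = (run.stdout + run.stderr).strip()
        history += f"\nCode:\n{code}\nOutput:\n{observation}\n"
        if run.returncode == 0:
            break                          # crude stopping rule for the sketch
    return code
```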
So if you look at the recent blog post with OpenAI, we improved the filtering so that it's more solvable.Swyx [00:39:36]: I think OpenAI was just like, look, this is a thing now. We have to fix this. These students just rushed it.Shunyu [00:39:45]: It's a good convergence of interests for me.Alessio [00:39:48]: Was that tied to you joining OpenAI? Or was that just unrelated?Shunyu [00:39:52]: It's a coincidence for me, but it's a good coincidence.Swyx [00:39:55]: There is a history of anytime a big lab adopts a benchmark, they fix it. Otherwise, it's a broken benchmark.Shunyu [00:40:03]: So naturally, once we proposed SWE-bench, the next step is to solve it. But I think the typical way you solve something now is you collect some training samples, or you design some complicated agent method, and then you try to solve it. Either a super complicated prompt, or you build a better model with more training data. But I think at the time, we realized that even before those things, there's a fundamental problem with the interface or the tool that you're supposed to use. Because that's like an ignored problem in some sense. What your tool is, or how that matters for your task. So what we found concretely is that if you just use the text terminal off the shelf as a tool for those agents, there are a lot of problems. For example, if you edit something, there's no feedback. So you don't know whether your edit is good or not. That makes the agent very confused and it makes a lot of mistakes. There are a lot of small problems, you would say. Well, you can try to do prompt engineering and improve that, but it turns out to be actually very hard. We realized that the interface design is actually a very overlooked part of agent design. So we did this SWE-agent work. And the key idea is just, even before you talk about what the agent is, you should talk about what the environment is. You should make sure that the environment is actually friendly to whatever agent you're trying to apply. That's the same idea for humans. A text terminal is good for some tasks, like git pull or whatever. But it's not good if you want to look at a browser and whatever. Also, a browser is a good tool for some tasks, but it's not a good tool for other tasks. We need to talk about how to design the interface, in some sense, where we should treat agents as our customers. It's like when we treat humans as customers, we design human-computer interfaces. We design those beautiful desktops or browsers or whatever, so that it's very intuitive and easy for humans to use. And this whole great subject of HCI is all about that. I think now the research idea of SWE-agent is just, we should treat agents as our customers. And we should do like, you know… ACI.Swyx [00:42:16]: ACI, exactly.Harrison [00:42:18]: So what are the tools that a SWE-agent should have, or a coding agent in general should have?Shunyu [00:42:24]: For SWE-agent, it's like a modified text terminal, which kind of adapts to a lot of the patterns of language models to make it easier for language models to use. For example, now for edit, instead of having no feedback, it will actually have feedback like, actually here you introduced a syntax error, and you should probably want to fix that, and there's an indentation error there. And that makes it super easy for the model to actually do that. And there are other small things, like how exactly you write arguments, right? Like, do you want to write a multi-line edit, or do you want to write a single-line edit?
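To make the ACI point concrete, here is a simplified sketch of a feedback-rich edit tool in the spirit Shunyu describes, where every edit is followed by a cheap syntax check and the error message is returned to the agent instead of failing silently. It is not SWE-agent's actual edit command.

```python
# Simplified sketch of a feedback-rich "edit" tool for a coding agent: the edit
# is only applied if the resulting file still parses, and either way the agent
# gets an explicit message it can act on.
import ast

def edit_file(path: str, start: int, end: int, replacement: str) -> str:
    with open(path) as f:
        lines = f.readlines()
    if not replacement.endswith("\n"):
        replacement += "\n"
    lines[start - 1:end] = [replacement]
    new_source = "".join(lines)
    try:
        ast.parse(new_source)              # cheap lint: catch syntax errors immediately
    except SyntaxError as e:
        return (f"Edit NOT applied. It would introduce a syntax error at "
                f"line {e.lineno}: {e.msg}. Fix the replacement text and retry.")
    with open(path, "w") as f:
        f.write(new_source)
    return f"Edit applied to {path}, lines {start}-{end}. File still parses cleanly."
```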
I think it's more interesting to think about the development process of an ACI rather than the actual ACI for a concrete application. Because I think the general paradigm is very similar to HCI and psychology, right? Basically, for how people develop HCIs, they do behavior experiments on humans, right? They do A/B tests, right? Like, which interface is actually better? And they do those behavior experiments, kind of like psychology experiments on humans, and they change things. And I think what's really interesting for me, for this SWE-agent paper, is we can probably do the same thing for agents, right? We can do A/B tests for those agents and do behavior tests. And through the process, we not only invent better interfaces for those agents, that's the practical value, but we also better understand agents. Just like when we do those A/B tests, when we do HCI, we better understand humans. Doing those ACI experiments, we actually better understand agents. And that's pretty cool.Harrison [00:43:51]: Besides that A/B testing, what are other processes that people can use to think about this in a good way?Swyx [00:43:57]: That's a great question.Shunyu [00:43:58]: And I think SWE-agent is an initial work. And what we do is kind of the naive approach, right? You just try some interface, and you see what's going wrong, and then you try to fix that. We do this kind of iterative fixing. But I think what's really interesting is there will be a lot of future directions that are very promising if we can apply some of the HCI principles more systematically to the interface design. I think that would be a very cool interdisciplinary research opportunity.Harrison [00:44:26]: You talked a lot about agent-computer interfaces and interactions. What about human-to-agent UX patterns? Curious for any thoughts there that you might have.Swyx [00:44:38]: That's a great question.Shunyu [00:44:39]: And in some sense, I feel like prompt engineering is about the human-to-agent interface. But I think there can be a lot of interesting research done about... So prompting is about how humans can better communicate with the agent. But I think there could be interesting research on how agents can better communicate with humans, right? When to ask questions, how to ask questions, what's the frequency of asking questions. And I think those kinds of things could be very cool research.Harrison [00:45:07]: Yeah, I think some of the most interesting stuff that I saw here was also related to coding with Devin from Cognition. And they had the three or four different panels where you had the chat, the browser, the terminal, and I guess the code editor as well.Swyx [00:45:19]: There's more now.Harrison [00:45:19]: There's more. Okay, I'm not up to date. Yeah, I think they also did a good job on ACI.Swyx [00:45:25]: I think that's the main learning I have from Devin. They cracked that. Actually, there was no foundational planning breakthrough. The planner is actually pretty simple, but it's the ACI that they broke through on.Shunyu [00:45:35]: I think making the tool good and reliable is probably like 90% of the whole agent. Once the tool is actually good, then the agent design can be much, much simpler. On the other hand, if the tool is bad, then no matter how much you put into the agent design, planning or search or whatever, it's still going to be trash.Harrison [00:45:53]: Yeah, I'd argue the same. Same with like context and instructions.
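The behavioral-experiment idea in miniature: run the same tasks through two interface variants and compare success rates, the way you would A/B test a UI on humans. `run_agent` and the variant names are placeholders, not any benchmark's real harness.

```python
# Miniature "A/B test for agents": same tasks, two interface variants, compare
# success rates. Everything here is a placeholder for illustration.

def run_agent(task: str, interface: str) -> bool:
    raise NotImplementedError   # assumed: returns True if the agent solved the task

def ab_test_interfaces(tasks: list[str], variant_a: str, variant_b: str) -> dict[str, float]:
    wins = {variant_a: 0, variant_b: 0}
    for task in tasks:
        for variant in (variant_a, variant_b):
            wins[variant] += run_agent(task, variant)
    return {variant: wins[variant] / len(tasks) for variant in wins}

# e.g. ab_test_interfaces(tasks, "terminal_no_feedback", "terminal_with_lint_feedback")
```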
Like, yeah, they go hand in hand.Alessio [00:46:00]: On the tool, how do you think about the tension of, for both of you, I mean, you're building a library, so even more for you, the tension between making a language or a library that is easy for the agent to grasp and write versus one that is easy for the human to grasp and write. Because, you know, the trend is that more and more code gets written by the agent. So why wouldn't you optimize the framework to be as easy as possible for the model versus for the person?Swyx [00:46:24]: I think it's possible to design an interfaceShunyu [00:46:25]: that's both friendly to humans and agents. But what do you think?Harrison [00:46:29]: We haven't thought about that from that perspective, like we're not trying to design LangChain or LangGraph to be friendly. But I mean, I think, to be friendly for agents to write.Swyx [00:46:42]: But I mean, I think we see this with like,Harrison [00:46:43]: I saw some paper that used TypeScript notation instead of JSON notation for tool calling and it got a lot better performance. So it's definitely a thing. I haven't really heard of anyone designing a syntax or a language explicitly for agents, but there are clearly syntaxes that are better.Shunyu [00:46:59]: I think function calling is a good example where it's a good interface for both human programmers and for agents, right? Like for developers, it's actually a very friendly interface because it's very concrete and you don't have to do prompt engineering anymore. You can be very systematic. And for models, it's also pretty good, right? Like it can use all the existing coding content. So I think we need more of those kinds of designs.Swyx [00:47:21]: I will mostly agree and I'll slightly disagree in terms of this, which is whether designing for humans also overlaps with designing for AI. So Malte Ubl, who's the CTO of Vercel, who is creating basically the JavaScript competitor to LangChain, they're observing that basically, if the API is easy to understand for humans, it's actually much easier to understand for LLMs, for example, because they're not overloaded functions. They don't behave differently under different contexts. They do one thing and they always work the same way. It's easy for humans, it's easy for LLMs. And like that makes a lot of sense. And obviously adding types is another one. Like type annotations only help give extra context, which is really great. So that's the agreement. And then a disagreement is that when I use structured output to do my chain of thought, I have found that I change my field names to hint to the LLM what the field is supposed to do. So instead of saying topics, I'll say candidate topics. And that gives me a better result because the LLM was like, ah, this is just a draft thing I can use for chain of thought. And instead of summaries, I'll say topic summaries to link the previous field to the current field. So little stuff like that, I find myself optimizing for the LLM where I, as a human, would never do that. Interesting.Shunyu [00:48:32]: It's kind of like the way you optimize the prompt, it might be different for humans and for machines. You can have a common ground that's both clear for humans and agents, but to improve the human performance versus improving the agent performance, they might move in different directions.Swyx [00:48:48]: Might move in different directions.
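One way to express the field-naming trick Swyx describes is to put the hint directly in a structured-output schema; Pydantic is used here only as a convenient way to write a JSON schema, and the specific field names come from the example in the conversation.

```python
# The schema itself hints the model that one field is scratch "chain of thought"
# and the next is the final answer: "candidate_topics" reads as a draft,
# "topic_summaries" explicitly links back to the previous field.
from pydantic import BaseModel, Field

class TopicExtraction(BaseModel):
    candidate_topics: list[str] = Field(
        description="Rough draft list of possible topics; used as scratch reasoning."
    )
    topic_summaries: list[str] = Field(
        description="One-sentence summary for each topic kept from candidate_topics."
    )

# TopicExtraction.model_json_schema() can then be passed to whatever
# structured-output or function-calling mechanism you are using.
```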
There's a lot more use of metadata as well, like descriptions, comments, code comments, annotations and stuff like that. Yeah.Harrison [00:48:56]: I would argue that's just you communicatingSwyx [00:48:58]: to the agent what it should do.Harrison [00:49:00]: And maybe you need to communicate a little bit more than to humans because models aren't quite good enough yet.Swyx [00:49:06]: But like, I don't think that's crazy.Harrison [00:49:07]: I don't think that's like- It's not crazy.Swyx [00:49:09]: I will bring this in because it just happened to me yesterday. I was at the Cursor office. They held their first user meetup and I was telling them about the LLM OS concept and why basically every interface, every tool was being redesigned for AIs to use rather than humans. And they're like, why? Like, can we just use Bing and Google for LLM search? Why must I use Exa? Or what's the other one that you guys work with?Harrison [00:49:32]: Tavily.Swyx [00:49:33]: Tavily. A web search API dedicated for LLMs. What's the difference?Shunyu [00:49:36]: Exactly. Compared to the Bing API.Swyx [00:49:38]: Exactly.Harrison [00:49:38]: There weren't great APIs for search. Like the best one, the one that we used initially in LangChain, was SerpAPI, which is like maybe illegal. I'm not sure.Swyx [00:49:49]: And like, you know,Harrison [00:49:52]: and now there are like venture-backed companies.Swyx [00:49:53]: Shout out to DuckDuckGo, which is free.Harrison [00:49:55]: Yes, yes.Swyx [00:49:56]: Yeah.Harrison [00:49:56]: I do think there are some differences though. I think you want, like, I think generally these APIs try to return small amounts of text information, clear legible fields. It's not a massive JSON blob. And I think that matters. I think when you talk about designing tools, it's the interface in its entirety, not only the inputs, but also the outputs that really matter. And so I think they try to make the outputs.Shunyu [00:50:18]: They're doing ACI.Swyx [00:50:19]: Yeah, yeah, absolutely.Harrison [00:50:20]: Really?Swyx [00:50:21]: Like there's a whole set of industries that are just being redone for ACI. It's weird. And so my simple answer to them was like the error messages. When you give error messages, they should be basically prompts for the LLM to take and then self-correct. Then your error messages get more verbose, actually, than you normally would with a human. Stuff like that. Like a little, honestly, it's not that big. Again, like, is this worth a venture-backed industry? Unless you can tell us. But like, I think Code Interpreter, I think is a new thing. I hope so.Alessio [00:50:52]: We invested in it to be so.Shunyu [00:50:53]: I think that's a very interesting point. If you're trying to optimize to the extreme, then obviously they're going to be different. For example, the error...Swyx [00:51:00]: Because we take it very seriously. Right.Shunyu [00:51:01]: The error for a language model, the longer the better. But for humans, that will make them very nervous and very tired, right? But I guess the point is more like, maybe we should try to find a co-optimized common ground as much as possible. And then if we have divergence, then we should try to diverge. But it's more philosophical now.Alessio [00:51:19]: But I think part of it is how you use it. So Google invented PageRank because ideally you only click on one link, you know, like the top three should have the answer. But with models, it's like, well, you can get 20.
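A sketch of the "error messages are prompts" point: wrap a tool so that a terse failure becomes a verbose, corrective message the model can act on, and return many results at once rather than paging the way a human UI would. `search_api` is a made-up placeholder.

```python
# Wrap a tool so its failures read like prompts for self-correction, and its
# successes hand back a wide set of candidates for the model to sift in context.
# `search_api` is hypothetical; no real search provider's API is shown here.

def search_api(query: str) -> list[str]:
    raise NotImplementedError

def llm_friendly_search(query: str) -> str:
    try:
        results = search_api(query)
    except Exception as e:
        return (
            f"The search call failed with: {e!r}. "
            "Likely causes: the query was empty, too long, or contained "
            "unsupported operators. Rewrite the query as a short plain-language "
            "phrase and call the tool again."
        )
    if not results:
        return ("No results. Try a broader query: drop quotes, remove filters, "
                "or use fewer keywords, then call the tool again.")
    # Return many results at once and let the model decide what is relevant.
    return "\n".join(f"[{i}] {r}" for i, r in enumerate(results[:20]))
```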
So those searches are more like semantic grouping in a way. It's like for this query, I'll return you like 20, 30 things that are kind of good, you know? So it's less about ranking and it's more about grouping.Shunyu [00:51:42]: Another fundamental thing about HCI is the difference between humans' and machines' kind of memory limits, right? So I think what's really interesting about this concept of HCI versus ACI is that from interfaces that are optimized for them, you can kind of understand some of the fundamental characteristics and differences of humans and machines, right? Why, you know, if you look at find or whatever terminal command, you can only look at one thing at a time, and that's because we have a very small working memory. You can only deal with one thing at a time. You can only look at one paragraph of text at the same time. So the interface for us is, by design, a small piece of information, but more temporal steps. But for machines, that should be the opposite, right? You should just give them a hundred different results and they should just decide in context what's the most relevant stuff and trade off the context for temporal steps. That's actually also better for language models because the cost is smaller or whatever. So it's interesting to connect those interfaces to the fundamental kind of differences of those.Harrison [00:52:43]: When you said earlier, you know, we should try to design these to maybe be as similar as possible and diverge if we need to.Swyx [00:52:49]: I actually don't have a problem with them diverging nowHarrison [00:52:51]: and seeing venture-backed startups emerging now, because we are different from machines and from AI. And it's just so early on, like they may still look kind of similar and there may still be small differences, but it's still just so early. And I think we'll only discover more ways that they differ. And so I'm totally fine with them kind of diverging earlySwyx [00:53:10]: and optimizing for the...Harrison [00:53:11]: I agree. I think it's more like, you know,Shunyu [00:53:14]: we should obviously try to optimize the human interface just for humans. We've already been doing that for 50 years. We should optimize the agent interface just for agents, but we might also try to co-optimize both and see how far we can get. There's enough people to try all three directions. Yeah.Swyx [00:53:31]: There's a thesis I sometimes push, which is the sour lesson as opposed to the bitter lesson, which is, we're always inspired by human development, but actually AI develops its own path.Shunyu [00:53:40]: Right. We need to understand better, you know, what are the fundamental differences between those creatures.Swyx [00:53:45]: It's funny, really early on this pod, you were like, how much grounding do you have in cognitive development and human brain stuff? And I'm like

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
From API to AGI: Structured Outputs, OpenAI API platform and O1 Q&A — with Michelle Pokrass & OpenAI Devrel + Strawberry team

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Sep 13, 2024 122:59


Congrats to Damien on successfully running AI Engineer London! See our community page and the Latent Space Discord for all upcoming events.

This podcast came together in a far more convoluted way than usual, but happens to result in a tight 2 hours covering the ENTIRE OpenAI product suite across ChatGPT-latest, GPT-4o and the new o1 models, and how they are delivered to AI Engineers in the API via the new Structured Output mode, Assistants API, client SDKs, upcoming Voice Mode API, Finetuning/Vision/Whisper/Batch/Admin/Audit APIs, and everything else you need to know to be up to speed in September 2024.

This podcast has two parts: the first hour is a regular, well-edited podcast on 4o, Structured Outputs, and the rest of the OpenAI API platform. The second was a rushed, noisy, hastily cobbled together recap of the top takeaways from the o1 model release from yesterday and today.

Building AGI with Structured Outputs — Michelle Pokrass of OpenAI API team

Michelle Pokrass built massively scalable platforms at Google, Stripe, Coinbase and Clubhouse, and now leads the API Platform at OpenAI. She joins us today to talk about why structured output is such an important modality for AI Engineers that OpenAI has now trained and engineered a Structured Output mode with 100% reliable JSON schema adherence. To understand why this is important, a bit of history helps:

* June 2023: OpenAI first added a "function calling" capability to GPT-4-0613 and GPT-3.5 Turbo 0613 (our podcast/writeup here)
* November 2023: OpenAI Dev Day (our podcast/writeup here), where the team shipped JSON Mode, a simpler schema-less JSON output mode that nevertheless became more popular because function calling often failed to match the JSON schema given by developers.
* Meanwhile, in open source, many solutions arose, including:
  * Instructor (our pod with Jason here)
  * LangChain (our pod with Harrison here, and he is returning next as a guest co-host)
  * Outlines (Remi Louf's talk at AI Engineer here)
  * Llama.cpp's constrained grammar sampling using GGML-BNF
* April 2024: OpenAI started implementing constrained sampling with a new `tool_choice: required` parameter in the API
* August 2024: the new Structured Output mode, co-led by Michelle
* Sept 2024: Gemini shipped Structured Outputs as well

We sat down with Michelle to talk through every part of the process, as well as quizzing her for updates on everything else the API team has shipped in the past year, from the Assistants API, to Prompt Caching, GPT-4 Vision, Whisper, the upcoming Advanced Voice Mode API, OpenAI Enterprise features, and why every Waterloo grad seems to be a cracked engineer.

Part 1 Timestamps and Transcript

Transcript here.

* [00:00:42] Episode Intro from Suno
* [00:03:34] Michelle's Path to OpenAI
* [00:12:20] Scaling ChatGPT
* [00:13:20] Releasing Structured Output
* [00:16:17] Structured Outputs vs Function Calling
* [00:19:42] JSON Schema and Constrained Grammar
* [00:20:45] OpenAI API team
* [00:21:32] Structured Output Refusal Field
* [00:24:23] ChatML issues
* [00:26:20] Function Calling Evals
* [00:28:34] Parallel Function Calling
* [00:29:30] Increased Latency
* [00:30:28] Prompt/Schema Caching
* [00:30:50] Building Agents with Structured Outputs: from API to AGI
* [00:31:52] Assistants API
* [00:34:00] Use cases for Structured Output
* [00:37:45] Prompting Structured Output
* [00:39:44] Benchmarking Prompting for Structured Outputs
* [00:41:50] Structured Outputs Roadmap
* [00:43:37] Model Selection vs GPT4 Finetuning
* [00:46:56] Is Prompt Engineering Dead?
* [00:47:29] 2 models: ChatGPT Latest vs GPT 4o August
* [00:50:24] Why API => AGI
* [00:52:40] Dev Day
* [00:54:20] Assistants API Roadmap
* [00:56:14] Model Reproducibility/Determinism issues
* [00:57:53] Tiering and Rate Limiting
* [00:59:26] OpenAI vs Ops Startups
* [01:01:06] Batch API
* [01:02:54] Vision
* [01:04:42] Whisper
* [01:07:21] Voice Mode API
* [01:08:10] Enterprise: Admin/Audit Log APIs
* [01:09:02] Waterloo grads
* [01:10:49] Books
* [01:11:57] Cognitive Biases
* [01:13:25] Are LLMs Econs?
* [01:13:49] Hiring at OpenAI

Emergency O1 Meetup — OpenAI DevRel + Strawberry team

The following is our writeup from AINews, which so far stands the test of time.

o1, aka Strawberry, aka Q*, is finally out! There are two models we can use today: o1-preview (the bigger one priced at $15 in / $60 out) and o1-mini (the STEM-reasoning focused distillation priced at $3 in / $12 out), and the main o1 model is still in training. This caused a little bit of confusion.

There are a raft of relevant links, so don't miss:

* the o1 Hub
* the o1-preview blogpost
* the o1-mini blogpost
* the technical research blogpost
* the o1 system card
* the platform docs
* the o1 team video and contributors list (twitter)

In line with the many, many leaks leading up to today, the core story is longer "test-time inference", aka longer step-by-step responses. In the ChatGPT app this shows up as a new "thinking" step that you can click to expand for reasoning traces, even though, controversially, they are hidden from you (interesting conflict of interest…).

Under the hood, o1 is trained for adding new reasoning tokens, which you pay for, and OpenAI has accordingly extended the output token limit to >30k tokens (incidentally this is also why a number of API parameters from the other models, like temperature and role and tool calling and streaming, but especially max_tokens, are no longer supported).

The evals are exceptional. OpenAI o1:

* ranks in the 89th percentile on competitive programming questions (Codeforces),
* places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME),
* and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).

You are used to new models showing flattering charts, but there is one of note that you don't see in many model announcements, and it is probably the most important chart of all. Dr Jim Fan gets it right: we now have scaling laws for test-time compute, and it looks like they scale loglinearly.

We unfortunately may never know the drivers of the reasoning improvements, but Jason Wei shared some hints.

Usually the big model gets all the accolades, but notably many are calling out the performance of o1-mini for its size (smaller than GPT-4o), so do not miss that.

Part 2 Timestamps

* [01:15:01] O1 transition
* [01:16:07] O1 Meetup Recording
* [01:38:38] OpenAI Friday AMA recap
* [01:44:47] Q&A Part 2
* [01:50:28] O1 Demos

Demo Videos to be posted shortly

Get full access to Latent Space at www.latent.space/subscribe
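A hedged sketch of calling the Structured Outputs mode described above from the OpenAI Python SDK; the method and parameter names follow the docs from around the time of this episode, so check the current SDK before relying on them.

```python
# Structured Outputs sketch: a Pydantic model defines the JSON schema, and the
# SDK's parse helper asks the API to adhere to it. Model name and helper method
# are as documented around the time of this episode.
from pydantic import BaseModel
from openai import OpenAI

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,   # schema adherence is enforced, not just encouraged
)
event = completion.choices[0].message.parsed
print(event)
```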

This Week in Machine Learning & Artificial Intelligence (AI) Podcast
The Building Blocks of Agentic Systems with Harrison Chase - #698

This Week in Machine Learning & Artificial Intelligence (AI) Podcast

Play Episode Listen Later Aug 19, 2024 59:17


Today, we're joined by Harrison Chase, co-founder and CEO of LangChain to discuss LLM frameworks, agentic systems, RAG, evaluation, and more. We dig into the elements of a modern LLM framework, including the most productive developer experiences and appropriate levels of abstraction. We dive into agents and agentic systems as well, covering the “spectrum of agenticness,” cognitive architectures, and real-world applications. We explore key challenges in deploying agentic systems, and the importance of agentic architectures as a means of communication in system design and operation. Additionally, we review evolving use cases for RAG, and the role of observability, testing, and evaluation tools in moving LLM applications from prototype to production. Lastly, Harrison shares his hot takes on prompting, multi-modal models, and more! The complete show notes for this episode can be found at https://twimlai.com/go/698.

PodRocket - A web development podcast from LogRocket
AI for React developers and beyond with Alexandra Spalato

PodRocket - A web development podcast from LogRocket

Play Episode Listen Later Aug 7, 2024 29:37


Alexandra Spalato, AI integration specialist and frontend visionary, discusses how AI is revolutionizing the workflow for React developers, the indispensable tools making development faster and more efficient, and the exciting possibilities AI opens up for the entire software development process. Links https://www.alexandraspalato.com https://x.com/alexadark https://github.com/alexadark https://www.linkedin.com/in/alexandraspalato https://www.latent.space/p/ai-engineer https://cursor.sh https://pieces.app https://sourcegraph.com https://www.langchain.com https://www.llamaindex.ai https://langbase.com We want to hear from you! How did you find us? Did you see us on Twitter? In a newsletter? Or maybe we were recommended by a friend? Let us know by sending an email to our producer, Emily, at emily.kochanekketner@logrocket.com (mailto:emily.kochanekketner@logrocket.com), or tweet at us at PodRocketPod (https://twitter.com/PodRocketpod). Follow us. Get free stickers. Follow us on Apple Podcasts, fill out this form (https://podrocket.logrocket.com/get-podrocket-stickers), and we'll send you free PodRocket stickers! What does LogRocket do? LogRocket provides AI-first session replay and analytics that surfaces the UX and technical issues impacting user experiences. Start understand where your users are struggling by trying it for free at [LogRocket.com]. Try LogRocket for free today.(https://logrocket.com/signup/?pdr)

Everyday AI Podcast – An AI and ChatGPT Podcast
Ep 306: Our 2024 AI Predictions Midterm Report – how did we do?

Everyday AI Podcast – An AI and ChatGPT Podcast

Play Episode Listen Later Jul 2, 2024 69:19


Send Everyday AI and Jordan a text messageIn December 2023, we laid out 24 spicy hot takes on the future of Generative AI. We're giving ourselves a midterm report as we hit the halfway point of 2024 to see if we crushed our predictions, or if the nonstop pace of GenAI crushed us. Newsletter: Sign up for our free daily newsletterMore on this Episode: Episode PageJoin the discussion: Ask Jordan questions on AIRelated Episodes: Ep 124: 5 Ways Generative AI Shows Up in 2024 Ep 171: GenAI in 2024 – What's coming and what it means for youUpcoming Episodes: Check out the upcoming Everyday AI Livestream lineupWebsite: YourEverydayAI.comEmail The Show: info@youreverydayai.comConnect with Jordan on LinkedInTopics Covered in This Episode:1. Review of 24 predictions for 20242. AI Robotics3. AI Misinformation and Legislation4. Large Language Models5. Future of Big Tech Companies6. Future of GenAI7. Competition against NVIDIA8. Predictions about OpenAI9. Job Market Changes due to AITimestamps:02:50 A look at our 2024 AI predictions.05:52 Remote workers struggle to access AI systems.11:11 Excited attendee sees NVIDIA's unveiling of robotics.20:57 Training large language models requires quality data, not bots and trash content from Twitter.26:02 GPT-4: Future desktop assistant with agent capabilities.28:39 Rag becoming a buzzword in AI industry.34:27 Copyright infringement lawsuits expected for large companies.37:47 Expect number 9, big changes in AI.43:09 NVIDIA's stock split made them top company.51:33 Many AI tools can create products quickly.57:29 AI investment leads to successful layoffs in 2024.59:04 Midway through 2024, what about AI?01:06:32 20 out of 24 predictions are true.Keywords:Jordan Wilson, generative AI, AI predictions, second computer AI, individualized learning, AI robotics, AI misinformation, AI legislation, AI watermarks, Grok, Apple AI, home assistants, AI agents, legacy media companies, large language models, Gen AI training, Microsoft, Amazon, NVIDIA, compute resources, Inflexion AI acquisition, AI future of work, agent capabilities, LangChain, retrieval augmentive generation (RAG), mini-RAG, traditional SEO decline, copyright battles in AI, OpenAI, video AI. Get more out of ChatGPT by learning our PPP method in this live, interactive and free training! Sign up now: https://youreverydayai.com/ppp-registration/

Syntax - Tasty Web Development Treats
784: Logging × Blogging × Testing × Freelancing

Syntax - Tasty Web Development Treats

Play Episode Listen Later Jun 19, 2024 55:36


In this Potluck episode, Scott and Wes tackle listener questions on modern blogging, website environmental impact, and using LangChain with LLMs. They also cover CSS hyphens, unit vs. integration testing, and balancing web development with new parenthood. Show Notes 00:00 Welcome to Syntax! 00:13 How to submit a question for future episodes. Potluck Questions. 02:46 Brought to you by Sentry.io. 03:21 Logging from a site. 08:39 Blogging in 2024. 11:49 Sharing website environmental data. Green Web Foundation. Website Carbon Calculator. Syntax Site Results. Scott's Site Results. 17:38 Using LangChain when working with LLMs. 21:03 CSS Hyphens and Overflow-Wrap. Hyphens Browser Compatibility. Overflow-Wrap. 25:52 Similarities between WASM, JVM and .NET. 27:25 Writing unit testing and integration testing. 32:00 How can new parents stay current on web development trends? 34:47 Working globally as a freelance developer. 37:26 Scott's audio setup. Why audio interfaces have DSP built in. ChaseBliss Pedal. 43:04 UI libraries for synth/audio plugins. 44:02 CSS module scripts. CSS Modules in CSS Module Scripts. 48:39 Sick Picks + Shameless Plugs. Sick Picks Scott: Deep Cover Podcast. Wes: Pressure Washer Surface Cleaner. Shameless Plugs Wes: Syntax.fm/videos. Hit us up on Socials! Syntax: X Instagram Tiktok LinkedIn Threads Wes: X Instagram Tiktok LinkedIn Threads Scott: X Instagram Tiktok LinkedIn Threads Randy: X Instagram YouTube Threads