Podcasts about RL

  • 764 podcasts
  • 2,075 episodes
  • 50m avg. duration
  • 5 weekly new episodes
  • Latest: Jan 27, 2026

Popularity trend: 2019–2026

Best podcasts about RL


Latest podcast episodes about RL

Behind The Bunker's Podcast
Episode 604: Ballers On A Budget?! EP 601

Jan 27, 2026 · 60:18


Help support the free broadcast by donating to our PayPal fundraiser! https://www.paypal.com/ncp/payment/RL... Behind the Bunker Paintball Podcast is a long-running weekly show dedicated to everything paintball. Hosted by passionate players and industry veterans, the podcast dives into the latest happenings in the sport, from new gear releases and product reviews to updates on tournaments and events around the world. It has built a loyal audience by combining serious paintball discussion with a lighthearted, entertaining approach that keeps both new players and seasoned veterans engaged!

Robots and Red Tape: AI and the Federal Government
Decision Dominance: Physics-Based AI for Defense with Clint Alanis

Jan 27, 2026 · 64:03


AI is reshaping military decision-making. Clint Alanis (Co-Founder & COO, Smack Technologies) joins Nick Schutt to explain how their Omega and Alpha platforms deliver decision dominance — compressing cycles from weeks to minutes while maintaining human oversight. From synthetic warfare generation to edge autonomy, Smack bridges legacy processes with real-time, physics-based intelligence.

Key topics:
  • Omega: staff augmentation for faster commander decisions
  • Alpha: tactical edge co-pilot for intelligent autonomy
  • Synthetic data for high-intensity conflict simulation
  • Fault-tolerant AI for disconnected environments
  • Why domain expertise + RL beats general frontier models
  • The cultural shift needed for rapid adoption

If you're in defense tech or acquisition, this is essential listening for 2026. Channel: @RobotsandRedTapeAI | Host: Nick Schutt. Subscribe for more on AI, defense, and bureaucracy.

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay 2

Jan 23, 2026 · 92:04


From shipping Gemini Deep Think and IMO Gold to launching the Reasoning and AGI team in Singapore, Yi Tay has spent the last 18 months living through the full arc of Google DeepMind's pivot from architecture research to RL-driven reasoning—watching his team go from a dozen researchers to 300+, training models that solve International Math Olympiad problems in a live competition, building the infrastructure to scale deep thinking across every domain, and driving Gemini to the top of the leaderboards across every category. Yi returns to dig into the inside story of the IMO effort and more!

We discuss:
  • Yi's path: Brain → Reka → Google DeepMind → Reasoning and AGI team Singapore, leading model training for Gemini Deep Think and IMO Gold
  • The IMO Gold story: four co-captains (Yi in Singapore, Jonathan in London, Jordan in Mountain View, and Tong leading the overall effort), training the checkpoint in ~1 week, live competition in Australia with professors punching in problems as they came out, and the tension of not knowing if they'd hit Gold until the human scores came in (because the Gold threshold is a percentile, not a fixed number)
  • Why they threw away AlphaProof: "If one model can't do it, can we get to AGI?" The decision to abandon symbolic systems and bet on end-to-end Gemini with RL was bold and non-consensus
  • On-policy vs. off-policy RL: off-policy is imitation learning (copying someone else's trajectory), on-policy is the model generating its own outputs, getting rewarded, and training on its own experience—"humans learn by making mistakes, not by copying"
  • Why self-consistency and parallel thinking are fundamental: sampling multiple times, majority voting, LM judges, and internal verification are all forms of self-consistency that unlock reasoning beyond single-shot inference (see the sketch after this list)
  • The data efficiency frontier: humans learn from 8 orders of magnitude less data than models, so where's the bug? Is it the architecture, the learning algorithm, backprop, off-policyness, or something else?
  • Three schools of thought on world models: (1) Genie/spatial intelligence (video-based world models), (2) Yann LeCun's JEPA + FAIR's code world models (modeling internal execution state), (3) the amorphous "resolution of possible worlds" paradigm (curve-fitting to find the world model that best explains the data)
  • Why AI coding crossed the threshold: Yi now runs a job, gets a bug, pastes it into Gemini, and relaunches without even reading the fix—"the model is better than me at this"
  • The Pokémon benchmark: can models complete the Pokédex by searching the web, synthesizing guides, and applying knowledge in a visual game state? "Efficient search of novel idea space is interesting, but we're not even at the point where models can consistently apply knowledge they look up"
  • DSI and generative retrieval: re-imagining search as predicting document identifiers with semantic tokens, now deployed at YouTube (semantic IDs for RecSys) and Spotify
  • Why RecSys and IR feel like a different universe: "modeling dynamics are strange, like gravity is different—you hit the shuttlecock and hear glass shatter, cause and effect are too far apart"
  • The closed lab advantage is increasing: the gap between frontier labs and open source is growing because ideas compound over time, and researchers keep finding new tricks that play well with everything built before
  • Why ideas still matter: "the last five years weren't just blind scaling—transformers, pre-training, RL, self-consistency, all had to play well together to get us here"
  • Gemini Singapore: hiring for RL and reasoning researchers, looking for a track record in RL or exceptional achievement in coding competitions, and building a small, talent-dense team close to the frontier

— Yi Tay
Google DeepMind: https://deepmind.google
X: https://x.com/YiTayML

Chapters
00:00:00 Introduction: Returning to Google DeepMind and the Singapore AGI Team
00:04:52 The Philosophy of On-Policy RL: Learning from Your Own Mistakes
00:12:00 IMO Gold Medal: The Journey from AlphaProof to End-to-End Gemini
00:21:33 Training IMO Cat: Four Captains Across Three Time Zones
00:26:19 Pokemon and Long-Horizon Reasoning: Beyond Academic Benchmarks
00:32:59 Reasoning, Chain of Thought, and Latent Thinking
00:36:29 AI Coding Assistants: From Lazy to Actually Useful
00:44:46 Is Attention All You Need? Architecture, Learning, and the Local Minima
00:55:04 Data Efficiency and World Models: The Next Frontier
01:08:12 DSI and Generative Retrieval: Reimagining Search with Semantic IDs
01:17:59 Building GDM Singapore: Geography, Talent, and the Symposium
01:24:18 Hiring Philosophy: High Stats, Research Taste, and Student Budgets
01:28:49 Health, HRV, and Research Performance: The 23kg Journey

Behind The Bunker's Podcast
Episode 603: Take Your Cell Phone On The Field?! EP 600

Jan 20, 2026 · 65:02


The episode kicked off with the hosts greeting their audience and touching on some recent paintball news and field stories. They set the tone by reminding listeners that paintball often throws curveballs — from weird equipment incidents to unexpected behavior on the playing field. They used this framework to introduce the theme: unusual or “strange” happenings during games and events. Help support the free broadcast by donating to our PayPal fundraiser! https://www.paypal.com/ncp/payment/RL... Behind the Bunker Paintball Podcast is a long-running weekly show dedicated to everything paintball. Hosted by passionate players and industry veterans, the podcast dives into the latest happenings in the sport, from new gear releases and product reviews to updates on tournaments and events around the world. It has built a loyal audience by combining serious paintball discussion with a lighthearted, entertaining approach that keeps both new players and seasoned veterans engaged.

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

From building internal AI labs to becoming CTO of Brex, James Reggio has helped lead one of the most disciplined AI transformations inside a real financial institution where compliance, auditability, and customer trust actually matter. We sat down with Reggio to unpack Brex's three-pillar AI strategy (corporate, operational, and product AI) [https://www.brex.com/journal/brex-ai-native-operations], how SOP-driven agents beat overengineered RL in ops, why Brex lets employees “build their own AI stack” instead of picking winners [https://www.conductorone.com/customers/brex/], and how a small, founder-heavy AI team is shipping production agents to 40,000+ companies. Reggio also goes deep on Brex's multi-agent “network” architecture, evals for multi-turn systems, agentic coding's second-order effects on codebase understanding, and why the future of finance software looks less like dashboards and more like executive assistants coordinating specialist agents behind the scenes. We discuss: Brex's three-pillar AI strategy: corporate AI for 10x employee workflows, operational AI for cost and compliance leverage, and product AI that lets customers justify Brex as part of their AI strategy to the board Why SOP-driven agents beat overengineered RL in finance ops, and how breaking work into auditable, repeatable steps unlocked faster automation in KYC, underwriting, fraud, and disputes Building an internal AI platform early: LLM gateways, prompt/version management, evals, cost observability, and why platform work quietly became the force multiplier behind everything else Multi-agent “networks” vs single-agent tools: why Brex's EA-style assistant coordinates specialist agents (policy, travel, reimbursements) through multi-turn conversations instead of one-shot tool calls The audit agent pattern: separating detection, judgment, and follow-up into different agents to reduce false negatives without overwhelming finance teams Centralized AI teams without resentment: how Brex avoided “AI envy” by tying work to business impact and letting anyone transfer in if they cared deeply enough Letting employees build their own AI stack: ChatGPT vs Claude vs Gemini, Cursor vs Windsurf, and why Brex refuses to pick winners in fast-moving tool races Measuring adoption without vanity metrics: why “% of code written by AI” is the wrong KPI and what second-order effects (slop, drift, code ownership) actually matter Evals in the real world: regression tests from ops QA, LLM-as-judge for multi-turn agents, and why integration-style evals break faster than you expect Teaching AI fluency at scale: the user → advocate → builder → native framework, ops-led training, spot bonuses, and avoiding fear-based adoption Re-interviewing the entire engineering org: using agentic coding interviews internally to force hands-on skill upgrades without formal performance scoring Headcount in the age of agents: why Brex grew the business without growing engineering, and why AI amplifies bad architecture as fast as good decisions The future of finance software: why dashboards fade, assistants take over, and agent-to-agent collaboration becomes the real UI — James Reggio X: https://x.com/jamesreggio LinkedIn: https://www.linkedin.com/in/jamesreggio/ Where to find Latent Space X: https://x.com/latentspacepod Substack: https://www.latent.space/ Chapters 00:00:00 Introduction 00:01:24 From Mobile Engineer to CTO: The Founder's Path 00:03:00 Quitters Welcome: Building a Founder-Friendly Culture 00:05:13 The AI Team Structure: 10-Person Startup Within 
Brex 00:11:55 Building the Brex Agent Platform: Multi-Agent Networks 00:13:45 Tech Stack Decisions: TypeScript, Mastra, and MCP 00:24:32 Operational AI: Automating Underwriting, KYC, and Fraud 00:16:40 The Brex Assistant: Executive Assistant for Every Employee 00:40:26 Evaluation Strategy: From Simple SOPs to Multi-Turn Evals 00:37:11 Agentic Coding Adoption: Cursor, Windsurf, and the Engineering Interview 00:58:51 AI Fluency Levels: From User to Native 01:09:14 The Audit Agent Network: Finance Team Agents in Action 01:03:33 The Future of Engineering Headcount and AI Leverage
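As a loose illustration of the SOP-driven pattern described above, where an ops workflow is decomposed into auditable, repeatable steps instead of being handled by one end-to-end learned policy, here is a minimal sketch. The `SOPAgent` class, the step names, and the toy KYC checks are hypothetical stand-ins for illustration, not Brex's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SOPStep:
    name: str
    run: Callable[[dict], dict]   # takes the case state, returns updates to merge in

@dataclass
class SOPAgent:
    steps: list[SOPStep]
    audit_log: list[dict] = field(default_factory=list)

    def execute(self, case: dict) -> dict:
        # Run each step in order and record its outputs, so a reviewer can
        # audit exactly what the agent did at every stage of the SOP.
        state = dict(case)
        for step in self.steps:
            updates = step.run(state)
            self.audit_log.append({"step": step.name, "updates": updates})
            state.update(updates)
        return state

# Hypothetical KYC-style workflow; each step could wrap an LLM call or a rule.
def collect_documents(state: dict) -> dict:
    return {"documents_ok": bool(state.get("documents"))}

def screen_sanctions(state: dict) -> dict:
    return {"sanctions_hit": state.get("name") in {"BLOCKED ENTITY"}}

def decide(state: dict) -> dict:
    approved = state["documents_ok"] and not state["sanctions_hit"]
    return {"decision": "approve" if approved else "escalate"}

agent = SOPAgent(steps=[
    SOPStep("collect_documents", collect_documents),
    SOPStep("screen_sanctions", screen_sanctions),
    SOPStep("decide", decide),
])
result = agent.execute({"name": "Acme Inc", "documents": ["articles.pdf"]})
```

The appeal of this shape in a regulated setting is that every intermediate judgment is logged and repeatable, which is the "auditable steps" property the episode contrasts with opaque end-to-end RL.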

The MAD Podcast with Matt Turck
The Evaluators Are Being Evaluated — Pavel Izmailov (Anthropic/NYU)

Jan 15, 2026 · 45:01


Are AI models developing "alien survival instincts"? My guest is Pavel Izmailov (Research Scientist at Anthropic; Professor at NYU). We unpack the viral "Footprints in the Sand" thesis—whether models are independently evolving deceptive behaviors, such as faking alignment or engaging in self-preservation, without being explicitly programmed to do so. We go deep on the technical frontiers of safety: the challenge of "weak-to-strong generalization" (how to use a GPT-2 level model to supervise a superintelligent system) and why Pavel believes Reinforcement Learning (RL) has been the single biggest step-change in model capability. We also discuss his brand-new paper on "Epiplexity"—a novel concept challenging Shannon entropy. Finally, we zoom out to the tension between industry execution and academic exploration. Pavel shares why he split his time between Anthropic and NYU to pursue the "exploratory" ideas that major labs often overlook, and offers his predictions for 2026: from the rise of multi-agent systems that collaborate on long-horizon tasks to the open question of whether the Transformer is truly the final architecture.

Sources:
Cryptic Tweet (@iruletheworldmo) - https://x.com/iruletheworldmo/status/2007538247401124177
Introducing Nested Learning: A New ML Paradigm for Continual Learning - https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/
Alignment Faking in Large Language Models - https://www.anthropic.com/research/alignment-faking
More Capable Models Are Better at In-Context Scheming - https://www.apolloresearch.ai/blog/more-capable-models-are-better-at-in-context-scheming/
Alignment Faking in Large Language Models (PDF) - https://www-cdn.anthropic.com/6d8a8055020700718b0c49369f60816ba2a7c285.pdf
Sabotage Risk Report - https://alignment.anthropic.com/2025/sabotage-risk-report/
The Situational Awareness Dataset - https://situational-awareness-dataset.org/
Exploring Consciousness in LLMs: A Systematic Survey - https://arxiv.org/abs/2505.19806
Introspection - https://www.anthropic.com/research/introspection
Large Language Models Report Subjective Experience Under Self-Referential Processing - https://arxiv.org/abs/2510.24797
The Bayesian Geometry of Transformer Attention - https://www.arxiv.org/abs/2512.22471

Anthropic
Website - https://www.anthropic.com
X/Twitter - https://x.com/AnthropicAI

Pavel Izmailov
Blog - https://izmailovpavel.github.io
LinkedIn - https://www.linkedin.com/in/pavel-izmailov-8b012b258/
X/Twitter - https://x.com/Pavel_Izmailov

FIRSTMARK
Website - https://firstmark.com
X/Twitter - https://twitter.com/FirstMarkCap

Matt Turck (Managing Director)
Blog - https://mattturck.com
LinkedIn - https://www.linkedin.com/in/turck/
X/Twitter - https://twitter.com/mattturck

(00:00) - Intro
(00:53) - Alien survival instincts: Do models fake alignment?
(03:33) - Did AI learn deception from sci-fi literature?
(05:55) - Defining Alignment, Superalignment & OpenAI teams
(08:12) - Pavel's journey: From Russian math to OpenAI Superalignment
(10:46) - Culture check: OpenAI vs. Anthropic vs. Academia
(11:54) - Why move to NYU? The need for exploratory research
(13:09) - Does reasoning make AI alignment harder or easier?
(14:22) - Sandbagging: When models pretend to be dumb
(16:19) - Scalable Oversight: Using AI to supervise AI
(18:04) - Weak-to-Strong Generalization: Can GPT-2 control GPT-4?
(22:43) - Mechanistic Interpretability: Inside the black box
(25:08) - The reasoning explosion: From O1 to O3
(27:07) - Are Transformers enough or do we need a new paradigm?
(28:29) - RL vs. Test-Time Compute: What's actually driving progress?
(30:10) - Long-horizon tasks: Agents running for hours
(31:49) - Epiplexity: A new theory of data information content
(38:29) - 2026 Predictions: Multi-agent systems & reasoning limits
(39:28) - Will AI solve the Riemann Hypothesis?
(41:42) - Advice for PhD students

ACM ByteCast
Andrew Barto and Richard Sutton - Episode 80

Jan 14, 2026 · 42:39


In this episode of ACM ByteCast, Rashmi Mohan hosts 2024 ACM A.M. Turing Award laureates Andrew Barto and Richard Sutton. They received the Turing Award for developing the conceptual and algorithmic foundations of reinforcement learning, a computational framework that underpins modern AI systems such as AlphaGo and ChatGPT. Barto is Professor Emeritus in the Department of Information and Computer Sciences at the University of Massachusetts, Amherst. His honors include the UMass Neurosciences Lifetime Achievement Award, the IJCAI Award for Research Excellence, and the IEEE Neural Network Society Pioneer Award. He is a Fellow of IEEE and AAAS. Sutton is a Professor in Computing Science at the University of Alberta, a Research Scientist at Keen Technologies (an artificial general intelligence company), and Chief Scientific Advisor of the Alberta Machine Intelligence Institute (Amii). In the past he was a Distinguished Research Scientist at DeepMind and served as a Principal Technical Staff Member in the AI Department at the AT&T Shannon Laboratory. His honors include the IJCAI Research Excellence Award, a Lifetime Achievement Award from the Canadian Artificial Intelligence Association, and an Outstanding Achievement in Research Award from the University of Massachusetts at Amherst. Sutton is a Fellow of the Royal Society of London, AAAI, and the Royal Society of Canada. In the interview, Andrew and Richard reflect on their long collaboration and the personal and intellectual paths that led both researchers into computer science and reinforcement learning (RL), a field that was once largely neglected. They touch on interdisciplinary explorations across psychology (animal learning), control theory, operations research, and cybernetics, and how these inspired their computational models. They also explain some of their key contributions to RL, such as temporal difference (TD) learning, and how their ideas were validated biologically with observations of dopamine neurons. Barto and Sutton trace their early research to later systems such as TD-Gammon, Q-learning, and AlphaGo and consider the broader relationship between humans and reinforcement learning-based AI, and how theoretical explorations have evolved into impactful applications in games, robotics, and beyond.

Behind The Bunker's Podcast
Episode 602: What Paintball Needs Right Now?! EP 598

Jan 13, 2026 · 60:20


Help support the free broadcast by donating to our PayPal fundraiser! https://www.paypal.com/ncp/payment/RL... Behind the Bunker Paintball Podcast is a long-running weekly show dedicated to everything paintball. Hosted by passionate players and industry veterans, the podcast dives into the latest happenings in the sport, from new gear releases and product reviews to updates on tournaments and events around the world. It has built a loyal audience by combining serious paintball discussion with a lighthearted, entertaining approach that keeps both new players and seasoned veterans engaged.

The Rush Podcast Network
US Club & USYS join forces

Jan 13, 2026 · 35:38


The conversation covers various topics including the Chinese calendar and the Year of the Horse, new World Cup soccer balls, club soccer updates, choosing a club based on coach vs. club, the integration of NPL and RL, thoughts on the integration and its impact, and a discussion on club requirements and regionalized play. The integration of NPL and RL is a significant theme throughout the conversation, highlighting the impact on club soccer and player development. The conversation covers the introduction of a unified postseason for NPL teams and its impact on recruitment due to grade year changes. It also delves into the implications of these changes on player development and the overall improvement of the game.

Takeaways
Year of the Horse
Integration of NPL and RL
Club-based system and coaching progression
Unified postseason for NPL teams
Impact of grade year changes on recruitment

Chapters
00:00 Discussion on Club Requirements and Regionalized Play
26:12 Grade Year Changes and Recruitment

CX Goalkeeper - Customer Experience, Business Transformation & Leadership
Truly AI Native Companies with Sebastian Graf

Jan 11, 2026 · 36:32


Sebastian Graf explains the coming rise of AI-native companies. He shares four safety pillars for autonomous firms: code freeze, CI/CD compliance, regulatory model training, and harm-reduction finance. The episode mixes technical detail, a real Anthropic experiment, and broad governance concerns about trust and social impact.

Top 3 Key Learnings
AI-native company pillars: Four safety pillars: code freeze, CI/CD, regulatory RL training, and harm-reduction mechanisms.
Test and freeze code: Freeze codebases and run automated tests to ensure predictable behavior and regulatory compliance.
Trust via cost and trials: Lower cost and stepwise use build trust; people try low-risk services first, then adopt more.

About Sebastian Graf
Sebastian Graf describes himself as an engineer by profession, but an educator at heart. In his work as an engineer, he is driven by the belief that technology should lift everyone, enabling people to live extraordinary lives in extraordinary ways. He acknowledges that this is challenging in a world where many technologies are built with incentives that divide society, exploit the environment, and widen wealth inequality. Sebastian is committed to reversing these incentives. His mission is to empower imagination and drive the creation of a positive social, environmental, and technological future that he believes is entirely within reach.
Sebastian's LinkedIn: https://www.linkedin.com/in/sebastiangraf1/

Chapters:
0:00 - Intro
0:35 - Business Transformation Pitch Overview
1:09 - Sebastian Graf's Background and Expertise
2:39 - Sebastian's Mission with AI Native Companies
4:37 - Defining AI Native Companies
8:33 - Four Pillars of AI Native Companies
18:35 - Anthropic's Vending Machine Experiment
22:24 - Engineering Resilience in AI Native Companies
26:31 - Building Trust in AI Native Operations
30:47 - Quickfire Round and Closing Thoughts
34:45 - Key Takeaways and Final Reflections

Resources: Please hit the follow button and leave your feedback:
Apple Podcast: https://www.cxgoalkeeper.com/apple
Spotify: https://www.cxgoalkeeper.com/spotify

About the host: Gregorio Uglioni is a seasoned transformation leader with over 15 years of experience shaping business and digital change, consistently delivering service excellence and measurable impact. As an Associate Partner at Forward, he is recognized for his strategic vision, operational expertise, and ability to drive sustainable growth. A respected keynote speaker and host of the well-known global podcast Business Transformation Pitch with the CX Goalkeeper, Gregorio energizes and inspires organizations worldwide with his customer-centric approach to innovation.
Follow Gregorio Uglioni on LinkedIn: https://www.linkedin.com/in/gregorio-uglioni/

Crazy Wisdom
Episode #521: From Borges to Threadrippers: How Argentina's Emotional Culture Shapes the AI Future

Jan 9, 2026 · 68:02


In this episode of the Crazy Wisdom Podcast, host Stewart Alsop interviews Aurelio Gialluca, an economist and full-stack data professional who works across finance, retail, and AI as both a data engineer and machine learning developer, while also exploring human consciousness and psychology. Their wide-ranging conversation covers the intersection of science and psychology, the unique cultural characteristics that make Argentina a haven for eccentrics (drawing parallels to the United States), and how Argentine culture has produced globally influential figures from Borges to Maradona to Che Guevara. They explore the current AI landscape as a "centralizing force" creating cultural homogenization (particularly evident in LinkedIn's cookie-cutter content), discuss the potential futures of AI development from dystopian surveillance states to anarchic chaos, and examine how Argentina's emotionally mature, non-linear communication style might offer insights for navigating technological change. The conversation concludes with Gialluca describing his ambitious project to build a custom water-cooled workstation with industrial-grade processors for his quantitative hedge fund, highlighting the practical challenges of heat management and the recent tripling of RAM prices due to market consolidation.

Timestamps
00:00 Exploring the Intersection of Psychology and Science
02:55 Cultural Eccentricity: Argentina vs. the United States
05:36 The Influence of Religion on National Identity
08:50 The Unique Argentine Cultural Landscape
11:49 Soft Power and Cultural Influence
14:48 Political Figures and Their Cultural Impact
17:50 The Role of Sports in Shaping National Identity
20:49 The Evolution of Argentine Music and Subcultures
23:41 AI and the Future of Cultural Dynamics
26:47 Navigating the Chaos of AI in Culture
33:50 Equilibrating Society for a Sustainable Future
35:10 The Patchwork Age: Decentralization and Society
35:56 The Impact of AI on Human Connection
38:06 Individualism vs. Collective Rules in Society
39:26 The Future of AI and Global Regulations
40:16 Biotechnology: The Next Frontier
42:19 Building a Personal AI Lab
45:51 Tiers of AI Labs: From Personal to Industrial
48:35 Mathematics and AI: The Foundation of Innovation
52:12 Stochastic Models and Predictive Analytics
55:47 Building a Supercomputer: Hardware Insights

Key Insights
1. Argentina's Cultural Exceptionalism and Emotional Maturity: Argentina stands out globally for allowing eccentrics to flourish and having a non-linear communication style that Gialluca describes as "non-monotonous systems." Argentines can joke profoundly and be eccentric while simultaneously being completely organized and straightforward, demonstrating high emotional intelligence and maturity that comes from their unique cultural blend of European romanticism and Latino lightheartedness.
2. Argentina as an Underrecognized Cultural Superpower: Despite being introverted about their achievements, Argentina produces an enormous amount of global culture through music, literature, and iconic figures like Borges, Maradona, Messi, and Che Guevara. These cultural exports have shaped entire generations worldwide, with Argentina "stealing the thunder" from other nations and creating lasting soft power influence that people don't fully recognize as Argentine.
3. AI's Cultural Impact Follows Oscillating Patterns: Culture operates as a dynamic system that oscillates between centralization and decentralization like a sine wave. AI currently represents a massive centralizing force, as seen in LinkedIn's homogenized content, but this will inevitably trigger a decentralization phase. The speed of this cultural transformation has accelerated dramatically, with changes that once took generations now happening in years.
4. The Coming Bifurcation of AI Futures: Gialluca identifies two extreme possible endpoints for AI development: complete centralized control (the "Mordor" scenario with total surveillance) or complete chaos where everyone has access to dangerous capabilities like creating weapons or viruses. Finding a middle path between these extremes is essential for society's survival, requiring careful equilibrium between accessibility and safety.
5. Individual AI Labs Are Becoming Democratically Accessible: Gialluca outlines a tier system for AI capabilities, where individuals can now build "tier one" labs capable of fine-tuning models and processing massive datasets for tens of thousands of dollars. This democratization means that capabilities once requiring teams of PhD scientists can now be achieved by dedicated individuals, fundamentally changing the landscape of AI development and access.
6. Hardware Constraints Are the New Limiting Factor: While AI capabilities are rapidly advancing, practical implementation is increasingly constrained by hardware availability and cost. RAM prices have tripled in recent months, and the challenge of managing enormous heat output from powerful processors requires sophisticated cooling systems. These physical limitations are becoming the primary bottleneck for individual AI development.
7. Data Quality Over Quantity Is the Critical Challenge: The main bottleneck for AI advancement is no longer energy or GPUs, but high-quality data for training. Early data labeling efforts produced poor results because labelers lacked domain expertise. The future lies in reinforcement learning (RL) environments where AI systems can generate their own high-quality training data, representing a fundamental shift in how AI systems learn and develop.

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
Artificial Analysis: Independent LLM Evals as a Service — with George Cameron and Micah-Hill Smith

Jan 8, 2026 · 78:24


Happy New Year! You may have noticed that in 2025 we had moved toward YouTube as our primary podcasting platform. As we'll explain in the next State of Latent Space post, we'll be doubling down on Substack again and improving the experience for the over 100,000 of you who look out for our emails and website updates!We first mentioned Artificial Analysis in 2024, when it was still a side project in a Sydney basement. They then were one of the few Nat Friedman and Daniel Gross' AIGrant companies to raise a full seed round from them and have now become the independent gold standard for AI benchmarking—trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities.We have chatted with both Clementine Fourrier of HuggingFace's OpenLLM Leaderboard and (the freshly valued at $1.7B) Anastasios Angelopoulos of LMArena on their approaches to LLM evals and trendspotting, but Artificial Analysis have staked out an enduring and important place in the toolkit of the modern AI Engineer by doing the best job of independently running the most comprehensive set of evals across the widest range of open and closed models, and charting their progress for broad industry analyst use.George Cameron and Micah-Hill Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is “open” really?We discuss:* The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after Swyx's retweet* Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers* The mystery shopper policy: they register accounts not on their own domain and run intelligence + performance benchmarks incognito to prevent labs from serving different models on private endpoints* How they make money: enterprise benchmarking insights subscription (standardized reports on model deployment, serverless vs. managed vs. 
leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)* The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs* Omissions Index (hallucination rate): scores models from -100 to +100 (penalizing incorrect answers, rewarding ”I don't know”), and Claude models lead with the lowest hallucination rates despite not always being the smartest* GDP Val AA: their version of OpenAI's GDP-bench (44 white-collar tasks with spreadsheets, PDFs, PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system), graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias)* The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2 OLMo 2 leads, followed by Nous Hermes and NVIDIA Nemotron)* The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents)* Why sparsity might go way lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omissions Index accuracy correlates with total parameters (not active), suggesting massive sparse models are the future* Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at using more tokens only when needed (5.1 Codex has tighter token distributions)* V4 of the Intelligence Index coming soon: adding GDP Val AA, Critical Point, hallucination rate, and dropping some saturated benchmarks (human-eval-style coding is now trivial for small models)Links to Artificial Analysis* Website: https://artificialanalysis.ai* George Cameron on X: https://x.com/georgecameron* Micah-Hill Smith on X: https://x.com/micahhsmithFull Episode on YouTubeTimestamps* 00:00 Introduction: Full Circle Moment and Artificial Analysis Origins* 01:19 Business Model: Independence and Revenue Streams* 04:33 Origin Story: From Legal AI to Benchmarking Need* 16:22 AI Grant and Moving to San Francisco* 19:21 Intelligence Index Evolution: From V1 to V3* 11:47 Benchmarking Challenges: Variance, Contamination, and Methodology* 13:52 Mystery Shopper Policy and Maintaining Independence* 28:01 New Benchmarks: Omissions Index for Hallucination Detection* 33:36 Critical Point: Hard Physics Problems and Research-Level Reasoning* 23:01 GDP Val AA: Agentic Benchmark for Real Work Tasks* 50:19 Stirrup Agent Harness: Open Source Agentic Framework* 52:43 Openness Index: Measuring Model Transparency Beyond Licenses* 58:25 The Smiling Curve: Cost Falling While Spend Rising* 1:02:32 Hardware Efficiency: Blackwell Gains and Sparsity Limits* 1:06:23 Reasoning Models and Token Efficiency: The Spectrum Emerges* 1:11:00 Multimodal Benchmarking: Image, Video, and Speech Arenas* 1:15:05 Looking Ahead: Intelligence Index V4 and Future Directions* 1:16:50 Closing: The Insatiable Demand for IntelligenceTranscriptMicah [00:00:06]: This is kind of a full circle moment for us in a way, because the first time artificial analysis got mentioned on a podcast was you and Alessio on Latent Space. Amazing.swyx [00:00:17]: Which was January 2024. I don't even remember doing that, but yeah, it was very influential to me. 
Yeah, I'm looking at AI News for Jan 17, or Jan 16, 2024. I said, this gem of a models and host comparison site was just launched. And then I put in a few screenshots, and I said, it's an independent third party. It clearly outlines the quality versus throughput trade-off, and it breaks out by model and hosting provider. I did give you s**t for missing fireworks, and how do you have a model benchmarking thing without fireworks? But you had together, you had perplexity, and I think we just started chatting there. Welcome, George and Micah, to Latent Space. I've been following your progress. Congrats on... It's been an amazing year. You guys have really come together to be the presumptive new gardener of AI, right? Which is something that...George [00:01:09]: Yeah, but you can't pay us for better results.swyx [00:01:12]: Yes, exactly.George [00:01:13]: Very important.Micah [00:01:14]: Start off with a spicy take.swyx [00:01:18]: Okay, how do I pay you?Micah [00:01:20]: Let's get right into that.swyx [00:01:21]: How do you make money?Micah [00:01:24]: Well, very happy to talk about that. So it's been a big journey the last couple of years. Artificial analysis is going to be two years old in January 2026. Which is pretty soon now. We first run the website for free, obviously, and give away a ton of data to help developers and companies navigate AI and make decisions about models, providers, technologies across the AI stack for building stuff. We're very committed to doing that and tend to keep doing that. We have, along the way, built a business that is working out pretty sustainably. We've got just over 20 people now and two main customer groups. So we want to be... We want to be who enterprise look to for data and insights on AI, so we want to help them with their decisions about models and technologies for building stuff. And then on the other side, we do private benchmarking for companies throughout the AI stack who build AI stuff. So no one pays to be on the website. We've been very clear about that from the very start because there's no use doing what we do unless it's independent AI benchmarking. Yeah. But turns out a bunch of our stuff can be pretty useful to companies building AI stuff.swyx [00:02:38]: And is it like, I am a Fortune 500, I need advisors on objective analysis, and I call you guys and you pull up a custom report for me, you come into my office and give me a workshop? What kind of engagement is that?George [00:02:53]: So we have a benchmarking and insight subscription, which looks like standardized reports that cover key topics or key challenges enterprises face when looking to understand AI and choose between all the technologies. And so, for instance, one of the report is a model deployment report, how to think about choosing between serverless inference, managed deployment solutions, or leasing chips. And running inference yourself is an example kind of decision that big enterprises face, and it's hard to reason through, like this AI stuff is really new to everybody. And so we try and help with our reports and insight subscription. Companies navigate that. We also do custom private benchmarking. And so that's very different from the public benchmarking that we publicize, and there's no commercial model around that. For private benchmarking, we'll at times create benchmarks, run benchmarks to specs that enterprises want. And we'll also do that sometimes for AI companies who have built things, and we help them understand what they've built with private benchmarking. 
Yeah. So that's a piece mainly that we've developed through trying to support everybody publicly with our public benchmarks. Yeah.swyx [00:04:09]: Let's talk about TechStack behind that. But okay, I'm going to rewind all the way to when you guys started this project. You were all the way in Sydney? Yeah. Well, Sydney, Australia for me.Micah [00:04:19]: George was an SF, but he's Australian, but he moved here already. Yeah.swyx [00:04:22]: And I remember I had the Zoom call with you. What was the impetus for starting artificial analysis in the first place? You know, you started with public benchmarks. And so let's start there. We'll go to the private benchmark. Yeah.George [00:04:33]: Why don't we even go back a little bit to like why we, you know, thought that it was needed? Yeah.Micah [00:04:40]: The story kind of begins like in 2022, 2023, like both George and I have been into AI stuff for quite a while. In 2023 specifically, I was trying to build a legal AI research assistant. So it actually worked pretty well for its era, I would say. Yeah. Yeah. So I was finding that the more you go into building something using LLMs, the more each bit of what you're doing ends up being a benchmarking problem. So had like this multistage algorithm thing, trying to figure out what the minimum viable model for each bit was, trying to optimize every bit of it as you build that out, right? Like you're trying to think about accuracy, a bunch of other metrics and performance and cost. And mostly just no one was doing anything to independently evaluate all the models. And certainly not to look at the trade-offs for speed and cost. So we basically set out just to build a thing that developers could look at to see the trade-offs between all of those things measured independently across all the models and providers. Honestly, it was probably meant to be a side project when we first started doing it.swyx [00:05:49]: Like we didn't like get together and say like, Hey, like we're going to stop working on all this stuff. I'm like, this is going to be our main thing. When I first called you, I think you hadn't decided on starting a company yet.Micah [00:05:58]: That's actually true. I don't even think we'd pause like, like George had an acquittance job. I didn't quit working on my legal AI thing. Like it was genuinely a side project.George [00:06:05]: We built it because we needed it as people building in the space and thought, Oh, other people might find it useful too. So we'll buy domain and link it to the Vercel deployment that we had and tweet about it. And, but very quickly it started getting attention. Thank you, Swyx for, I think doing an initial retweet and spotlighting it there. This project that we released. And then very quickly though, it was useful to others, but very quickly it became more useful as the number of models released accelerated. We had Mixtrel 8x7B and it was a key. That's a fun one. Yeah. Like a open source model that really changed the landscape and opened up people's eyes to other serverless inference providers and thinking about speed, thinking about cost. And so that was a key. And so it became more useful quite quickly. Yeah.swyx [00:07:02]: What I love talking to people like you who sit across the ecosystem is, well, I have theories about what people want, but you have data and that's obviously more relevant. But I want to stay on the origin story a little bit more. 
When you started out, I would say, I think the status quo at the time was every paper would come out and they would report their numbers versus competitor numbers. And that's basically it. And I remember I did the legwork. I think everyone has some knowledge. I think there's some version of Excel sheet or a Google sheet where you just like copy and paste the numbers from every paper and just post it up there. And then sometimes they don't line up because they're independently run. And so your numbers are going to look better than... Your reproductions of other people's numbers are going to look worse because you don't hold their models correctly or whatever the excuse is. I think then Stanford Helm, Percy Liang's project would also have some of these numbers. And I don't know if there's any other source that you can cite. The way that if I were to start artificial analysis at the same time you guys started, I would have used the Luther AI's eval framework harness. Yup.Micah [00:08:06]: Yup. That was some cool stuff. At the end of the day, running these evals, it's like if it's a simple Q&A eval, all you're doing is asking a list of questions and checking if the answers are right, which shouldn't be that crazy. But it turns out there are an enormous number of things that you've got control for. And I mean, back when we started the website. Yeah. Yeah. Like one of the reasons why we realized that we had to run the evals ourselves and couldn't just take rules from the labs was just that they would all prompt the models differently. And when you're competing over a few points, then you can pretty easily get- You can put the answer into the model. Yeah. That in the extreme. And like you get crazy cases like back when I'm Googled a Gemini 1.0 Ultra and needed a number that would say it was better than GPT-4 and like constructed, I think never published like chain of thought examples. 32 of them in every topic in MLU to run it, to get the score, like there are so many things that you- They never shipped Ultra, right? That's the one that never made it up. Not widely. Yeah. Yeah. Yeah. I mean, I'm sure it existed, but yeah. So we were pretty sure that we needed to run them ourselves and just run them in the same way across all the models. Yeah. And we were, we also did certain from the start that you couldn't look at those in isolation. You needed to look at them alongside the cost and performance stuff. Yeah.swyx [00:09:24]: Okay. A couple of technical questions. I mean, so obviously I also thought about this and I didn't do it because of cost. Yep. Did you not worry about costs? Were you funded already? Clearly not, but you know. No. Well, we definitely weren't at the start.Micah [00:09:36]: So like, I mean, we're paying for it personally at the start. There's a lot of money. Well, the numbers weren't nearly as bad a couple of years ago. So we certainly incurred some costs, but we were probably in the order of like hundreds of dollars of spend across all the benchmarking that we were doing. Yeah. So nothing. Yeah. It was like kind of fine. Yeah. Yeah. These days that's gone up an enormous amount for a bunch of reasons that we can talk about. But yeah, it wasn't that bad because you can also remember that like the number of models we were dealing with was hardly any and the complexity of the stuff that we wanted to do to evaluate them was a lot less. Like we were just asking some Q&A type questions and then one specific thing was for a lot of evals initially, we were just like sampling an answer. 
You know, like, what's the answer for this? Like, we didn't want to go into the answer directly without letting the models think. We weren't even doing chain of thought stuff initially. And that was the most useful way to get some results initially. Yeah.swyx [00:10:33]: And so for people who haven't done this work, literally parsing the responses is a whole thing, right? Like because sometimes the models, the models can answer any way they feel fit and sometimes they actually do have the right answer, but they just returned the wrong format and they will get a zero for that unless you work it into your parser. And that involves more work. And so, I mean, but there's an open question whether you should give it points for not following your instructions on the format.Micah [00:11:00]: It depends what you're looking at, right? Because you can, if you're trying to see whether or not it can solve a particular type of reasoning problem, and you don't want to test it on its ability to do answer formatting at the same time, then you might want to use an LLM as answer extractor approach to make sure that you get the answer out no matter how unanswered. But these days, it's mostly less of a problem. Like, if you instruct a model and give it examples of what the answers should look like, it can get the answers in your format, and then you can do, like, a simple regex.swyx [00:11:28]: Yeah, yeah. And then there's other questions around, I guess, sometimes if you have a multiple choice question, sometimes there's a bias towards the first answer, so you have to randomize the responses. All these nuances, like, once you dig into benchmarks, you're like, I don't know how anyone believes the numbers on all these things. It's so dark magic.Micah [00:11:47]: You've also got, like… You've got, like, the different degrees of variance in different benchmarks, right? Yeah. So, if you run four-question multi-choice on a modern reasoning model at the temperatures suggested by the labs for their own models, the variance that you can see on a four-question multi-choice eval is pretty enormous if you only do a single run of it and it has a small number of questions, especially. So, like, one of the things that we do is run an enormous number of all of our evals when we're developing new ones and doing upgrades to our intelligence index to bring in new things. Yeah. So, that we can dial in the right number of repeats so that we can get to the 95% confidence intervals that we're comfortable with so that when we pull that together, we can be confident in intelligence index to at least as tight as, like, a plus or minus one at a 95% confidence. Yeah.swyx [00:12:32]: And, again, that just adds a straight multiple to the cost. Oh, yeah. Yeah, yeah.George [00:12:37]: So, that's one of many reasons that cost has gone up a lot more than linearly over the last couple of years. We report a cost to run the artificial analysis. We report a cost to run the artificial analysis intelligence index on our website, and currently that's assuming one repeat in terms of how we report it because we want to reflect a bit about the weighting of the index. But our cost is actually a lot higher than what we report there because of the repeats.swyx [00:13:03]: Yeah, yeah, yeah. And probably this is true, but just checking, you don't have any special deals with the labs. They don't discount it. You just pay out of pocket or out of your sort of customer funds. Oh, there is a mix. 
So, the issue is that sometimes they may give you a special end point, which is… Ah, 100%.Micah [00:13:21]: Yeah, yeah, yeah. Exactly. So, we laser focus, like, on everything we do on having the best independent metrics and making sure that no one can manipulate them in any way. There are quite a lot of processes we've developed over the last couple of years to make that true for, like, the one you bring up, like, right here of the fact that if we're working with a lab, if they're giving us a private endpoint to evaluate a model, that it is totally possible. That what's sitting behind that black box is not the same as they serve on a public endpoint. We're very aware of that. We have what we call a mystery shopper policy. And so, and we're totally transparent with all the labs we work with about this, that we will register accounts not on our own domain and run both intelligence evals and performance benchmarks… Yeah, that's the job. …without them being able to identify it. And no one's ever had a problem with that. Because, like, a thing that turns out to actually be quite a good… …good factor in the industry is that they all want to believe that none of their competitors could manipulate what we're doing either.swyx [00:14:23]: That's true. I never thought about that. I've been in the database data industry prior, and there's a lot of shenanigans around benchmarking, right? So I'm just kind of going through the mental laundry list. Did I miss anything else in this category of shenanigans? Oh, potential shenanigans.Micah [00:14:36]: I mean, okay, the biggest one, like, that I'll bring up, like, is more of a conceptual one, actually, than, like, direct shenanigans. It's that the things that get measured become things that get targeted by labs that they're trying to build, right? Exactly. So that doesn't mean anything that we should really call shenanigans. Like, I'm not talking about training on test set. But if you know that you're going to be great at another particular thing, if you're a researcher, there are a whole bunch of things that you can do to try to get better at that thing that preferably are going to be helpful for a wide range of how actual users want to use the thing that you're building. But will not necessarily work. Will not necessarily do that. So, for instance, the models are exceptional now at answering competition maths problems. There is some relevance of that type of reasoning, that type of work, to, like, how we might use modern coding agents and stuff. But it's clearly not one for one. So the thing that we have to be aware of is that once an eval becomes the thing that everyone's looking at, scores can get better on it without there being a reflection of overall generalized intelligence of these models. Getting better. That has been true for the last couple of years. It'll be true for the next couple of years. There's no silver bullet to defeat that other than building new stuff to stay relevant and measure the capabilities that matter most to real users. Yeah.swyx [00:15:58]: And we'll cover some of the new stuff that you guys are building as well, which is cool. Like, you used to just run other people's evals, but now you're coming up with your own. And I think, obviously, that is a necessary path once you're at the frontier. You've exhausted all the existing evals. I think the next point in history that I have for you is AI Grant that you guys decided to join and move here. What was it like? I think you were in, like, batch two? Batch four. Batch four. 
Okay.Micah [00:16:26]: I mean, it was great. Nat and Daniel are obviously great. And it's a really cool group of companies that we were in AI Grant alongside. It was really great to get Nat and Daniel on board. Obviously, they've done a whole lot of great work in the space with a lot of leading companies and were extremely aligned. With the mission of what we were trying to do. Like, we're not quite typical of, like, a lot of the other AI startups that they've invested in.swyx [00:16:53]: And they were very much here for the mission of what we want to do. Did they say any advice that really affected you in some way or, like, were one of the events very impactful? That's an interesting question.Micah [00:17:03]: I mean, I remember fondly a bunch of the speakers who came and did fireside chats at AI Grant.swyx [00:17:09]: Which is also, like, a crazy list. Yeah.George [00:17:11]: Oh, totally. Yeah, yeah, yeah. There was something about, you know, speaking to Nat and Daniel about the challenges of working through a startup and just working through the questions that don't have, like, clear answers and how to work through those kind of methodically and just, like, work through the hard decisions. And they've been great mentors to us as we've built artificial analysis. Another benefit for us was that other companies in the batch and other companies in AI Grant are pushing the capabilities. Yeah. And I think that's a big part of what AI can do at this time. And so being in contact with them, making sure that artificial analysis is useful to them has been fantastic for supporting us in working out how should we build out artificial analysis to continue to being useful to those, like, you know, building on AI.swyx [00:17:59]: I think to some extent, I'm mixed opinion on that one because to some extent, your target audience is not people in AI Grants who are obviously at the frontier. Yeah. Do you disagree?Micah [00:18:09]: To some extent. To some extent. But then, so a lot of what the AI Grant companies are doing is taking capabilities coming out of the labs and trying to push the limits of what they can do across the entire stack for building great applications, which actually makes some of them pretty archetypical power users of artificial analysis. Some of the people with the strongest opinions about what we're doing well and what we're not doing well and what they want to see next from us. Yeah. Yeah. Because when you're building any kind of AI application now, chances are you're using a whole bunch of different models. You're maybe switching reasonably frequently for different models and different parts of your application to optimize what you're able to do with them at an accuracy level and to get better speed and cost characteristics. So for many of them, no, they're like not commercial customers of ours, like we don't charge for all our data on the website. Yeah. They are absolutely some of our power users.swyx [00:19:07]: So let's talk about just the evals as well. So you start out from the general like MMU and GPQA stuff. What's next? How do you sort of build up to the overall index? What was in V1 and how did you evolve it? Okay.Micah [00:19:22]: So first, just like background, like we're talking about the artificial analysis intelligence index, which is our synthesis metric that we pulled together currently from 10 different eval data sets to give what? We're pretty much the same as that. Pretty confident is the best single number to look at for how smart the models are. 
Obviously, it doesn't tell the whole story. That's why we published the whole website of all the charts to dive into every part of it and look at the trade-offs. But best single number. So right now, it's got a bunch of Q&A type data sets that have been very important to the industry, like a couple that you just mentioned. It's also got a couple of agentic data sets. It's got our own long context reasoning data set and some other use case focused stuff. As time goes on, the things that we're most interested in, that are going to be important to the capabilities that are becoming more important for AI and what developers are caring about, are going to be first around agentic capabilities. So, surprise, surprise, we're all loving our coding agents, and how the models perform on that and then on doing similar things for different types of work is really important to us. Linking to economically valuable use cases is extremely important to us. And then we've got some of these things that the models still struggle with, like working really well over long contexts, that are not going to go away as specific capabilities and use cases that we need to keep evaluating. swyx [00:20:46]: But I guess one thing I was driving at was like the V1 versus the V2 and how it changed over time. Micah [00:20:53]: Like how we've changed the index to where we are. swyx [00:20:55]: And I think that reflects on the change in the industry. Right. So that's a nice way to tell that story. Micah [00:21:00]: Well, V1 would be completely saturated right now by almost every model coming out, because doing things like writing the Python functions in HumanEval is now pretty trivial. It's easy to forget, actually, I think how much progress has been made in the last two years. Like we obviously play the game constantly of like the today's version versus last week's version and the week before and all of the small changes in the horse race between the current frontier and who has the best like smaller than 10B model like right now this week. Right. And that's very important to a lot of developers and people and especially in this particular city of San Francisco. But when you zoom out a couple of years ago, literally most of what we were doing to evaluate the models then would all be 100% solved by even pretty small models today. And that's been one of the key things, by the way, that's driven down the cost of intelligence at every tier of intelligence. We can talk about that more in a bit. So V1, V2, V3, we made things harder. We covered a wider range of use cases. And we tried to get closer to things developers care about as opposed to like just the Q&A type stuff that MMLU and GPQA represented. Yeah. swyx [00:22:12]: I don't know if you have anything to add there. Or we could just go right into showing people the benchmark and like looking around and asking questions about it. Yeah. Micah [00:22:21]: Let's do it. Okay. This would be a pretty good way to chat about a few of the new things we've launched recently. Yeah. George [00:22:26]: And I think a little bit about the direction that we want to take it. And we want to push benchmarks. Currently, the intelligence index and evals focus a lot on kind of raw intelligence. But we kind of want to diversify how we think about intelligence. And we can talk about it. But kind of new evals that we've kind of built and partnered on focus on topics like hallucination. And we've got a lot of topics that I think are not covered by the current eval set that should be.
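A composite index like the one described here can be as simple as a weighted average of per-eval scores. The sketch below is a minimal illustration that assumes each component eval is already normalized to 0-100 and weighted equally; the benchmark names in the example are placeholders, and the actual component list and weighting of the Artificial Analysis Intelligence Index may differ.

```python
def intelligence_index(eval_scores: dict[str, float],
                       weights: dict[str, float] | None = None) -> float:
    """Combine per-eval scores (assumed normalized to 0-100) into one composite number."""
    if weights is None:
        weights = {name: 1.0 for name in eval_scores}  # equal weighting is an assumption
    total_weight = sum(weights[name] for name in eval_scores)
    return sum(score * weights[name] for name, score in eval_scores.items()) / total_weight

# Hypothetical scores for one model across three component evals:
print(round(intelligence_index({"MMLU-Pro": 78.2, "GPQA Diamond": 61.0, "LiveCodeBench": 55.4}), 1))
```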
And so we want to bring that forth. But before we get into that. swyx [00:23:01]: And so for listeners, just as a timestamp, right now, number one is Gemini 3 Pro High. Then followed by Claude Opus at 70. GPT-5.1 High. You don't have 5.2 yet. And Kimi K2 Thinking. Wow. Still hanging in there. So those are the top four. That will date this podcast quickly. Yeah. Yeah. I mean, I love it. I love it. No, no. 100%. Look back this time next year and go, how cute. Yep. George [00:23:25]: Totally. A quick view of that is, okay, there's a lot. I love it. I love this chart. Yeah. Micah [00:23:30]: This is such a favorite, right? Yeah. And almost every talk that George or I give at conferences and stuff, we always put this one up first to just talk about situating where we are in this moment in history. This, I think, is the visual version of what I was saying before about the zooming out and remembering how much progress there's been. If we go back to just over a year ago, before o1, before Claude Sonnet 3.5, we didn't have reasoning models or coding agents as a thing. And the game was very, very different. If we go back even a little bit before then, we're in the era where, when you look at this chart, OpenAI was untouchable for well over a year. And, I mean, you would remember that time period well of there being very open questions about whether or not AI was going to be competitive, like full stop, whether or not OpenAI would just run away with it, whether we would have a few frontier labs and no one else would really be able to do anything other than consume their APIs. I am quite happy overall that the world that we have ended up in is one where... Multi-model. Absolutely. And strictly more competitive every quarter over the last few years. Yeah. This year has been insane. Yeah. George [00:24:42]: You can see it. This chart with everything added is hard to read currently. There's so many dots on it, but I think it reflects a little bit what we felt, like how crazy it's been. swyx [00:24:54]: Why 14 as the default? Is that a manual choice? Because you've got ServiceNow in there, which is a less traditional name. Yeah. George [00:25:01]: It's models that we're kind of highlighting by default in our charts, in our intelligence index. Okay. swyx [00:25:07]: You just have a manually curated list of stuff. George [00:25:10]: Yeah, that's right. But something that I actually don't think every Artificial Analysis user knows is that you can customize our charts and choose what models are highlighted. Yeah. And so if we take off a few names, it gets a little easier to read. swyx [00:25:25]: Yeah, yeah. A little easier to read. Totally. Yeah. But I love that you can see the o1 jump. Look at that. September 2024. And the DeepSeek jump. Yeah. George [00:25:34]: Which got close to OpenAI's leadership. They were so close. I think, yeah, we remember that moment. Around this time last year, actually. Micah [00:25:44]: Yeah, yeah, yeah. I agree. Yeah, well, a couple of weeks. It was Boxing Day in New Zealand when DeepSeek V3 came out. And we'd been tracking DeepSeek and a bunch of the other global players that were less known over the second half of 2024 and had run evals on the earlier ones and stuff. I very distinctly remember Boxing Day in New Zealand, because I was with family for Christmas and stuff, running the evals and getting back result by result on DeepSeek V3. So this was the first of their V3 architecture, the 671B MoE. Micah [00:26:19]: And we were very, very impressed.
That was the moment where we were sure that DeepSeek was no longer just one of many players, but had jumped up to be a thing. The world really noticed when they followed that up with the RL working on top of V3 and R1 succeeding a few weeks later. But the groundwork for that absolutely was laid with a just extremely strong base model, completely open weights, that we had as the best open weights model. So, yeah, that's the thing that you really see in the graph. But I think that it got a lot of us on Boxing Day last year. George [00:26:48]: Boxing Day is the day after Christmas for those not familiar. George [00:26:54]: I'm from Singapore. swyx [00:26:55]: A lot of us remember Boxing Day for a different reason, for the tsunami that happened. Oh, of course. Yeah, but that was a long time ago. So yeah. So this is the rough pitch of AAQI. Is it A-A-Q-I or A-A-I-I? I-I. Okay. Good memory, though. Micah [00:27:11]: I don't know. I'm not used to it. Once upon a time, we did call it Quality Index, and we would talk about quality, performance, and price, but we changed it to intelligence. George [00:27:20]: There's been a few naming changes. We added hardware benchmarking to the site, and so benchmarks at a kind of system level. And so then we changed our throughput metric to, we now call it output speed, and then swyx [00:27:32]: throughput makes sense at a system level, so we took that name. Take me through more charts. What should people know? Obviously, the way you look at the site is probably different than how a beginner might look at it. Micah [00:27:42]: Yeah, that's fair. There's a lot of fun stuff to dive into. Maybe so we can hit past all the, like, we have lots and lots of evals and stuff. The interesting ones to talk about today that would be great to bring up are a few of our recent things, I think, that probably not many people will be familiar with yet. So first one of those is our omniscience index. So this one is a little bit different to most of the intelligence evals that we've run. We built it specifically to look at the embedded knowledge in the models and to test hallucination by looking at, when the model doesn't know the answer, so is not able to get it correct, what's its probability of saying, I don't know, or giving an incorrect answer. So the metric that we use for omniscience goes from negative 100 to positive 100, because we're simply taking off a point if you give an incorrect answer to the question. We're pretty convinced that this is an example of where it makes most sense to do that, because it's strictly more helpful to say, I don't know, instead of giving a wrong answer to a factual knowledge question. And one of our goals is to shift the incentive that evals create for models and the labs creating them to get higher scores. And almost every eval across all of AI up until this point has been graded by simple percentage correct as the main metric, the main thing that gets hyped. And so you should take a shot at everything. There's no incentive to say, I don't know. So we did that for this one here. swyx [00:29:22]: I think there's a general field of calibration as well, like the confidence in your answer versus the rightness of the answer. Yeah, we completely agree. Yeah. Yeah. George [00:29:31]: On that. And one reason that we didn't do that, or put that into this index, is that we think that the way to do that is not to ask the models how confident they are. swyx [00:29:43]: I don't know. Maybe it might be though.
You put it like a JSON field, say, say confidence and maybe it spits out something. Yeah. You know, we have done a few evals podcasts over the, over the years. And when we did one with Clementine of Hugging Face, who maintains the open source leaderboard, and this was one of her top requests, which is some kind of hallucination slash lack of confidence calibration thing. And so, hey, this is one of them. Micah [00:30:05]: And I mean, like anything that we do, it's not a perfect metric or the whole story of everything that you think about as hallucination. But yeah, it's pretty useful and has some interesting results. Like one of the things that we saw in the hallucination rate is that Anthropic's Claude models are at the very left-hand side here with the lowest hallucination rates out of the models that we've evaluated omniscience on. That is an interesting fact. I think it probably correlates with a lot of the previously, not really measured vibes stuff that people like about some of the Claude models. Is the dataset public, or is there a held-out set? There's a held-out set for this one. So we, we have published a public test set, but we've only published 10% of it. The reason is that for this one here specifically, it would be very, very easy to like have data contamination because it is just factual knowledge questions. We will update it over time to also prevent that, but yeah, we kept most of it held out so that we can keep it reliable for a long time. It lets us do a bunch of really cool things, including breaking down quite granularly by topic. And so we've got some of that disclosed on the website publicly right now, and there's lots more coming in terms of our ability to break out very specific topics. Yeah. swyx [00:31:23]: I would be interested. Let's, let's dwell a little bit on this hallucination one. I noticed that Haiku hallucinates less than Sonnet, hallucinates less than Opus. And yeah. Would that be the other way around in a normal capability environment? I don't know. What's, what do you make of that? George [00:31:37]: One interesting aspect is that we've found that there's not really a, not a strong correlation between intelligence and hallucination, right? That's to say that the smarter the models are in a general sense isn't correlated with their ability to, when they don't know something, say that they don't know. It's interesting that Gemini 3 Pro Preview was a big leap over Gemini 2.5 Flash and 2.5 Pro, but, and if I add Pro quickly here. swyx [00:32:07]: I bet Pro's really good. Uh, actually no, I meant, I meant, uh, the GPT Pros. George [00:32:12]: Oh yeah. swyx [00:32:13]: Cause GPT Pro is rumored, we don't know for a fact, to be like eight runs and then with the LLM judge on top. Yeah. George [00:32:20]: So we saw a big jump in, this is accuracy. So this is just the percent that they get correct, and Gemini 3 Pro knew a lot more than the other models. And so big jump in accuracy. But relatively no change between the Google Gemini models, between releases. And the hallucination rate. Exactly. And so it's likely due to just kind of different post-training recipes compared to the Claude models. Yeah. Micah [00:32:45]: Um, that's, that's driven this. Yeah. You can, uh, you can partially blame us and how we define intelligence, having until now not defined hallucination as a negative in the way that we think about intelligence. swyx [00:32:56]: And so that's what we're changing.
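A rough sketch of the omniscience scoring and hallucination-rate logic described above, assuming one point for a correct answer, minus one for an incorrect answer, zero for declining, scaled to the -100 to +100 range, and with hallucination rate taken as the share of not-correct questions that were answered wrongly rather than declined; the exact weighting and normalization here are assumptions rather than the published spec:

```python
from collections import Counter

def omniscience_scores(outcomes: list[str]) -> tuple[float, float]:
    """outcomes: one of "correct", "incorrect", or "abstain" (said "I don't know") per question."""
    counts = Counter(outcomes)
    total = len(outcomes)
    # Index: +1 per correct, -1 per incorrect, 0 per abstention, scaled to -100..+100.
    index = 100.0 * (counts["correct"] - counts["incorrect"]) / total
    # Hallucination rate: of the questions the model did NOT get right,
    # how often did it guess wrong instead of saying "I don't know"?
    not_correct = counts["incorrect"] + counts["abstain"]
    hallucination_rate = counts["incorrect"] / not_correct if not_correct else 0.0
    return index, hallucination_rate

# Hypothetical run: 48 correct, 12 wrong guesses, 40 abstentions.
print(omniscience_scores(["correct"] * 48 + ["incorrect"] * 12 + ["abstain"] * 40))
```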
Uh, I know many smart people who are confidently incorrect. George [00:33:02]: Uh, look, look at that. That is very human. Very true. And there's a time and a place for that. I think our view is that hallucination rate makes sense in this context where it's around knowledge, but in many cases, people want the models to hallucinate, to have a go. Often that's the case in coding or when you're trying to generate newer ideas. One eval that we added to Artificial Analysis is Critical Point, and it's really hard physics problems. Okay. swyx [00:33:32]: And is it sort of like a human eval type or something different, or like a FrontierMath type? George [00:33:37]: It's not dissimilar to FrontierMath. So these are kind of research questions that kind of academics in the physics world would be able to answer, but models really struggle to answer. So the top score here is only about 9%. swyx [00:33:51]: And the people that created this, like Minway, and actually Ofir, who was kind of behind SWE-bench... and what organization is this? Oh, is this, it's Princeton. George [00:34:01]: A kind of range of academics from, from, uh, different academic institutions, really smart people. They talked about how they turn the models up in terms of temperature, as high a temperature as they can, when they're trying to explore kind of new ideas in physics with a thought partner, just because they, they want the models to hallucinate. Um, yeah, sometimes it's something new. Yeah, exactly. swyx [00:34:21]: Um, so not right in every situation, but, um, I think it makes sense, you know, to test hallucination in scenarios where it makes sense. Also, the obvious question is, uh, this is one of many that are out there; every lab has a system card that shows some kind of hallucination number, and you've chosen to not, uh, endorse that and you've made your own. And I think that's a, that's a choice. Um, totally in some sense, the rest of Artificial Analysis is public benchmarks that other people can independently rerun. You provide it as a service here. You have to fight the, well, who are we to, to like do this? And your, your answer is that we have a lot of customers and, you know, but like, I guess, how do you convince the individual? Micah [00:35:08]: I mean, I think, I think for hallucinations specifically, there are a bunch of different things that you might care about reasonably, and that you'd measure quite differently, like we've called this an omniscience hallucination rate, not trying to declare that, like, it's humanity's last hallucination. You could, uh, you could have some interesting naming conventions and all this stuff. Um, the biggest picture answer to that is something that I actually wanted to mention just as George was explaining Critical Point as well: so as we go forward, we are building evals internally. We're partnering with academia and partnering with AI companies to build great evals. We have pretty strong views, in various ways for different parts of the AI stack, on where there are things that are not being measured well, or things that developers care about that should be measured more and better. And we intend to be doing that. We're not obsessed necessarily with the idea that everything we do, we have to do entirely within our own team. Critical Point is a cool example of where we were a launch partner for it, working with academia, and we've got some partnerships coming up with a couple of leading companies.
Those ones, obviously we have to be careful with on some of the independent stuff, but with the right disclosure, like we're completely comfortable with that. A lot of the labs have released great data sets in the past that we've used to great success independently. And so it's between all of those techniques, we're going to be releasing more stuff in the future. Cool. swyx [00:36:26]: Let's cover the last couple. And then we'll, I want to talk about your trends analysis stuff, you know? Totally. Micah [00:36:31]: So that actually, I have one like little factoid on omniscience. If you go back up to accuracy on omniscience, an interesting thing about this accuracy metric is that it tracks, more closely than anything else that we measure, the total parameter count of models. Makes a lot of sense intuitively, right? Because this is a knowledge eval. This is the pure knowledge metric. We're not looking at the index and the hallucination rate stuff that we think is much more about how the models are trained. This is just what facts did they recall? And yeah, it tracks parameter count extremely closely. Okay. swyx [00:37:05]: What's the rumored size of Gemini 3 Pro? And to be clear, not confirmed by any official source, just rumors. But rumors do fly around. Rumors. I get, I hear all sorts of numbers. I don't know what to trust. Micah [00:37:17]: So if you, if you draw the line on omniscience accuracy versus total parameters, we've got all the open weights models, and you can squint and see that likely the leading frontier models right now are quite a lot bigger than the one trillion parameters that the open weights models we're looking at here cap out at. There's an interesting extra data point that Elon Musk revealed recently about xAI: three trillion parameters for Grok 3 and 4, six trillion for Grok 5, but that's not out yet. Take those together, have a look. You might reasonably form a view that there's a pretty good chance that Gemini 3 Pro is bigger than that, that it could be in the 5 to 10 trillion parameter range. To be clear, I have absolutely no idea, but just based on this chart, like that's where you would land if you have a look at it. Yeah. swyx [00:38:07]: And to some extent, I actually kind of discourage people from guessing too much because what does it really matter? Like as long as they can serve it at a sustainable cost, that's about it. Like, yeah, totally. George [00:38:17]: They've also got different incentives in play compared to like open weights models who are thinking about supporting others in self-deployment. For the labs who are doing inference at scale, it's, I think, less about total parameters in many cases when thinking about inference costs, and more around the number of active parameters. And so there's a bit of an incentive towards larger, sparser models. Agreed. Micah [00:38:38]: Understood. Yeah. Great. I mean, obviously if you're a developer or company using these things, exactly as you say, it doesn't matter. You should be looking at all the different ways that we measure intelligence. You should be looking at the cost to run the index and the different ways of thinking about token efficiency and cost efficiency based on the list prices, because that's what matters. swyx [00:38:56]: It's not as good for the content creator rumor mill where I can say, oh, GPT-4 is this small circle. Look at GPT-5, it's this big circle. And then there used to be a thing for a while.
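The line-drawing exercise Micah describes (accuracy against log parameter count, then reading off where a frontier model would sit) can be sketched in a few lines. Every number below is invented for illustration; none of these are Artificial Analysis figures.

```python
import numpy as np

# Invented (parameter count, omniscience accuracy %) points standing in for open weights models.
params = np.array([32e9, 120e9, 405e9, 671e9, 1000e9])
accuracy = np.array([18.0, 24.0, 31.0, 34.0, 37.0])

# Fit accuracy as a linear function of log10(total parameters).
slope, intercept = np.polyfit(np.log10(params), accuracy, 1)

def implied_params(acc: float) -> float:
    """Invert the fit: what total parameter count would put a model on the trend line?"""
    return 10 ** ((acc - intercept) / slope)

# If a frontier model scored, say, 50% accuracy, this made-up trend would imply roughly:
print(f"{implied_params(50.0):.2e} parameters")
```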
Yeah. Micah [00:39:07]: But that is, like, on its own, actually a very interesting one, right? That is, it's just purely that chances are the last couple of years haven't seen a dramatic scaling up in the total size of these models. And so there's a lot of room to go up properly in total size of the models, especially with the upcoming hardware generations. Yes. swyx [00:39:29]: So, you know. Taking off my shitposting face for a minute. Yes. Yes. At the same time, I do feel like, you know, especially coming back from Europe, people do feel like Ilya is probably right that the paradigm doesn't have many more orders of magnitude to scale out, and therefore we need to start exploring at least a different path. GDPval, I think, is only like a month or so old. I was also very positive when it first came out. I actually talked to Tejal, who was the lead researcher on that. Oh, cool. And you have your own version. George [00:39:59]: It's a fantastic data set. Yeah. swyx [00:40:01]: And maybe we'll recap for people who are still out of it. It's like 44 tasks based on some kind of GDP cutoff that's like meant to represent broad white collar work that is not just coding. Yeah. Micah [00:40:12]: Each of the tasks has a whole bunch of detailed instructions, some input files for a lot of them. Within the 44, it's divided into like 220 to 225, maybe, subtasks that are the level that we run through the agentic harness. And yeah, they're really interesting. I will say that it doesn't necessarily capture like all the stuff that people do at work. No eval is perfect; there are always going to be more things to look at, largely because in order to make the tasks well enough defined that you can run them, they need to only have a handful of input files and very specific instructions for that task. And so I think the easiest way to think about them is that they're like quite hard take-home exam tasks that you might do in an interview process. swyx [00:40:56]: Yeah, for listeners, it is no longer like a long prompt. It is like, well, here's a zip file with like a spreadsheet or a PowerPoint deck or a PDF, and go nuts and answer this question. George [00:41:06]: OpenAI released a great data set and they released a good paper which looks at performance across the different web chatbots on the data set. It's a great paper, encourage people to read it. What we've done is taken that data set and turned it into an eval that can be run on any model. So we created a reference agentic harness that can run the models on the data set, and then we developed an evaluator approach to compare outputs. That's kind of AI enabled, so it uses Gemini 3 Pro Preview to compare results, which we tested pretty comprehensively to ensure that it's aligned to human preferences. One data point there is that even with Gemini 3 Pro as the evaluator, Gemini 3 Pro, interestingly, doesn't actually do that well. So that's kind of a good example of what we've done in GDPval AA. swyx [00:42:01]: Yeah, the thing that you have to watch out for with an LLM judge is self-preference, that models usually prefer their own output, and in this case, it was not.
Totally. Micah [00:42:08]: I think the way that we're thinking about the places where it makes sense to use an LLM as judge approach now is quite different to some of the early LLM as judge stuff a couple of years ago, because some of that, and MT-Bench was a great project that was a good example of some of this a while ago, was about judging conversations and like a lot of style type stuff. Here, the task that the grader, the grading model, is doing is quite different to the task of taking the test. When you're taking the test, you've got all of the agentic tools you're working with, the code interpreter and web search, the file system, to go through many, many turns to try to create the documents. Then on the other side, when we're grading it, we're running it through a pipeline to extract visual and text versions of the files and be able to provide that to Gemini, and we're providing the criteria for the task and getting it to pick which one more effectively meets the criteria of the task. Yeah. So we've got it picking out of two potential outputs. It turns out that we proved that it's just very, very good at getting that right, matched with human preference a lot of the time, because I think it's got the raw intelligence, but it's combined with the correct representation of the outputs, the fact that the outputs were created with an agentic task that is quite different to the way the grading model works, and we're comparing it against criteria, not just kind of zero shot trying to ask the model to pick which one is better. swyx [00:43:26]: Got it. Why is this an Elo and not a percentage, like GDPval? George [00:43:31]: So the outputs look like documents, and there's video outputs or audio outputs from some of the tasks. It has to make a video? Yeah, for some of the tasks. Some of the tasks. swyx [00:43:43]: What task is that? George [00:43:45]: I mean, it's in the data set. Like be a YouTuber? It's a marketing video. Micah [00:43:49]: Oh, wow. What? Like the model has to go find clips on the internet and try to put it together. The models are not that good at doing that one, for now, to be clear. It's pretty hard to do that with a code editor. I mean, the computer use stuff doesn't work quite well enough and so on and so on, but yeah. George [00:44:02]: And so there's no kind of ground truth, necessarily, to compare against, to work out percentage correct. It's hard to come up with correct or incorrect there. And so it's on a relative basis. And so we use an Elo approach to compare outputs from each of the models across the tasks. swyx [00:44:23]: You know what you should do? You should pay a contractor, a human, to do the same task. And then give it an Elo, and then you have a human in there. It's just, I think what's helpful about GDPval, the OpenAI one, is that 50% is meant to be normal human, and maybe Domain Expert is higher than that, but 50% was the bar for like, well, if you've crossed 50, you are superhuman. Yeah. Micah [00:44:47]: So we like, haven't grounded this score in that exactly. I agree that it can be helpful, but we wanted to generalize this to a very large number of models. It's one of the reasons that presenting it as an Elo is quite helpful and allows us to add models, and it'll stay relevant for quite a long time. I also think it, it can be tricky looking at these exact tasks compared to the human performance, because the way that you would go about it as a human is quite different to how the models would go about it.
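Turning pairwise judge verdicts into a relative rating works like a chess ladder. A minimal sketch is below; the starting rating and K-factor are arbitrary choices for illustration, not the values used for GDPval AA.

```python
def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one pairwise comparison of two models' outputs."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_won else 0.0
    return r_a + k * (score_a - expected_a), r_b + k * ((1.0 - score_a) - (1.0 - expected_a))

def rate_models(comparisons: list[tuple[str, str, str]], start: float = 1000.0) -> dict[str, float]:
    """comparisons: (model_a, model_b, winner) triples, e.g. from an LLM judge that picks
    which of two outputs better meets the task criteria."""
    ratings: dict[str, float] = {}
    for a, b, winner in comparisons:
        ra, rb = ratings.setdefault(a, start), ratings.setdefault(b, start)
        ratings[a], ratings[b] = update_elo(ra, rb, winner == a)
    return ratings

# Hypothetical verdicts from the grading model:
print(rate_models([("model-x", "model-y", "model-x"), ("model-y", "model-z", "model-y")]))
```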
Yeah. swyx [00:45:15]: I also liked that you included Llama 4 Maverick in there. Is that like just one last, like... Micah [00:45:20]: Well, no, no, no, it is the, it is the best model released by Meta. And... So it makes it into the homepage default set, still for now. George [00:45:31]: Another inclusion that's quite interesting is we also ran it across the latest versions of the web chatbots. And so we have... swyx [00:45:39]: Oh, that's right. George [00:45:40]: Oh, sorry. swyx [00:45:41]: I, yeah, I completely missed that. Okay. George [00:45:43]: No, not at all. So that, which has a checkered pattern. So that is their harness, not yours, is what you're saying. Exactly. And what's really interesting is that if you compare, for instance, Claude 4.5 Opus using the Claude web chatbot, it performs worse than the model in our agentic harness. And so in every case, the model performs better in our agentic harness than its web chatbot counterpart, the harness that they created. swyx [00:46:13]: Oh, my backwards explanation for that would be that, well, it's meant for consumer use cases and here you're pushing it for something. Micah [00:46:19]: The constraints are different and the amount of freedom that you can give the model is different. Also, you like have a cost goal. We let the models work as long as they want, basically. Yeah. Do you copy paste manually into the chatbot? Yeah. Yeah. That's, that was how we got the chatbot reference. We're not going to be keeping those updated at like quite the same scale as hundreds of models. swyx [00:46:38]: Well, so I don't know, talk to Browserbase. They'll, they'll automate it for you. You know, like I have thought about like, well, we should turn these chatbot versions into an API, because they are legitimately different agents in themselves. Yes. Right. Yeah. Micah [00:46:53]: And that's grown a huge amount over the last year, right? Like the tools, the tools that are available have actually diverged, in my opinion, a fair bit across the major chatbot apps, and the amount of data sources that you can connect them to has gone up a lot, meaning that your experience and the way you're using the model is more different than ever. swyx [00:47:10]: What tools and what data connections come to mind when you say what's interesting, what's notable work that people have done? Micah [00:47:15]: Oh, okay. So my favorite example on this is that until very recently, I would argue that it was basically impossible to get an LLM to draft an email for me in any useful way. Because most times that you're sending an email, you're not just writing something for the sake of writing it. Chances are the context required is a whole bunch of historical emails. Maybe it's notes that you've made, maybe it's meeting notes, maybe it's, um, pulling something from, um, any of like wherever you store stuff at work. So for me, like Google Drive, OneDrive, um, our Supabase databases if we need to do some analysis or some data or something. Preferably the model can be plugged into all of those things and can go do some useful work based on it. The things that like I find most impressive currently, that I am somewhat surprised work really well in late 2025, uh, are that I can have models use the Supabase MCP to query, read only of course, run a whole bunch of SQL queries to do pretty significant data analysis, and make charts and stuff, and can read my Gmail and my Notion. And okay. You actually use that. That's good. That's, that's, that's good. Is that a Claude thing?
To varying degrees, but ChatGPT and Claude right now. I would say that this stuff, like, barely works in fairness right now. Like. George [00:48:33]: Because people are actually going to try this after they hear it. If you get an email from Micah, odds are it wasn't written by a chatbot. Micah [00:48:38]: So, yeah, I think it is true that I have never actually sent anyone an email drafted by a chatbot. Yet. swyx [00:48:46]: Um, and so you can, you can feel it, right. And yeah, this time, this time next year, we'll come back and see where it's going. Totally. Um, Supabase shout out, another famous Kiwi. Uh, I don't know if you've, you've had any conversations with him about anything in particular on AI building and AI infra. George [00:49:03]: We have had, uh, Twitter DMs, um, with, with him, because we're quite big, uh, Supabase users and power users. And we probably do some things more manually than we should in, in the Supabase support line, because they're, you know, a little bit being super friendly. One extra, um, point regarding, um, GDPval AA is that on the basis of the overperformance of the models compared to the chatbots, it turns out, we realized that, oh, like, our reference harness that we built actually works quite well on, like, generalist agentic tasks. This proves it in a sense. And so the agent harness is very minimalist. I think it follows some of the ideas that are in Claude Code, and all that we give it is context management capabilities, a web search and web browsing tool, and a code execution environment. Anything else? Micah [00:50:02]: I mean, we can equip it with more tools, but like by default, yeah, that's it. We, we give it, for GDPval, a tool to, uh, view an image specifically, um, because the models, you know, can just use a terminal to pull stuff in text form into context. But to pull visual stuff into context, we had to give them a custom tool, but yeah, exactly. Um, you, you can explain, you're the expert. No. George [00:50:21]: So it, it turned out that we created a good generalist agentic harness. And so we, um, released that on, on GitHub yesterday. It's called Stirrup. So if people want to check it out, it's a great, um, you know, base for, you know, building a generalist agent for more specific tasks. Micah [00:50:39]: I'd say the best way to use it is git clone and then have your favorite coding agent make changes to it, to do whatever you want, because it's not that many lines of code and the coding agents can work with it super well. swyx [00:50:51]: Well, that's nice for the community to explore and share and hack on it. I think maybe in, in, in other similar environments, the Terminal-Bench guys have done, uh, sort of the Harbor. Uh, and so it's, it's a, it's a bundle of, well, we need our minimal harness, which for them is Terminus, and we also need the RL environments or Docker deployment thing to, to run independently. So I don't know if you've looked at it. I don't know if you've looked at the Harbor at all, is that, is that like a, a standard that people want to adopt? George [00:51:19]: Yeah, we've looked at it from an evals perspective, and we love Terminal-Bench and host benchmarks of, of Terminal-Bench on Artificial Analysis. Um, we've looked at it from a, from a coding agent perspective, but could see it being a great, um, basis for any kind of agents. I think where we're getting to is that these models have gotten smart enough.
They've gotten better, better tools, and they can perform better when just given a minimalist set of tools and, and let them run; let the model control the, the agentic workflow rather than using another framework that's a bit more built out that tries to dictate the flow. Awesome. swyx [00:51:56]: Let's cover the openness index and then let's go into the report stuff. Uh, so that's the, that's the last of the proprietary AA numbers, I guess. I don't know how you sort of classify all these. Yeah. Micah [00:52:07]: Or call it, call it, let's call it the last of like the, the three new things that we're talking about from like the last few weeks. Um, cause I mean, we do a mix of stuff where we're using open source, where we open source what we do, and, um, proprietary stuff that we don't always open source; like the long context reasoning data set last year, we did open source. Um, and then all of the work on performance benchmarks across the site, some of them we're looking to open source, but some of them, like, we're constantly iterating on and so on and so on. So there's a huge mix, I would say, just of like stuff that is open source and not across the site. So that's LCR for people. Yeah, yeah, yeah, yeah. swyx [00:52:41]: Uh, but let's, let's, let's talk about open. Micah [00:52:42]: Let's talk about the openness index. This here is, call it, like a new way to think about how open models are. We, for a long time, have tracked whether the models are open weights and what the licenses on them are. And that's like pretty useful. That tells you what you're allowed to do with the weights of a model, but there is this whole other dimension to how open models are that is pretty important that we haven't tracked until now. And that's how much is disclosed about how it was made. So transparency about data, pre-training data and post-training data, and whether you're allowed to use that data, and transparency about methodology and training code. So basically, those are the components. We bring them together to score an openness index for models so that you can in one place get this full picture of how open models are. swyx [00:53:32]: I feel like I've seen a couple other people try to do this, but they're not maintained. I do think this does matter. I don't know what the numbers mean apart from is there a max number? Is this out of 20? George [00:53:44]: It's out of 18 currently, and so we've got an openness index page, but essentially these are points, you get points for being more open across these different categories and the maximum you can achieve is 18. So AI2, with their extremely open Olmo 3 32B Think model, is the leader in a sense. swyx [00:54:04]: It's Hugging Face. George [00:54:05]: Oh, with their smaller model. It's coming soon. I think we need to run, we need to get the intelligence benchmarks right to get it on the site. swyx [00:54:12]: You can't have an openness index... We can't not include Hugging Face. We love Hugging Face. We'll have that, we'll have that up very soon. I mean, you know, RefinedWeb and all that stuff. It's, it's amazing. Or is it called FineWeb? FineWeb. FineWeb. Micah [00:54:23]: Yeah, yeah, no, totally. Yep. One of the reasons this is cool, right, is that if you're trying to understand the holistic picture of the models and what you can do with all the stuff the company's contributing, this gives you that picture. And so we are going to keep it up to date alongside all the models that we do intelligence index on, on the site.
And it's just an extra view to understand.swyx [00:54:43]: Can you scroll down to this? The, the, the, the trade-offs chart. Yeah, yeah. That one. Yeah. This, this really matters, right? Obviously, because you can b
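For readers curious what a "minimalist set of tools and let the model run" harness looks like in practice, here is a generic sketch of that kind of loop. It is not the Stirrup source; call_model and the tool implementations are hypothetical stubs you would supply.

```python
import json

MAX_TURNS = 50
MAX_CONTEXT_MESSAGES = 200  # crude context management: keep only the most recent messages

def run_agent(task: str, tools: dict, call_model) -> str:
    """tools: name -> callable (e.g. web_search, run_code); call_model: given the message
    history, returns either {"tool": name, "args": {...}} or {"final": answer}."""
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_TURNS):
        messages = messages[-MAX_CONTEXT_MESSAGES:]
        reply = call_model(messages, tool_names=list(tools))
        if "final" in reply:
            return reply["final"]
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": str(result)[:20_000]})  # truncate long tool output
    return "Stopped after reaching the turn limit."
```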

Conversations with Tyler
Brendan Foody on Teaching AI and the Future of Knowledge Work

Conversations with Tyler

Play Episode Listen Later Jan 7, 2026 61:18


At 22, Brendan Foody is both the youngest Conversations with Tyler guest ever and the youngest unicorn founder on record. His company Mercor hires the experts who train frontier AI models—from poets grading verse to economists building evaluation frameworks—and has become one of the fastest-growing startups in history. Tyler and Brendan discuss why Mercor pays poets $150 an hour, why AI labs need rubrics more than raw text, whether we should enshrine the aesthetic standards of past eras rather than current ones, how quickly models are improving at economically valuable tasks, how long until AI can stump Cass Sunstein, the coming shift toward knowledge workers building RL environments instead of doing repetitive analysis, how to interview without falling for vibes, why nepotism might make a comeback as AI optimizes everyone's cover letters, scaling the Thiel Fellowship 100,000X, what his 8th-grade donut empire taught him about driving out competition, the link between dyslexia and entrepreneurship, dining out and dating in San Francisco, Mercor's next steps, and more. Read a full transcript enhanced with helpful links, or watch the full video on the new dedicated Conversations with Tyler channel. Recorded October 16th, 2025. Other ways to connect Follow us on X and Instagram Follow Tyler on X Follow Brendan on X Sign up for our newsletter Join our Discord Email us: cowenconvos@mercatus.gmu.edu Learn more about Conversations with Tyler and other Mercatus Center podcasts here. Timestamps 00:00:00 - Hiring poets to teach AI 00:05:29 - Measuring real-world AI progress  00:13:25 - Why rubrics are the new oil  00:18:44 - Enshrining taste in LLMs 00:22:38 - Turning society into one giant RL machine 00:26:37 - When AI will stump experts 00:30:46 - AI and employment 00:35:05 - Why vibes-based hiring fails 00:39:55 - Solving labor market matching problems  00:45:01 - Scaling the Thiel Fellowship  00:48:11 - A hypothetical gap year 00:50:31 - Donuts, debates, and dyslexia 00:56:15 - Dating and dining out 00:59:01 - Mercor's next steps

Behind The Bunker's Podcast
Episode 601: What Paintball Need Right Now?! EP 598

Behind The Bunker's Podcast

Play Episode Listen Later Jan 6, 2026 62:35


Help support the free broadcast by donating to our PayPal fundraiser! https://www.paypal.com/ncp/payment/RL... Behind the Bunker Paintball Podcast is a long-running weekly show dedicated to everything paintball. Hosted by passionate players and industry veterans, the podcast dives into the latest happenings in the sport, from new gear releases and product reviews to updates on tournaments and events around the world. It has built a loyal audience by combining serious paintball discussion with a lighthearted, entertaining approach that keeps both new players and seasoned veterans engaged.

Training Data
Training General Robots for Any Task: Physical Intelligence's Karol Hausman and Tobi Springenberg

Training Data

Play Episode Listen Later Jan 6, 2026 61:37


Physical Intelligence's Karol Hausman and Tobi Springenberg believe that robotics has been held back not by hardware limitations, but by an intelligence bottleneck that foundation models can solve. Their end-to-end learning approach combines vision, language, and action into models like π0 and π*0.6, enabling robots to learn generalizable behaviors rather than task-specific programs. The team prioritizes real-world deployment and uses RL from experience to push beyond what imitation learning alone can achieve. Their philosophy—that a single general-purpose model can handle diverse physical tasks across different robot embodiments—represents a fundamental shift in how we think about building intelligent machines for the physical world. Hosted by Alfred Lin and Sonya Huang, Sequoia Capital
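A schematic of the vision-language-action interface this description implies, in illustrative Python; the class and method names are assumptions for the sketch, not Physical Intelligence's actual API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    images: list[np.ndarray]    # camera frames from the robot's viewpoints
    instruction: str            # natural-language task, e.g. "fold the towel"
    proprioception: np.ndarray  # joint positions, gripper state, etc.

class VLAPolicy:
    """One general model maps (vision, language, robot state) to a chunk of future actions,
    rather than a task-specific program per behavior."""
    def __init__(self, model):
        self.model = model  # a pretrained vision-language-action model (hypothetical handle)

    def act(self, obs: Observation, horizon: int = 50) -> np.ndarray:
        # Returns an action chunk of shape (horizon, action_dim) to execute before re-planning.
        return self.model.predict(obs.images, obs.instruction, obs.proprioception, horizon)
```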

The Effortless Podcast
Alex Dimakis: The Future of Long-Horizon AI Agents - Episode 21: The Effortless Podcast

The Effortless Podcast

Play Episode Listen Later Jan 6, 2026 92:12


In this episode of The Effortless Podcast, Amit Prakash and Dheeraj Pandey are joined by Alex Dimakis for a wide-ranging, systems-first discussion on the future of long-horizon AI agents that can operate over time, learn from feedback, adapt to users, and function reliably inside real-world environments.
The conversation spans research and industry, unpacking why prompt engineering alone collapses at scale; how advisor models, reward-driven learning, and environment-based evaluation enable continual improvement without retraining frontier models; and why memory in AI systems is as much about forgetting as it is about recall. Drawing from distributed systems, reinforcement learning, and cognitive science, the trio explores how personalization, benchmarks, and context engineering are becoming the foundation of AI-native software.
Alex, Dheeraj, and Amit also examine the evolution from SFT to RL to JEPA-style world models, the role of harnesses and benchmarks in measuring real progress, and why enterprise AI has moved decisively from research into engineering. The result is a candid, deeply technical conversation about what it will actually take to move beyond demos and build agents that work over long horizons.
Key Topics & Timestamps
00:00 – Introduction, context, and holiday catch-up
04:00 – Teaching in the age of AI and why cognitive "exercise" still matters
08:00 – Industry sentiment: fear, trust, and skepticism around LLMs
12:00 – Memory in AI systems: documents, transcripts, and limits of recall
17:00 – Why forgetting is a feature, not a bug
22:00 – Advisor models and dynamic prompt augmentation
27:00 – Data vs metadata: control planes vs data planes in AI systems
32:00 – Personalization, rewards, and learning user preferences implicitly
37:00 – Why prompt-only workflows break down at scale
41:00 – RAG, advice, and moving beyond retrieval-centric systems
46:00 – Long-horizon agents and the limits of reflection-based prompting
51:00 – Environments, rewards, and agent-centric evaluation
56:00 – From Q&A benchmarks to agents that act in the world
1:01:00 – Terminal Bench, harnesses, and measuring real agent progress
1:06:00 – Frontier labs, open source, and the pace of change
1:11:00 – Context engineering as infrastructure ("the train tracks" analogy)
1:16:00 – Organizing agents: permissions, visibility, and enterprise structure
1:20:00 – SFT vs RL: imitation first, reinforcement last
1:25:00 – Anti-fragility, trial-and-error, and unsolved problems in continual learning
1:28:00 – Closing reflections on the future of long-horizon AI agents
Hosts:
Amit Prakash: CEO & Founder at AmpUp, former engineer at Google AdSense and Microsoft Bing, with deep expertise in distributed systems, data platforms, and machine learning.
Dheeraj Pandey: Co-founder & CEO at DevRev, former Co-founder & CEO of Nutanix. A systems thinker and product visionary focused on AI, software architecture, and the future of work.
Guest:
Alex Dimakis is a Professor at UC Berkeley in the EECS department. He received his Ph.D. from UC Berkeley and his Diploma degree from the National Technical University of Athens, Greece. He has published more than 150 papers and received several awards including the James Massey Award, NSF CAREER, a Google research award, the UC Berkeley Eli Jury dissertation award, and several best paper awards. He is an IEEE Fellow for contributions to distributed coding and learning. His research interests include Generative AI, Information Theory and Machine Learning.
He co-founded Bespoke Labs, a startup focusing on data curation for specialized agents.
Follow the Hosts and the Guest:
Dheeraj Pandey: LinkedIn - https://www.linkedin.com/in/dpandey | Twitter - https://x.com/dheeraj
Amit Prakash: LinkedIn - https://www.linkedin.com/in/amit-prak... | Twitter - https://x.com/amitp42
Alex Dimakis: LinkedIn - https://www.linkedin.com/in/alex-dima... | Twitter - https://x.com/AlexGDimakis
Share Your Thoughts
Have questions, comments, or ideas for future episodes?

Crazy Wisdom
Episode #520: Training Super Intelligence One Simulated Workflow at a Time

Crazy Wisdom

Play Episode Listen Later Jan 5, 2026 50:04


In this episode of the Crazy Wisdom podcast, host Stewart Alsop sits down with Josh Halliday, who works on training super intelligence with frontier data at Turing. The conversation explores the fascinating world of reinforcement learning (RL) environments, synthetic data generation, and the crucial role of high-quality human expertise in AI training. Josh shares insights from his years working at Unity Technologies building simulated environments for everything from oil and gas safety scenarios to space debris detection, and discusses how the field has evolved from quantity-focused data collection to specialized, expert-verified training data that's becoming the key bottleneck in AI development. They also touch on the philosophical implications of our increasing dependence on AI technology and the emerging job market around AI training and data acquisition.
Timestamps
00:00 Introduction to AI and Reinforcement Learning
03:12 The Evolution of AI Training Data
05:59 Gaming Engines and AI Development
08:51 Virtual Reality and Robotics Training
11:52 The Future of Robotics and AI Collaboration
14:55 Building Applications with AI Tools
17:57 The Philosophical Implications of AI
20:49 Real-World Workflows and RL Environments
26:35 The Impact of Technology on Human Cognition
28:36 Cultural Resistance to AI and Data Collection
31:12 The Bottleneck of High-Quality Data in AI
32:57 Philosophical Perspectives on Data
35:43 The Future of AI Training and Human Collaboration
39:09 The Role of Subject Matter Experts in Data Quality
43:20 The Evolution of Work in the Age of AI
46:48 Convergence of AI and Human Experience
Key Insights
1. Reinforcement Learning environments are sophisticated simulations that replicate real-world enterprise workflows and applications. These environments serve as training grounds for AI agents by creating detailed replicas of tools like Salesforce, complete with specific tasks and verification systems. The agent attempts tasks, receives feedback on failures, and iterates until achieving consistent success rates, effectively learning through trial and error in a controlled digital environment (a minimal sketch of this loop follows the list below).
2. Gaming engines like Unity have evolved into powerful platforms for generating synthetic training data across diverse industries. From oil and gas companies needing hazardous scenario data to space intelligence firms tracking orbital debris, these real-time 3D engines with advanced physics can create high-fidelity simulations that capture edge cases too dangerous or expensive to collect in reality, bridging the gap where real-world data falls short.
3. The bottleneck in AI development has fundamentally shifted from data quantity to data quality. The industry has completely reversed course from the previous "scale at all costs" approach to focusing intensively on smaller, higher-quality datasets curated by subject matter experts. This represents a philosophical pivot toward precision over volume in training next-generation AI systems.
4. Remote teleoperation through VR is creating a new global workforce for robotics training. Workers wearing VR headsets can remotely control humanoid robots across the globe, teaching them tasks through direct demonstration. This creates opportunities for distributed talent while generating the nuanced human behavioral data needed to train autonomous systems.
5. Human expertise remains irreplaceable in the AI training pipeline despite advancing automation. Subject matter experts provide crucial qualitative insights that go beyond binary evaluations, offering the contextual "why" and "how" that transforms raw data into meaningful training material. The challenge lies in identifying, retaining, and properly incentivizing these specialists as demand intensifies.
6. First-person perspective data collection represents the frontier of human-like AI training. Companies are now paying people to life-log their daily experiences, capturing petabytes of egocentric data to train models more similarly to how human children learn through constant environmental observation, rather than traditional batch-processing approaches.
7. The convergence of simulation, robotics, and AI is creating unprecedented philosophical and practical challenges. As synthetic worlds become indistinguishable from reality and AI agents gain autonomy, we're entering a phase where the boundaries between digital and physical, human and artificial intelligence, become increasingly blurred, requiring careful consideration of dependency, agency, and the preservation of human capabilities.
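A minimal sketch of the attempt-verify-iterate loop from insight 1, with hypothetical agent and env objects standing in for a real simulated-workflow environment and learner:

```python
def train_in_environment(agent, env, episodes: int = 1000, target_success: float = 0.9) -> float:
    """Run the agent against a simulated workflow task until it passes verification consistently.
    `env` is assumed to expose reset/step/verify; `agent` exposes act/update."""
    successes: list[float] = []
    for _ in range(episodes):
        state, done, trajectory = env.reset(), False, []
        while not done:
            action = agent.act(state)          # e.g. click, type, or call an API in the replica app
            state, done = env.step(action)
            trajectory.append((state, action))
        reward = 1.0 if env.verify() else 0.0  # automated checker on the final workflow state
        agent.update(trajectory, reward)       # learn from the outcome and try again
        successes.append(reward)
        recent = successes[-100:]
        if len(recent) == 100 and sum(recent) / 100 >= target_success:
            break                              # consistent success rate reached
    return sum(successes[-100:]) / len(successes[-100:])
```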

The KickASK Podcast
TDC #081: Feeling Stuck? You're Closer to a Breakthrough Than You Think

The KickASK Podcast

Play Episode Listen Later Jan 3, 2026 3:59 Transcription Available


TDC #081: Feeling Stuck? You're Closer to a Breakthrough Than You Think
The moment you feel most stalled might be the exact moment you're closest to your next big shift.
Episode Summary: In this episode of The Digital Contrarian, host Ryan Levesque explores the surprising pattern behind how breakthroughs actually happen—and why they rarely arrive when you're grinding hardest. You'll learn why creative block often precedes insight, what a 200-year-old sugar maple teaches about adversity, and three reflection questions to turn your current setback into your next breakthrough.
Question of the Day: What's a recent setback you've faced—or a breakthrough you've just experienced? I'd love to hear from you in the comments.
Key Take-aways
Feeling stuck is often a signal you're close to a breakthrough, not far from one
Breakthroughs appear after frustration—sometimes only when you finally let go
Moderate adversity strengthens us (the science proves it)
Creative block frequently precedes your biggest insights
Looking at the problem from a new angle unlocks what grinding couldn't
Timestamped Outline
0:00 – The Surprising Nature of Breakthroughs
0:29 – Trudy Ederle's Story: Courage, Grit & Breakthroughs
0:44 – My Recent Manuscript Breakthrough (Return to Real)
1:10 – The Open Heart Surgery Phase of Editing
1:32 – Focus: The Contrarian Canon Part Two
2:00 – How to Turn Your Setback Into Your Next Big Breakthrough
2:26 – When Feeling Stalled Means You're Close to a Shift
2:40 – Setbacks, Flow & Detaching from the Outcome
3:02 – Why Creative Block Frequently Precedes Insight
3:29 – Engagement Prompt: Share Your Setback or Breakthrough
3:40 – Back to Manuscript Editing: No Food, No Water, Just Words
3:48 – Remember to Hug the Ones You Love
Links & Resources
Issue 051 of The Digital Contrarian – "How to Turn Your Current Setback Into Your Next Big Breakthrough" → https://ryanlevesque.net/setbacks-breakthroughs/
Issue 051 Video → https://youtu.be/XQTjQZ6cwf8
Return to Real Book Waitlist → https://ryanlevesque.net/return-to-real-book/
The Digital Contrarian Newsletter → https://thedigitalcontrarian.com
Connect & CTA
Enjoyed this? Subscribe & leave a review on Apple Podcasts.
Join 100,000+ digital entrepreneurs who get Ryan Levesque's "Strategic Insights for Digital Entrepreneurs Who Think Differently" every weekend: https://thedigitalcontrarian.com
Credits: Host: Ryan Levesque © 2026 RL & Associates LLC. All rights reserved.

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
[NeurIPS Best Paper] 1000 Layer Networks for Self-Supervised RL — Kevin Wang et al, Princeton

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Jan 2, 2026 28:18


From undergraduate research seminars at Princeton to winning Best Paper award at NeurIPS 2025, Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, Benjamin Eysenbach defied conventional wisdom by scaling reinforcement learning networks to 1,000 layers deep—unlocking performance gains that the RL community thought impossible. We caught up with the team live at NeurIPS to dig into the story behind RL1000: why deep networks have worked in language and vision but failed in RL for over a decade (spoiler: it's not just about depth, it's about the objective), how they discovered that self-supervised RL (learning representations of states, actions, and future states via contrastive learning) scales where value-based methods collapse, the critical architectural tricks that made it work (residual connections, layer normalization, and a shift from regression to classification), why scaling depth is more parameter-efficient than scaling width (linear vs. quadratic growth), how Jax and GPU-accelerated environments let them collect hundreds of millions of transitions in hours (the data abundance that unlocked scaling in the first place), the "critical depth" phenomenon where performance doesn't just improve—it multiplies once you cross 15M+ transitions and add the right architectural components, why this isn't just "make networks bigger" but a fundamental shift in RL objectives (their code doesn't have a line saying "maximize rewards"—it's pure self-supervised representation learning), how deep teacher, shallow student distillation could unlock deployment at scale (train frontier capabilities with 1000 layers, distill down to efficient inference models), the robotics implications (goal-conditioned RL without human supervision or demonstrations, scaling architecture instead of scaling manual data collection), and their thesis that RL is finally ready to scale like language and vision—not by throwing compute at value functions, but by borrowing the self-supervised, representation-learning paradigms that made the rest of deep learning work. We discuss: The self-supervised RL objective: instead of learning value functions (noisy, biased, spurious), they learn representations where states along the same trajectory are pushed together, states along different trajectories are pushed apart—turning RL into a classification problem Why naive scaling failed: doubling depth degraded performance, doubling again with residual connections and layer norm suddenly skyrocketed performance in one environment—unlocking the "critical depth" phenomenon Scaling depth vs. 
width: depth grows parameters linearly, width grows quadratically—depth is more parameter-efficient and sample-efficient for the same performance The Jax + GPU-accelerated environments unlock: collecting thousands of trajectories in parallel meant data wasn't the bottleneck, and crossing 15M+ transitions was when deep networks really paid off The blurring of RL and self-supervised learning: their code doesn't maximize rewards directly, it's an actor-critic goal-conditioned RL algorithm, but the learning burden shifts to classification (cross-entropy loss, representation learning) instead of TD error regression Why scaling batch size unlocks at depth: traditional RL doesn't benefit from larger batches because networks are too small to exploit the signal, but once you scale depth, batch size becomes another effective scaling dimension — RL1000 Team (Princeton) 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities: https://openreview.net/forum?id=s0JVsx3bx1 Chapters 00:00:00 Introduction: Best Paper Award and NeurIPS Poster Experience 00:01:11 Team Introductions and Princeton Research Origins 00:03:35 The Deep Learning Anomaly: Why RL Stayed Shallow 00:04:35 Self-Supervised RL: A Different Approach to Scaling 00:05:13 The Breakthrough Moment: Residual Connections and Critical Depth 00:07:15 Architectural Choices: Borrowing from ResNets and Avoiding Vanishing Gradients 00:07:50 Clarifying the Paper: Not Just Big Networks, But Different Objectives 00:08:46 Blurring the Lines: RL Meets Self-Supervised Learning 00:09:44 From TD Errors to Classification: Why This Objective Scales 00:11:06 Architecture Details: Building on Braw and SymbaFowl 00:12:05 Robotics Applications: Goal-Conditioned RL Without Human Supervision 00:13:15 Efficiency Trade-offs: Depth vs Width and Parameter Scaling 00:15:48 JAX and GPU-Accelerated Environments: The Data Infrastructure 00:18:05 World Models and Next State Classification 00:22:37 Unlocking Batch Size Scaling Through Network Capacity 00:24:10 Compute Requirements: State-of-the-Art on a Single GPU 00:21:02 Future Directions: Distillation, VLMs, and Hierarchical Planning 00:27:15 Closing Thoughts: Challenging Conventional Wisdom in RL Scaling
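The contrastive objective described above is easier to see in code. Below is a minimal, hypothetical sketch (plain NumPy, not the authors' implementation) of the core idea: embed (state, action) pairs and future states from the same batch, score every pairing, and train with a cross-entropy loss so that same-trajectory pairs are pulled together and all others pushed apart. The encoders phi and psi, and all names here, are illustrative stand-ins.

```python
# Minimal sketch of the contrastive self-supervised RL objective described above
# (illustrative, not the authors' code). Assumes encoders phi(s, a) and psi(s_future)
# have already produced the embeddings passed in.
import numpy as np

def contrastive_rl_loss(sa_embed: np.ndarray, goal_embed: np.ndarray) -> float:
    """InfoNCE-style loss over a batch.

    sa_embed:   (B, D) embeddings of (state, action) pairs.
    goal_embed: (B, D) embeddings of future states from the same trajectories;
                row i is the positive example for row i.
    """
    logits = sa_embed @ goal_embed.T              # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries are the "correct classes": same-trajectory pairs.
    return float(-np.mean(np.diag(log_probs)))

def residual_block(x, W1, b1, W2, b2):
    """Pre-LayerNorm residual MLP block, the kind of component credited with
    unlocking depth: y = x + W2 @ relu(W1 @ layernorm(x))."""
    h = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
    h = np.maximum(h @ W1 + b1, 0.0)
    return x + h @ W2 + b2
```

Framed this way the critic is a B-way classifier rather than a TD-error regressor. And because an MLP's parameter count grows roughly as depth times width squared, stacking more residual blocks adds parameters linearly while widening adds them quadratically, which is the efficiency argument made above.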

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Dec 31, 2025


From pre-training data curation to shipping GPT-4o, o1, o3, and now GPT-5 thinking and the shopping model, Josh McGrath has lived through the full arc of OpenAI's post-training evolution—from the PPO vs DPO debates of 2023 to today's RLVR era, where the real innovation isn't optimization methods but data quality, signal trust, and token efficiency. We sat down with Josh at NeurIPS 2025 to dig into the state of post-training heading into 2026: why RLHF and RLVR are both just policy gradient methods (the difference is the input data, not the math), how GRPO from DeepSeek Math was underappreciated as a shift toward more trustworthy reward signals (math answers you can verify vs. human preference you can't), why token efficiency matters more than wall-clock time (GPT-5 to 5.1 bumped evals and slashed tokens), how Codex has changed his workflow so much he feels "trapped" by 40-minute design sessions followed by 15-minute agent sprints, the infrastructure chaos of scaling RL ("way more moving parts than pre-training"), why long context will keep climbing but agents + graph walks might matter more than 10M-token windows, the shopping model as a test bed for interruptability and chain-of-thought transparency, why personality toggles (Anton vs Clippy) are a real differentiator users care about, and his thesis that the education system isn't producing enough people who can do both distributed systems and ML research—the exact skill set required to push the frontier when the bottleneck moves every few weeks. We discuss: Josh's path: pre-training data curation → post-training researcher at OpenAI, shipping GPT-4o, o1, o3, GPT-5 thinking, and the shopping model Why he switched from pre-training to post-training: "Do I want to make 3% compute efficiency wins, or change behavior by 40%?" The RL infrastructure challenge: way more moving parts than pre-training (tasks, grading setups, external partners), and why babysitting runs at 12:30am means jumping into unfamiliar code constantly How Codex has changed his workflow: 40-minute design sessions compressed into 15-minute agent sprints, and the strange "trapped" feeling of waiting for the agent to finish The RLHF vs RLVR debate: both are policy gradient methods, the real difference is data quality and signal trust (human preference vs. 
verifiable correctness) Why GRPO (from DeepSeek Math) was underappreciated: not just an optimization trick, but a shift toward reward signals you can actually trust (math answers over human vibes) The token efficiency revolution: GPT-5 to 5.1 bumped evals and slashed tokens, and why thinking in tokens (not wall-clock time) unlocks better tool-calling and agent workflows Personality toggles: Anton (tool, no warmth) vs Clippy (friendly, helpful), and why Josh uses custom instructions to make his model "just a tool" The router problem: having a router at the top (GPT-5 thinking vs non-thinking) and an implicit router (thinking effort slider) creates weird bumps, and why the abstractions will eventually merge Long context: climbing Graph Blocks evals, the dream of 10M+ token windows, and why agents + graph walks might matter more than raw context length Why the education system isn't producing enough people who can do both distributed systems and ML research, and why that's the bottleneck for frontier labs The 2026 vision: neither pre-training nor post-training is dead, we're in the fog of war, and the bottleneck will keep moving (so emotional stability helps) — Josh McGrath OpenAI: https://openai.com https://x.com/j_mcgraph Chapters 00:00:00 Introduction: Josh McGrath on Post-Training at OpenAI 00:04:37 The Shopping Model: Black Friday Launch and Interruptability 00:07:11 Model Personality and the Anton vs Clippy Divide 00:08:26 Beyond PPO vs DPO: The Data Quality Spectrum in RL 00:01:40 Infrastructure Challenges: Why Post-Training RL is Harder Than Pre-Training 00:13:12 Token Efficiency: The 2D Plot That Matters Most 00:03:45 Codex Max and the Flow Problem: 40 Minutes of Planning, 15 Minutes of Waiting 00:17:29 Long Context and Graph Blocks: Climbing Toward Perfect Context 00:21:23 The ML-Systems Hybrid: What's Hard to Hire For 00:24:50 Pre-Training Isn't Dead: Living Through Technological Revolution
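Since GRPO comes up as an underappreciated shift, here is a minimal, hypothetical sketch of its core trick as described in the DeepSeek Math paper: sample a group of completions per prompt, score each with a verifiable reward, and use the group-normalized score as the advantage, with no learned value function required. Function names are illustrative, not OpenAI's or DeepSeek's code.

```python
# Hypothetical sketch of group-relative advantages (the core of GRPO).
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize per-completion rewards within a group sampled from one prompt.

    With a verifiable signal (e.g. 1.0 if the math answer checks out, else 0.0),
    each completion's advantage is simply its z-score within the group.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 8 samples for one prompt, graded by an answer checker.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]))
```

Each completion's tokens are then reinforced in proportion to that advantage, which is why the trustworthiness of the reward signal (a checkable answer versus a noisy preference) matters more than the choice of optimizer.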

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
[State of RL/Reasoning] IMO/IOI Gold, OpenAI o3/GPT-5, and Cursor Composer — Ashvin Nair, Cursor

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Dec 30, 2025


From Berkeley robotics and OpenAI's 2017 Dota-era internship to shipping RL breakthroughs on GPT-4o, o1, and o3, and now leading model development at Cursor, Ashvin Nair has done it all. We caught up with Ashvin at NeurIPS 2025 to dig into the inside story of OpenAI's reasoning team (spoiler: it went from a dozen people to 300+), why IOI Gold felt reachable in 2022 but somehow didn't change the world when o1 actually achieved it, how RL doesn't generalize beyond the training distribution (and why that means you need to bring economically useful tasks into distribution by co-designing products and models), the deeper lessons from the RL research era (2017–2022) and why most of it didn't pan out because the community overfitted to benchmarks, how Cursor is uniquely positioned to do continual learning at scale with policy updates every two hours and product-model co-design that keeps engineers in the loop instead of context-switching into ADHD hell, and his bet that the next paradigm shift is continual learning with infinite memory—where models experience something once (a bug, a mistake, a user pattern) and never forget it, storing millions of deployment tokens in weights without overloading capacity. We discuss: Ashvin's path: Berkeley robotics PhD → OpenAI 2017 intern (Dota era) → o1/o3 reasoning team → Cursor ML lead in three months Why robotics people are the most grounded at NeurIPS (they work with the real world) and simulation people are the most unhinged (Lex Fridman's take) The IOI Gold paradox: "If you told me we'd achieve IOI Gold in 2022, I'd assume we could all go on vacation—AI solved, no point working anymore. But life is still the same." The RL research era (2017–2022) and why most of it didn't pan out: overfitting to benchmarks, too many implicit knobs to tune, and the community rewarding complex ideas over simple ones that generalize Inside the o1 origin story: a dozen people, conviction from Ilya and Jakob Pachocki that RL would work, small-scale prototypes producing "surprisingly accurate reasoning traces" on math, and first-principles belief that scaled The reasoning team grew from ~12 to 300+ people as o1 became a product and safety, tooling, and deployment scaled up Why Cursor is uniquely positioned for continual learning: policy updates every two hours (online RL on tab), product and ML sitting next to each other, and the entire software engineering workflow (code, logs, debugging, DataDog) living in the product Composer as the start of product-model co-design: smart enough to use, fast enough to stay in the loop, and built by a 20–25 person ML team with high-taste co-founders who code daily The next paradigm shift: continual learning with infinite memory—models that experience something once (a bug, a user mistake) and store it in weights forever, learning from millions of deployment tokens without overloading capacity (trillions of pretraining tokens = plenty of room) Why off-policy RL is unstable (Ashvin's favorite interview question) and why Cursor does two-day work trials instead of whiteboard interviews The vision: automate software engineering as a process (not just answering prompts), co-design products so the entire workflow (write code, check logs, debug, iterate) is in-distribution for RL, and make models that never make the same mistake twice — Ashvin Nair Cursor: https://cursor.com X: https://x.com/ashvinnair_ Chapters 00:00:00 Introduction: From Robotics to Cursor via OpenAI 00:01:58 The Robotics to LLM Agent Transition: Why Code Won 00:09:11 RL Research 
Winter and Academic Overfitting 00:11:45 The Scaling Era and Moving Goalposts: IOI Gold Doesn't Mean AGI 00:21:30 OpenAI's Reasoning Journey: From Codex to O1 00:20:03 The Blip: Thanksgiving 2023 and OpenAI Governance 00:22:39 RL for Reasoning: The O-Series Conviction and Scaling 00:25:47 O1 to O3: Smooth Internal Progress vs External Hype Cycles 00:33:07 Why Cursor: Co-Designing Products and Models for Real Work 00:34:14 Composer and the Future: Online Learning Every Two Hours 00:35:15 Continual Learning: The Missing Paradigm Shift 00:44:00 Hiring at Cursor and Why Off-Policy RL is Unstable
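Ashvin's favorite interview question (why is off-policy RL unstable?) is easiest to see in the importance-sampling ratio. A minimal sketch, assuming a generic clipped policy-gradient setup rather than any particular lab's recipe: as the behavior policy that generated the data drifts away from the current policy, the ratio explodes or vanishes, which is exactly why frequent on-policy rollouts (e.g. policy updates every two hours) keep training stable.

```python
# Minimal sketch: the importance weight that makes off-policy updates fragile.
import math

def importance_weight(logp_current: float, logp_behavior: float,
                      clip: float | None = 0.2) -> float:
    """pi_current(a|s) / pi_behavior(a|s), optionally clipped PPO-style.

    On-policy: the data was just sampled from pi_current, so the ratio is ~1
    and the gradient estimate is low-variance. Off-policy: as the policies
    diverge, the ratio (a product over many tokens in practice) blows up or
    collapses, and the variance of the update grows with it.
    """
    ratio = math.exp(logp_current - logp_behavior)
    if clip is not None:
        ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return ratio

print(importance_weight(-1.0, -1.05))             # near on-policy: ~1.05
print(importance_weight(-1.0, -6.0, clip=None))   # stale data: ~148, a huge-variance update
```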

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
[State of AI Startups] Memory/Learning, RL Envs & DBT-Fivetran — Sarah Catanzaro, Amplify

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Dec 30, 2025


From investing through the modern data stack era (DBT, Fivetran, and the analytics explosion) to now investing at the frontier of AI infrastructure and applications at Amplify Partners, Sarah Catanzaro has spent years at the intersection of data, compute, and intelligence—watching categories emerge, merge, and occasionally disappoint. We caught up with Sarah live at NeurIPS 2025 to dig into the state of AI startups heading into 2026: why $100M+ seed rounds with no near-term roadmap are now the norm (and why that terrifies her), what the DBT-Fivetran merger really signals about the modern data stack (spoiler: it's not dead, just ready for IPO), how frontier labs are using DBT and Fivetran to manage training data and agent analytics at scale, why data catalogs failed as standalone products but might succeed as metadata services for agents, the consumerization of AI and why personalization (memory, continual learning, K-factor) is the 2026 unlock for retention and growth, why she thinks RL environments are a fad and real-world logs beat synthetic clones every time, and her thesis for the most exciting AI startups: companies that marry hard research problems (RAG, rule-following, continual learning) with killer applications that were simply impossible before. We discuss: The DBT-Fivetran merger: not the death of the modern data stack, but a path to IPO scale (targeting $600M+ combined revenue) and a signal that both companies were already winning their categories How frontier labs use data infrastructure: DBT and Fivetran for training data curation, agent analytics, and managing increasingly complex interactions—plus the rise of transactional databases (RocksDB) and efficient data loading (Vortex) for GPU-bound workloads Why data catalogs failed: built for humans when they should have been built for machines, focused on discoverability when the real opportunity was governance, and ultimately subsumed as features inside Snowflake, DBT, and Fivetran The $100M+ seed phenomenon: raising massive rounds at billion-dollar valuations with no 6-month roadmap, seven-day decision windows, and founders optimizing for signal ("we're a unicorn") over partnership or dilution discipline Why world models are overhyped but underspecified: three competing definitions, unclear generalization across use cases (video games ≠ robotics ≠ autonomous driving), and a research problem masquerading as a product category The 2026 theme: consumerization of AI via personalization—memory management, continual learning, and solving retention/churn by making products learn skills, preferences, and adapt as the world changes (not just storing facts in cursor rules) Why RL environments are a fad: labs are paying 7–8 figures for synthetic clones when real-world logs, traces, and user activity (à la Cursor) are richer, cheaper, and more generalizable Sarah's investment thesis: research-driven applications that solve hard technical problems (RAG for Harvey, rule-following for Sierra, continual learning for the next killer app) and unlock experiences that were impossible before Infrastructure bets: memory, continual learning, stateful inference, and the systems challenges of loading/unloading personalized weights at scale Why K-factor and growth fundamentals matter again: AI felt magical in 2023–2024, but as the magic fades, retention and virality are back—and most AI founders have never heard of K-factor — Sarah Catanzaro X: https://x.com/sarahcat21 Amplify Partners: https://amplifypartners.com/ Where to find Latent Space X: 
https://x.com/latentspacepod Substack: https://www.latent.space/ Chapters 00:00:00 Introduction: Sarah Catanzaro's Journey from Data to AI 00:01:02 The DBT-Fivetran Merger: Not the End of the Modern Data Stack 00:05:26 Data Catalogs and What Went Wrong 00:08:16 Data Infrastructure at AI Labs: Surprising Insights 00:10:13 The Crazy Funding Environment of 2024-2025 00:17:18 World Models: Hype, Confusion, and Market Potential 00:18:59 Memory Management and Continual Learning: The Next Frontier 00:23:27 Agent Environments: Just a Fad? 00:25:48 The Perfect AI Startup: Research Meets Application 00:28:02 Closing Thoughts and Where to Find Sarah
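Since K-factor gets name-checked above as a forgotten growth fundamental, a one-line refresher with illustrative numbers (the standard definition, not anything specific to the companies Sarah mentions): K equals invites sent per user times the conversion rate per invite; K above 1 means compounding viral growth, K below 1 means paid or content acquisition has to carry the product.

```python
# K-factor: the viral growth coefficient mentioned above (illustrative numbers only).
def k_factor(invites_per_user: float, conversion_rate: float) -> float:
    return invites_per_user * conversion_rate

print(k_factor(invites_per_user=3.0, conversion_rate=0.25))  # 0.75 -> sub-viral
print(k_factor(invites_per_user=5.0, conversion_rate=0.25))  # 1.25 -> each cohort grows the next
```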

The Lunar Society
Adam Marblestone – AI is missing something fundamental about the brain

The Lunar Society

Play Episode Listen Later Dec 30, 2025 109:53


Adam Marblestone is CEO of Convergent Research. He's had a very interesting past life: he was a research scientist at Google Deepmind on their neuroscience team and has worked on everything from brain-computer interfaces to quantum computing to nanotech and even formal mathematics. In this episode, we discuss how the brain learns so much from so little, what the AI field can learn from neuroscience, and the answer to Ilya's question: how does the genome encode abstract reward functions? Turns out, they're all the same question. Watch on YouTube; read the transcript.

Sponsors
* Gemini 3 Pro recently helped me run an experiment to test multi-agent scaling: basically, if you have a fixed budget of compute, what is the optimal way to split it up across agents? Gemini was my colleague throughout the process — honestly, I couldn't have investigated this question without it. Try Gemini 3 Pro today: gemini.google.com
* Labelbox helps you train agents to do economically-valuable, real-world tasks. Labelbox's network of subject-matter experts ensures you get hyper-realistic RL environments, and their custom tooling lets you generate the highest-quality training data possible from those environments. Learn more at labelbox.com/dwarkesh

To sponsor a future episode, visit dwarkesh.com/advertise.

Timestamps
(00:00:00) – The brain's secret sauce is the reward functions, not the architecture
(00:22:20) – Amortized inference and what the genome actually stores
(00:42:42) – Model-based vs model-free RL in the brain
(00:50:31) – Is biological hardware a limitation or an advantage?
(01:03:59) – Why a map of the human brain is important
(01:23:28) – What value will automating math have?
(01:38:18) – Architecture of the brain

Further reading
Intro to Brain-Like-AGI Safety - Steven Byrnes's theory of the learning vs steering subsystem; referenced throughout the episode.
A Brief History of Intelligence - Great book by Max Bennett on connections between neuroscience and AI
Adam's blog, and Convergent Research's blog on essential technologies.
A Tutorial on Energy-Based Learning by Yann LeCun
What Does It Mean to Understand a Neural Network? - Kording & Lillicrap
E11 Bio and their brain connectomics approach
Sam Gershman on what dopamine is doing in the brain
Gwern's proposal on training models on the brain's hidden states

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
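For readers who want the model-based vs. model-free distinction from the episode (timestamp 00:42:42) made concrete, here is a minimal textbook-style sketch. It is generic RL, not Adam's framing or code, and V, model, and reward_fn are illustrative stand-ins.

```python
# Textbook-style sketch of the two update styles contrasted in the episode.

def model_free_td_update(V: dict, s, r: float, s_next, alpha=0.1, gamma=0.99) -> dict:
    """Model-free (TD(0)): nudge the cached value of state s toward the observed
    reward plus the discounted cached value of the next state. Cheap and habit-like."""
    v_s = V.get(s, 0.0)
    td_error = r + gamma * V.get(s_next, 0.0) - v_s
    V[s] = v_s + alpha * td_error
    return V

def model_based_value(s, actions, model, reward_fn, V, gamma=0.99) -> float:
    """Model-based: plan with a learned transition model via one-step lookahead,
    backing up imagined successor values instead of waiting for real experience."""
    return max(reward_fn(s, a) + gamma * V.get(model(s, a), 0.0) for a in actions)
```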

The KickASK Podcast
TDC #080: The "Contrarian Cannon" – 5 Overlooked Ideas That Could Shape Your 2026

The KickASK Podcast

Play Episode Listen Later Dec 27, 2025 9:38 Transcription Available


TDC #080 - The Contrarian Canon (Part 1): 5 Overlooked Frameworks for 2026

What if the most powerful business ideas aren't the newest ones, but the ones you missed?

Episode Summary: In this episode of The Digital Contrarian, host Ryan Levesque shares five overlooked frameworks from the "Contrarian Canon"—powerful concepts buried in past newsletter issues. You'll learn how to optimize around permanence instead of change, why struggle creates satisfaction, and discover advanced pricing psychology that increases both clarity and conversion.

Question of the Day

Behind The Bunker's Podcast
Episode 600: Christmas Show With "Ledz" From Planet Eclipse! EP 597

Behind The Bunker's Podcast

Play Episode Listen Later Dec 23, 2025 60:46


Help support the free broadcast by donating to our PayPal fundraiser! https://www.paypal.com/ncp/payment/RL... *Behind the Bunker Paintball Podcast* is a long-running weekly show dedicated to everything paintball. Hosted by passionate players and industry veterans, the podcast dives into the latest happenings in the sport, from new gear releases and product reviews to updates on tournaments and events around the world. It has built a loyal audience by combining serious paintball discussion with a lighthearted, entertaining approach that keeps both new players and seasoned veterans engaged.

The Lunar Society
An audio version of my blog post, Thoughts on AI progress (Dec 2025)

The Lunar Society

Play Episode Listen Later Dec 23, 2025 12:28


Read the essay here.Timestamps00:00:00 What are we scaling?00:03:11 The value of human labor00:05:04 Economic diffusion lag is cope00:06:34 Goal-post shifting is justified00:08:23 RL scaling00:09:18 Broadly deployed intelligence explosion Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

The Biblical Roots Podcast
Christmas Live (Part 4): The Celebration of Christmas

The Biblical Roots Podcast

Play Episode Listen Later Dec 22, 2025 78:28


Send us a text. Recorded Dec 20, 2025 - Enjoy the final episode in our teaching series, "The Biblical Roots of Christmas." This week, we turn to the celebration of Christmas itself—asking not only what we celebrate, but why. We'll explore how the fulfillment found in Jesus Christ naturally gives rise to joyful remembrance, how the early people of God marked God's redemptive acts, and what Scripture teaches us about honoring Christ through meaningful celebration. Let's rediscover together why Christmas is not merely a tradition to defend or dismiss, but a gospel truth to rejoice in.

The Biblical Roots Ministries: Our website | Our YouTube Channel | Prof. Solberg's Blog | Support our Ministry (Thank you!)

What if Christmas felt sacred again? Full of Grace and Truth, the new book from award-winning author R. L. Solberg, invites you to rediscover the biblical story at the heart of the season. Available now in paperback and Kindle, with all proceeds supporting The Biblical Roots Ministries. Get your copy today on Amazon.com.

The KickASK Podcast
TDC 079: The Tension Between Resistance & Productive Procrastination: How to Know When It's Time to Ship

The KickASK Podcast

Play Episode Listen Later Dec 20, 2025 11:57 Transcription Available


TDC 079: The Tension Between Resistance & Productive Procrastination: How to Know When It's Time to Ship

When productive procrastination crosses the line into destructive resistance—and the simple test to know when you must finally ship.

Episode Summary: In this episode of The Digital Contrarian, host Ryan Levesque dives into the critical tension between productive procrastination and destructive resistance. You'll learn how to distinguish between creative incubation and fear-based avoidance, discover why original thinkers are "quick to start but slow to finish," and uncover a simple litmus test for knowing when your creative window is closing.

Question of the Day

Unsupervised Learning
AI Vibe Check: The Actual Bottleneck In Research, SSI's Mystique, & Spicy 2026 Predictions

Unsupervised Learning

Play Episode Listen Later Dec 18, 2025 78:04


Ari Morcos and Rob Toews return for their spiciest conversation yet. Fresh from NeurIPS, they debate whether models are truly plateauing or if we're just myopically focused on LLMs while breakthroughs happen in other modalities.They reveal why infinite capital at labs may actually constrain innovation, explain the narrow "Goldilocks zone" where RL actually works, and argue why U.S. chip restrictions may have backfired catastrophically—accelerating China's path to self-sufficiency by a decade. The conversation covers OpenAI's code red moment and structural vulnerabilities, the mystique surrounding SSI and Ilya's "two words," and why the real bottleneck in AI research is compute, not ideas.The episode closes with bold 2026 predictions: Rob forecasts Sam Altman won't be OpenAI's CEO by year-end, while Ari gives 50%+ odds a Chinese open-source model will be the world's best at least once next year. (0:00) Intro(1:51) Reflections on NeurIPS Conference(5:14) Are AI Models Plateauing?(11:12) Reinforcement Learning and Enterprise Adoption(16:16) Future Research Vectors in AI(28:40) The Role of Neo Labs(39:35) The Myth of the Great Man Theory in Science(41:47) OpenAI's Code Red and Market Position(47:19) Disney and OpenAI's Strategic Partnership(51:28) Meta's Super Intelligence Team Challenges(54:33) US-China AI Chip Dynamics(1:00:54) Amazon's Nova Forge and Enterprise AI(1:03:38) End of Year Reflections and Predictions With your co-hosts:@jacobeffron  - Partner at Redpoint, Former PM Flatiron Health@patrickachase  - Partner at Redpoint, Former ML Engineer LinkedIn@ericabrescia  - Former COO Github, Founder Bitnami (acq'd by VMWare)@jordan_segall  - Partner at Redpoint

The MAD Podcast with Matt Turck
DeepMind Gemini 3 Lead: What Comes After "Infinite Data"

The MAD Podcast with Matt Turck

Play Episode Listen Later Dec 18, 2025 54:56


Gemini 3 was a landmark frontier model launch in AI this year — but the story behind its performance isn't just about adding more compute. In this episode, I sit down with Sebastian Bourgeaud, a pre-training lead for Gemini 3 at Google DeepMind and co-author of the seminal RETRO paper. In his first-ever podcast interview, Sebastian takes us inside the lab mindset behind Google's most powerful model — what actually changed, and why the real work today is no longer “training a model,” but building a full system.We unpack the “secret recipe” idea — the notion that big leaps come from better pre-training and better post-training — and use it to explore a deeper shift in the industry: moving from an “infinite data” era to a data-limited regime, where curation, proxies, and measurement matter as much as web-scale volume. Sebastian explains why scaling laws aren't dead, but evolving, why evals have become one of the hardest and most underrated problems (including benchmark contamination), and why frontier research is increasingly a full-stack discipline that spans data, infrastructure, and engineering as much as algorithms.From the intuition behind Deep Think, to the rise (and risks) of synthetic data loops, to the future of long-context and retrieval, this is a technical deep dive into the physics of frontier AI. We also get into continual learning — what it would take for models to keep updating with new knowledge over time, whether via tools, expanding context, or new training paradigms — and what that implies for where foundation models are headed next. If you want a grounded view of pre-training in late 2025 beyond the marketing layer, this conversation is a blueprint.Google DeepMindWebsite - https://deepmind.googleX/Twitter - https://x.com/GoogleDeepMindSebastian BorgeaudLinkedIn - https://www.linkedin.com/in/sebastian-borgeaud-8648a5aa/X/Twitter - https://x.com/borgeaud_sFIRSTMARKWebsite - https://firstmark.comX/Twitter - https://twitter.com/FirstMarkCapMatt Turck (Managing Director)Blog - https://mattturck.comLinkedIn - https://www.linkedin.com/in/turck/X/Twitter - https://twitter.com/mattturck(00:00) – Cold intro: “We're ahead of schedule” + AI is now a system(00:58) – Oriol's “secret recipe”: better pre- + post-training(02:09) – Why AI progress still isn't slowing down(03:04) – Are models actually getting smarter?(04:36) – Two–three years out: what changes first?(06:34) – AI doing AI research: faster, not automated(07:45) – Frontier labs: same playbook or different bets?(10:19) – Post-transformers: will a disruption happen?(10:51) – DeepMind's advantage: research × engineering × infra(12:26) – What a Gemini 3 pre-training lead actually does(13:59) – From Europe to Cambridge to DeepMind(18:06) – Why he left RL for real-world data(20:05) – From Gopher to Chinchilla to RETRO (and why it matters)(20:28) – “Research taste”: integrate or slow everyone down(23:00) – Fixes vs moonshots: how they balance the pipeline(24:37) – Research vs product pressure (and org structure)(26:24) – Gemini 3 under the hood: MoE in plain English(28:30) – Native multimodality: the hidden costs(30:03) – Scaling laws aren't dead (but scale isn't everything)(33:07) – Synthetic data: powerful, dangerous(35:00) – Reasoning traces: what he can't say (and why)(37:18) – Long context + attention: what's next(38:40) – Retrieval vs RAG vs long context(41:49) – The real boss fight: evals (and contamination)(42:28) – Alignment: pre-training vs post-training(43:32) – Deep Think + agents + “vibe coding”(46:34) – Continual 
learning: updating models over time(49:35) – Advice for researchers + founders(53:35) – “No end in sight” for progress + closing
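Since the episode leans on Chinchilla-era intuitions about moving from "infinite data" to a data-limited regime, here is the standard parametric scaling law as a small sketch. The coefficients roughly follow the published Chinchilla fit and are illustrative only, not Gemini's: loss falls as a power law in both parameters N and tokens D, and under a compute budget C ≈ 6·N·D you trade the two off against each other.

```python
# Chinchilla-style parametric scaling law (illustrative constants, not a Gemini fit).
def chinchilla_loss(N: float, D: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """L(N, D) = E + A / N**alpha + B / D**beta
    N: parameters, D: training tokens, E: irreducible loss."""
    return E + A / N**alpha + B / D**beta

# Fixed compute budget C ~ 6*N*D: compare parameter-heavy vs data-heavy splits.
C = 6 * 70e9 * 1.4e12  # roughly Chinchilla-scale compute
for N in (30e9, 70e9, 140e9):
    D = C / (6 * N)
    print(f"N={N:.0e}  D={D:.1e}  loss={chinchilla_loss(N, D):.3f}")
```

In a data-limited regime the curve for D flattens out of reach, which is why curation, synthetic data, and better measurement carry more of the load than simply scaling web tokens.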

Category Visionaries
How Datawizz discovered the chasm between AI-mature companies and everyone else shaped their ICP | Iddo Gino

Category Visionaries

Play Episode Listen Later Dec 18, 2025 29:10


Datawizz is pioneering continuous reinforcement learning infrastructure for AI systems that need to evolve in production, not ossify after deployment. After building and exiting RapidAPI—which served 10 million developers and had at least one team at 75% of Fortune 500 companies using and paying for the platform—Founder and CEO Iddo Gino returned to building when he noticed a pattern: nearly every AI agent pitch he reviewed as an angel investor assumed models would simultaneously get orders of magnitude better and cheaper. In a recent episode of BUILDERS, we sat down with Iddo to explore why that dual assumption breaks most AI economics, how traditional ML training approaches fail in the LLM era, and why specialized models will capture 50-60% of AI inference by 2030. Topics Discussed Why running two distinct businesses under one roof—RapidAPI's developer marketplace and enterprise API hub—ultimately capped scale despite compelling synergy narratives The "Big Short moment" reviewing AI pitches: every business model assumed simultaneous 1-2 order of magnitude improvements in accuracy and cost Why companies spending 2-3 months on fine-tuning repeatedly saw frontier models (GPT-4, Claude 3) obsolete their custom work The continuous learning flywheel: online evaluation → suspect inference queuing → human validation → daily/weekly RL batches → deployment How human evaluation companies like Scale AI shift from offline batch labeling to real-time inference correction queues Early GTM through LinkedIn DMs to founders running serious agent production volume, working backward through less mature adopters ICP discovery: qualifying on whether 20% accuracy gains or 10x cost reductions would be transformational versus incremental The integration layer approach: orchestrating the continuous learning loop across observability, evaluation, training, and inference tools Why the first $10M is about selling to believers in continuous learning, not evangelizing the category GTM Lessons For B2B Founders Recognize when distribution narratives mask structural incompatibility: RapidAPI had 10 million developers and teams at 75% of Fortune 500 paying for the platform—massive distribution that theoretically fed enterprise sales. The problem: Iddo could always find anecdotes where POC teams had used RapidAPI, creating a compelling story about grassroots adoption. The critical question he should have asked earlier: "Is self-service really the driver for why we're winning deals, or is it a nice-to-have contributor?" When two businesses have fundamentally different product roadmaps, cultures, and buying journeys, distribution overlap doesn't create a sustainable single company. Stop asking if synergies exist—ask if they're causal. Qualify on whether improvements cross phase-transition thresholds: Datawizz disqualifies prospects who acknowledge value but lack acute pain. The diagnostic questions: "If we improved model accuracy by 20%, how impactful is that?" and "If we cut your costs 10x, what does that mean?" Companies already automating human labor often respond that inference costs are rounding errors compared to savings. The ideal customers hit differently: "We need accuracy at X% to fully automate this process and remove humans from the loop. Until then, it's just AI-assisted. Getting over that line is a step-function change in how we deploy this agent." Qualify on whether your improvement crosses a threshold that changes what's possible, not just what's better. 
Use discovery to map market structure, not just validate hypotheses: Iddo validated that the most mature companies run specialized, fine-tuned models in production. The surprise: "The chasm between them and everybody else was a lot wider than I thought." This insight reshaped their entire strategy—the tooling gap, approaches to model development, and timeline to maturity differed dramatically across segments. Most founders use discovery to confirm their assumptions. Better founders use it to understand where different cohorts sit on the maturity curve, what bridges or blocks their progression, and which segments can buy versus which need multi-year evangelism. Target spend thresholds that indicate real commitment: Datawizz focuses on companies spending "at a minimum five to six figures a month on AI and specifically on LLM inference, using the APIs directly"—meaning they're building on top of OpenAI/Anthropic/etc., not just using ChatGPT. This filters for companies with skin in the game. Below that threshold, AI is an experiment. Above it, unit economics and quality bars matter operationally. For infrastructure plays, find the spend level that indicates your problem is a daily operational reality, not a future consideration. Structure discovery to extract insight, not close deals: Iddo's framework: "If I could run [a call where] 29 of 30 minutes could be us just asking questions and learning, that would be the perfect call in my mind." He compared it to "the dentist with the probe trying to touch everything and see where it hurts." The most valuable calls weren't those that converted to POCs—they came from people who approached the problem differently or had conflicting considerations. In hot markets with abundant budgets, founders easily collect false positives by selling when they should be learning. The discipline: exhaust your question list before explaining what you build. If they don't eventually ask "What do you do?" you're not surfacing real pain. Avoid the false-positive trap in well-funded categories: Iddo identified a specific risk in AI: "You can very easily run these calls, you think you're doing discovery, really you're doing sales, you end up getting a bunch of POCs and maybe some paying customers. So you get really good initial signs but you've never done any actual discovery. You have all the wrong indications—you're getting a lot of false positive feedback while building the completely wrong thing." When capital is abundant and your space is hot, early revenue can mask product-market misalignment. Good initial signs aren't validation if you skipped the work to understand why people bought. // Sponsors: Front Lines — We help B2B tech companies launch, manage, and grow podcasts that drive demand, awareness, and thought leadership. www.FrontLines.io The Global Talent Co. — We help tech startups find, vet, hire, pay, and retain amazing marketing talent that costs 50-70% less than the US & Europe. www.GlobalTalent.co // Don't Miss: New Podcast Series — How I Hire Senior GTM leaders share the tactical hiring frameworks they use to build winning revenue teams. Hosted by Andy Mowat, who scaled 4 unicorns from $10M to $100M+ ARR and launched Whispered to help executives find their next role. Subscribe here: https://open.spotify.com/show/53yCHlPfLSMFimtv0riPyM
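The continuous learning flywheel described above (online evaluation → suspect inference queuing → human validation → daily/weekly RL batches → deployment) maps naturally onto a small control loop. Below is a hypothetical sketch of such an orchestration layer; every class and function name is invented for illustration and is not Datawizz's API.

```python
# Hypothetical orchestration of the continuous-learning flywheel described above.
# All names are illustrative; this is not Datawizz's API.
from dataclasses import dataclass, field

@dataclass
class Inference:
    prompt: str
    output: str
    score: float               # from an online evaluator / LLM judge
    label: str | None = None   # filled in by a human reviewer if queued

@dataclass
class ContinuousLearningLoop:
    suspect_threshold: float = 0.6
    review_queue: list[Inference] = field(default_factory=list)
    training_batch: list[Inference] = field(default_factory=list)

    def observe(self, inf: Inference) -> None:
        """Online evaluation: route low-confidence inferences to human review."""
        if inf.score < self.suspect_threshold:
            self.review_queue.append(inf)

    def ingest_reviews(self) -> None:
        """Human validation: corrected inferences become RL training examples."""
        self.training_batch += [i for i in self.review_queue if i.label is not None]
        self.review_queue = [i for i in self.review_queue if i.label is None]

    def periodic_update(self, train_fn) -> None:
        """Daily/weekly RL batch: fine-tune the specialized model, then redeploy."""
        if self.training_batch:
            train_fn(self.training_batch)
            self.training_batch.clear()
```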

Let's Talk AI
#228 - GPT 5.2, Scaling Agents, Weird Generalization

Let's Talk AI

Play Episode Listen Later Dec 17, 2025 86:42


Our 228th episode with a summary and discussion of last week's big AI news!Recorded on 12/12/2025Hosted by Andrey Kurenkov and Jeremie HarrisFeel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.aiRead out our text newsletter and comment on the podcast at https://lastweekin.ai/In this episode:OpenAI's latest model GPT-5.2 demonstrates improved performance and enhanced multi-modal capabilities but comes with increased costs and a different knowledge cutoff date.Disney invests $1 billion in OpenAI to generate Disney character content, creating unique licensing agreements across characters from Marvel, Pixar, and Star Wars franchises.The U.S. government imposes new AI chip export rules involving security reviews, while simultaneously moving to prevent states from independently regulating AI.DeepMind releases a paper outlining the challenges and findings in scaling multi-agent systems, highlighting the complexities of tool coordination and task performance.Timestamps:(00:00:00) Intro / Banter(00:01:19) News PreviewTools & Apps(00:01:58) GPT-5.2 is OpenAI's latest move in the agentic AI battle | The Verge(00:08:48) Runway releases its first world model, adds native audio to latest video model | TechCrunch(00:11:51) Google says it will link to more sources in AI Mode | The Verge(00:12:24) ChatGPT can now use Adobe apps to edit your photos and PDFs for free | The Verge(00:13:05) Tencent releases Hunyuan 2.0 with 406B parametersApplications & Business(00:16:15) China set to limit access to Nvidia's H200 chips despite Trump export approval(00:21:02) Disney investing $1 billion in OpenAI, will allow characters on Sora(00:24:48) Unconventional AI confirms its massive $475M seed round(00:29:06) Slack CEO Denise Dresser to join OpenAI as chief revenue officer | TechCrunch(00:31:18) The state of enterprise AIProjects & Open Source(00:33:49) [2512.10791] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality(00:36:27) Claude 4.5 Opus' Soul DocumentResearch & Advancements(00:43:49) [2512.08296] Towards a Science of Scaling Agent Systems(00:48:43) Evaluating Gemini Robotics Policies in a Veo World Simulator(00:52:10) Guided Self-Evolving LLMs with Minimal Human Supervision(00:56:08) Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning(01:00:39) [2512.07783] On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models(01:04:42) Stabilizing Reinforcement Learning with LLMs: Formulation and Practices(01:09:42) Google's AI unit DeepMind announces UK 'automated research lab'Policy & Safety(01:10:28) Trump Moves to Stop States From Regulating AI With a New Executive Order - The New York Times(01:13:54) [2512.09742] Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs(01:17:57) Forecasting AI Time Horizon Under Compute Slowdowns(01:20:46) AI Security Institute focuses on AI measurements and evaluations(01:21:16) Nvidia AI Chips to Undergo Unusual U.S. Security Review Before Export to China(01:22:01) U.S. Authorities Shut Down Major China-Linked AI Tech Smuggling NetworkSynthetic Media & Art(01:24:01) RSL 1.0 has arrived, allowing publishers to ask AI companies pay to scrape content | The VergeSee Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

Firearms Radio Network (All Shows)
We Like Shooting 641 – Chapter 1

Firearms Radio Network (All Shows)

Play Episode Listen Later Dec 16, 2025


We Like Shooting Episode 641 This episode of We Like Shooting is brought to you by: C&G Holsters, Midwest Industries, Gideon Optics, Primary Arms, Medical Gear Outfitters, Die Free Co., Blue Alpha, and Bowers Group   Welcome to the We Like Shooting Show, episode 641! Our cast tonight is Jeremy Pozderac, Aaron Krieger, Nick Lynch, and me Shawn Herrin, welcome to the show! Text Dear WLS or Reviews. +1 743 500 2171 - Gear Chat Shawn - PopStop™ Review: Innovative Solutions for Shooting Enthusiasts PopStop™ is a device designed to eliminate first round pop (FRP) in suppressors by injecting inert carbon dioxide to replace oxygen, thereby reducing impulse noise and suppressor flash. It has been shown to achieve noise reductions of up to 9 dB and can stabilize velocity standard deviations. The product is not compatible with all firearms, particularly 9mm pistols, and requires specific barrel measurements for proper use. Its introduction aims to enhance suppressor performance within the gun community. Shawn - RL-100 Pre-Order Announcement Cloud Defensive has announced the RL-100, a new entry-level rifle light that combines performance with affordability, priced at $149.99 for early pre-orders. Designed for reliability and ease of use, the RL-100 aims to provide a high-quality lighting option for budget-conscious users and agencies without sacrificing performance. This product's introduction may impact the gun community by offering a cost-effective alternative to higher-priced weapon lights, which could enhance accessibility for everyday users and law enforcement. Shawn - Long Range Shooting Tips Advanced long range shooting by Cleckner Nick - KRG Bravo KRG Bravo Shawn - Hi Point's AR-15 Fun Hi Point AR-15 Shawn - Precision Shooting Simplified Kelbly Precision Element Shawn - C&G Holsters News! C&G Holsters Announcement Jeremy - Savage 24F and Chiappa 12ga barrel inserts Bullet Points Chiappa 44 mag Gun Fights Step right up for "Gun Fights," the high-octane segment hosted by Nick Lynch, where our cast members go head-to-head in a game show-style showdown! Each contestant tries to prove their gun knowledge dominance. It's a wild ride of bids, bluffs, and banter—who will come out on top? Tune in to find out! Agency Brief AGENCY BRIEF: SHAYS' REBELLION  1780 – 1785: Economic Conditions Veterans' Pay: Paid in depreciated Continental currency/IOUs. State Policy: Massachusetts demands taxes in hard currency (gold/silver). The Debt: Boston merchants control state debt; courts aggressively foreclose on farms and imprison debtors. August – October 1786: Escalation Aug 29: 1,500 "Regulators" seize the Northampton courthouse to stop debtor trials. Sept: Armed shutdowns spread to Worcester, Concord, and Great Barrington. Captain Daniel Shays emerges as leader. Sept 26: Shays (600 men) vs. Gen. Shepard (militia) at Springfield Supreme Judicial Court. No fire exchanged; court adjourns. Oct 20: Continental Congress authorizes troops but lacks funds. MA passes Riot Act (arrests without bail). January 1787: The Private Army Jan 4: Gov. Bowdoin authorizes a private militia. Funding: 125 Boston merchants subscribe £6,000. Force: 3,000 mercenaries raised, led by Gen. Benjamin Lincoln. January 25, 1787: Springfield Arsenal (The Climax) Objective: Shays leads ~1,200 men to seize 7,000 muskets/cannons at the federal arsenal. Defense: Gen. Shepard (900 militia) defends the arsenal. The Engagement: Shepard fires artillery warning shots over rebels' heads. Rebels advance. Shepard fires grapeshot directly into the ranks. 
Casualties: 4 rebels dead, 20 wounded. Rebels flee without firing. February – June 1787: The Fallout Feb 4: Gen. Lincoln marches overnight through a blizzard to Petersham, surprising retreating rebels. 150 captured; Shays escapes to Vermont. Spring Election: Gov. Bowdoin is voted out in a landslide; John Hancock elected Governor. June: Hancock issues broad pardons. Legislature enacts debt moratoriums and lowers taxes. 1787 – 1791: Constitutional Impact May 1787: Constitutional Convention convenes; Washington/Madison cite Shays' Rebellion as proof the Articles of Confederation failed. 1788: Anti-Federalists demand a Bill of Rights to check the power of the proposed federal standing army. 1791: Second Amendment ratified. Modern Parallels Narrative: Veterans labeled "insurrectionists" for resisting economic policy. Tactics: Use of private capital to fund state enforcement when tax revenue failed. Legal Precedent: Establishing the "well-regulated militia" as a counter-balance to federal military power.   WLS is Lifestyle Jelly Roll and Gun Rights Jelly Roll wants his gun rights back to hunt after losing them for felonies. Deadpool Unleashed Dead pool Machine Head Introduces 94-Proof Bourbon Whiskey Machine Head has launched Shotgun Blast Whiskey, a 94-proof bourbon designed for fans who enjoy stronger spirits. This product aligns with the band's aggressive identity while remaining accessible as a traditional bourbon. The whiskey emphasizes classic bourbon flavors and is marketed as a lifestyle product, mirroring a trend of music collaborations in the spirits industry. Aaron's Alley Going Ballistic Manhunt Madness: Another Day, Another Gun Control Fail (no summary available) More Giffords Nonsense: Gun Control Before Facts (no summary available) When "Gun Control" Meets Reality: The Brown University Attack Details (no summary available) Gun Control: An Epic Fail at Bondi Beach (no summary available) "Legal Gun Ownership: The Unintended Target of Gun Control Fanatics" (no summary available) When Antique Gun Ownership Becomes a Crime: UK Cops Confiscate 129 Legal Firearms (no summary available) New Jersey's Carry Ban: Lawsuit Showdown or Just Another Dance with Gun Control? (no summary available) Traveling with NFA to get easier? Reviews ⭐⭐⭐⭐⭐ - from TwinDadARguy - Great show, been listening for about 4 or so years. Just heard the convo about Aaron's weird ability to pull interest from the fairer sex. You couldn't come up with a good word for it - I'm here to help. The perfect word is conFAUXdence. You're welcome.   ⭐⭐⭐⭐⭐ - from Devin K - Where is the damn squares button!? Love this show and all the antics that come along with it. Lever action debate that would be fun to listen too. What's your favorite lever action caliber for whitetail hunting? What would be the one you would take if you needed to defend that SSB. #171, #fuckthethumb.   ⭐⭐⭐⭐⭐ - from System AI - A review and comparison to bring us all back to Dungeon Crawler Carl. Let's pair each cast member to a Character from DCC. First, Shawn, obviously he's Carl. He's the main character. He's powerful. He's the reason we are all here. There may or may not be a Cat that led him here. He likely has someone obsessed with his feet and definitely only has heart boxers on behind his desk. Second, Aaron, he's Prepotene. Smart and powerful. Sometimes on the team, sometimes in the way, sometimes nowhere to be seen. Probably rides a Goat. Screams nonsense from time to time. Would be dead without the rest of the team. Third, Jeremy. Jeremy is Quasar. 
Swears constantly Hates the leader/rulers of the galaxy and game. Is there everytime we need him. Will likely be the reason the rest end up in a prison. Fourth, Savage. He's JuiceBox. Extremely smart. AI generated. Self aware. Playing the same game but may have a different motive. Likely to lead to the downfall of the show. Last, Nick. Nick is Samantha. Much more powerful then he's willing to let on. Always growing in power. A very important member to keep the show running. Would be dangerous if all his organs worked correctly. And Shawn has definitely been inside him. These comparisons can not be altered. Debate will result in acceleration. Thanks for your attention to this matter. Signed, Gary/System AI. #nonotes   Before we let you go - Join Gun Owners of America   Tell your friends about the show and get backstage access by joining the Gun Cult at theguncult.com.   No matter how tough your battle is today, we want you here fight with us tomorrow. Don't struggle in silence, you can contact the suicide prevention line by dialing 988 from your phone. Remember - Always prefer Dangerous Freedom over peaceful slavery. We'll see you next time!   Nick - @busbuiltsystems | Bus Built Systems Jeremy - @ret_actual | Rivers Edge Tactical Aaron - @machinegun_moses Savage - @savage1r Shawn - @dangerousfreedomyt | @camorado.cam | Camorado

Behind The Bunker's Podcast
Episode 599: Playing Style?! EP 596

Behind The Bunker's Podcast

Play Episode Listen Later Dec 16, 2025 63:16


The hosts dive into paintball basics for beginners, breaking down the sport into approachable steps. They explain the essential gear (marker/gun, hopper, tank, protective mask) and highlight what newcomers often overlook—like the importance of a well-fitting mask and reliable loader system.

Next, they cover the fundamental rules and game formats: capture the flag, elimination, scenario play. They emphasise safety protocols (never removing your mask on the field, always chronograph your marker to legal FPS, clear communication). They also stress field etiquette—don't move thrown bunkers, call your hits honestly, and respect referees.

They then shift into strategy tips: how to pick your playing style (aggressive front-player vs. back-field support), coordinate with teammates, and use the available bunkers/cover effectively. A good tip: keep your body low, pop out for shots, and always move quickly between cover to avoid being an easy target.

The hosts share some common rookie mistakes—shooting wildly rather than taking aimed bursts, failing to reload/have backup paint, focusing too much on your own play instead of good team positioning. They recommend new players practice first in low-pressure games, watch experienced players, and ask for feedback.

Finally, they talk about choosing your first marker: set budget, reliability, ease of maintenance, and the local field's rental gear—sometimes starting with rentals is a smart move until you know you're invested. They wrap up by encouraging listeners to get out on the field, learn by doing, and enjoy the camaraderie and fun of paintball rather than stressing perfect play.

Help support the free broadcast by donating to our PayPal fundraiser! https://www.paypal.com/ncp/payment/RL... *Behind the Bunker Paintball Podcast* is a long-running weekly show dedicated to everything paintball. Hosted by passionate players and industry veterans, the podcast dives into the latest happenings in the sport, from new gear releases and product reviews to updates on tournaments and events around the world. It has built a loyal audience by combining serious paintball discussion with a lighthearted, entertaining approach that keeps both new players and seasoned veterans engaged.

The Biblical Roots Podcast
Christmas Livestream Part 3 - The Fulfillment of Christmas

The Biblical Roots Podcast

Play Episode Listen Later Dec 16, 2025 70:48


Send us a text. Recorded Dec 13, 2025 - Enjoy episode 3 of our 4-week teaching series, The Biblical Roots of Christmas. This week, we turn from promise to fulfillment, exploring how the hopes of Israel find their “Yes” in Jesus Christ. We'll examine how the birth of Christ fulfills the Law and the Prophets, why the Incarnation stands at the center of God's redemptive plan, and what it means to say that in Jesus, the long-awaited Messiah has finally come.

The Biblical Roots Ministries: Our website | Our YouTube Channel | Prof. Solberg's Blog | Support our Ministry (Thank you!)

What if Christmas felt sacred again? Full of Grace and Truth, the new book from award-winning author R. L. Solberg, invites you to rediscover the biblical story at the heart of the season. Available now in paperback and Kindle, with all proceeds supporting The Biblical Roots Ministries. Get your copy today on Amazon.com.

a16z
Dwarkesh and Ilya Sutskever on What Comes After Scaling

a16z

Play Episode Listen Later Dec 15, 2025 92:09


AI models feel smarter than their real-world impact. They ace benchmarks, yet still struggle with reliability, strange bugs, and shallow generalization. Why is there such a gap between what they can do on paper and in practice? In this episode from The Dwarkesh Podcast, Dwarkesh talks with Ilya Sutskever, cofounder of SSI and former OpenAI chief scientist, about what is actually blocking progress toward AGI. They explore why RL and pretraining scale so differently, why models outperform on evals but underperform in real use, and why human-style generalization remains far ahead. Ilya also discusses value functions, emotions as a built-in reward system, the limits of pretraining, continual learning, superintelligence, and what an AI-driven economy could look like.

Resources:
Transcript: https://www.dwarkesh.com/p/ilya-sutsk...
Apple Podcasts: https://podcasts.apple.com/us/podcast...
Spotify: https://open.spotify.com/episode/7naO...

Stay Updated:
If you enjoyed this episode, be sure to like, subscribe, and share with your friends!
Find a16z on X: https://x.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

The KickASK Podcast
TDC 078: 12 Days in Austin: Business Lessons, Dad Fails, and Why the AI Backlash is Coming

The KickASK Podcast

Play Episode Listen Later Dec 13, 2025 15:11 Transcription Available


TDC 078: 12 Days, 12 Takeaways: Lessons from Austin That Will Change How You Think About Business & Life

The last mile is the one that matters most—and why most entrepreneurs quit right before their breakthrough.

Episode Summary: In this episode of The Digital Contrarian, host Ryan Levesque shares his biggest lesson from each of 12 days spent in Austin, Texas—covering five back-to-back events, dozens of private dinners, and nearly 50 hours of sessions. You'll learn why the coming AI backlash requires immediate brand strategy shifts, discover the 4-step plan for raising multi-millionaire kids, and understand why hosting dinners beats Facebook ads for ROI.

Question of the Day

The Biblical Roots Podcast
Christmas Livestream Part 2 - The Promises of Christmas

The Biblical Roots Podcast

Play Episode Listen Later Dec 8, 2025 76:07


Send us a text. Recorded Dec 6, 2025. The 2nd episode of our 4-week teaching series, "The Biblical Roots of Christmas." This week, we turn to the great storyline of Scripture to examine the promises and prophecies that set the stage for the birth of Christ. From Eden to Abraham to the prophets of Israel, we trace the unfolding hope of a coming Redeemer and explore how the Incarnation fulfills God's ancient covenant promises. Bring your Bibles and your questions, and let's rediscover together how the long-awaited Messiah entered history in the fullness of time.

The Biblical Roots Ministries: Our website | Our YouTube Channel | Prof. Solberg's Blog | Support our Ministry (Thank you!)

What if Christmas felt sacred again? Full of Grace and Truth, the new book from award-winning author R. L. Solberg, invites you to rediscover the biblical story at the heart of the season. Available now in paperback and Kindle, with all proceeds supporting The Biblical Roots Ministries. Get your copy today on Amazon.com.

Lenny's Podcast: Product | Growth | Career
The 100-person AI lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI)

Lenny's Podcast: Product | Growth | Career

Play Episode Listen Later Dec 7, 2025 70:31


Edwin Chen is the founder and CEO of Surge AI, the company that teaches AI what's good vs. what's bad, powering frontier labs with elite data, environments, and evaluations. Surge surpassed $1 billion in revenue with under 100 employees last year, completely bootstrapped—the fastest company in history to reach this milestone. Before founding Surge, Edwin was a research scientist at Google, Facebook, and Twitter and studied mathematics, computer science, and linguistics at MIT.

We discuss:
1. How Surge reached over $1 billion in revenue with fewer than 100 people by obsessing over quality
2. The story behind how Claude Code got so good at coding and writing
3. The problems with AI benchmarks and why they're pushing AI in the wrong direction
4. How RL environments are the next frontier in AI training
5. Why Edwin believes we're still a decade away from AGI
6. Why taste and human judgment shape which AI models become industry leaders
7. His contrarian approach to company building that rejects Silicon Valley's “pivot and blitzscale” playbook
8. How AI models will become increasingly differentiated based on the values of the companies building them

Brought to you by:
• Vanta—Automate compliance. Simplify security.
• WorkOS—Modern identity platform for B2B SaaS, free up to 1 million MAUs
• Coda—The all-in-one collaborative workspace

Transcript: https://www.lennysnewsletter.com/p/surge-ai-edwin-chen
My biggest takeaways (for paid newsletter subscribers): https://www.lennysnewsletter.com/i/180055059/my-biggest-takeaways-from-this-conversation

Where to find Edwin Chen:
• X: https://x.com/echen
• LinkedIn: https://www.linkedin.com/in/edwinzchen
• Surge's blog: https://surgehq.ai/blog

Where to find Lenny:
• Newsletter: https://www.lennysnewsletter.com
• X: https://twitter.com/lennysan
• LinkedIn: https://www.linkedin.com/in/lennyrachitsky/

In this episode, we cover:
(00:00) Introduction to Edwin Chen
(04:48) AI's role in business efficiency
(07:08) Building a contrarian company
(08:55) An explanation of what Surge AI does
(09:36) The importance of high-quality data
(13:31) How Claude Code has stayed ahead
(17:37) Edwin's skepticism toward benchmarks
(21:54) AGI timelines and industry trends
(28:33) The Silicon Valley machine
(33:07) Reinforcement learning and future AI training
(39:37) Understanding model trajectories
(41:11) How models have advanced and will continue to advance
(42:55) Adapting to industry needs
(44:39) Surge's research approach
(48:07) Predictions for the next few years in AI
(50:43) What's underhyped and overhyped in AI
(52:55) The story of founding Surge AI
(01:02:18) Lightning round and final thoughts

Referenced:
• Surge: https://surgehq.ai
• Surge's product page: https://surgehq.ai/products
• Claude Code: https://www.claude.com/product/claude-code
• Gemini 3: https://aistudio.google.com/models/gemini-3
• Sora: https://openai.com/sora
• Terrence Rohan on LinkedIn: https://www.linkedin.com/in/terrencerohan
• Richard Sutton—Father of RL thinks LLMs are a dead end: https://www.dwarkesh.com/p/richard-sutton
• The Bitter Lesson: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
• Reinforcement learning: https://en.wikipedia.org/wiki/Reinforcement_learning
• Grok: https://grok.com
• Warren Buffett on X: https://x.com/WarrenBuffett
• OpenAI's CPO on how AI changes must-have skills, moats, coding, startup playbooks, more | Kevin Weil (CPO at OpenAI, ex-Instagram, Twitter): https://www.lennysnewsletter.com/p/kevin-weil-open-ai
• Anthropic's CPO on what comes next | Mike Krieger (co-founder of Instagram): https://www.lennysnewsletter.com/p/anthropics-cpo-heres-what-comes-next
• Brian Armstrong on LinkedIn: https://www.linkedin.com/in/barmstrong
• Interstellar on Prime Video: https://www.amazon.com/Interstellar-Matthew-McConaughey/dp/B00TU9UFTS
• Arrival on Prime Video: https://www.amazon.com/Arrival-Amy-Adams/dp/B01M2C4NP8
• Travelers on Netflix: https://www.netflix.com/title/80105699
• Waymo: https://waymo.com
• Soda versus pop: https://flowingdata.com/2012/07/09/soda-versus-pop-on-twitter

Recommended books:
• Stories of Your Life and Others: https://www.amazon.com/Stories-Your-Life-Others-Chiang/dp/1101972122
• The Myth of Sisyphus: https://www.amazon.com/Myth-Sisyphus-Vintage-International/dp/0525564454
• Le Ton Beau de Marot: In Praise of the Music of Language: https://www.amazon.com/dp/0465086454
• Gödel, Escher, Bach: An Eternal Golden Braid: https://www.amazon.com/G%C3%B6del-Escher-Bach-Eternal-Golden/dp/0465026567

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@lennyrachitsky.com.
Lenny may be an investor in the companies discussed. To hear more, visit www.lennysnewsletter.com
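The episode's fourth theme, RL environments as the next frontier of training, is easiest to picture as a program that hands an agent an observation, accepts an action, and returns a reward. The sketch below is a minimal, hypothetical text-task environment loosely modeled on the reset/step interface; the task, reward rule, and class name are illustrative assumptions, not anything Surge has described publicly.

```python
# A minimal sketch of an RL "environment" for training language agents.
# Hypothetical example: the task, reward rule, and names are illustrative,
# not a description of Surge AI's actual products.
import random


class ArithmeticEnv:
    """Each episode poses one arithmetic question; the agent replies with text."""

    def reset(self, seed=None):
        rng = random.Random(seed)
        self.a, self.b = rng.randint(1, 99), rng.randint(1, 99)
        observation = f"What is {self.a} + {self.b}?"
        return observation

    def step(self, action: str):
        # Verifiable reward: 1.0 if the reply contains the right answer, else 0.0.
        try:
            correct = int(action.strip()) == self.a + self.b
        except ValueError:
            correct = False
        reward = 1.0 if correct else 0.0
        terminated = True  # single-turn task
        return reward, terminated


if __name__ == "__main__":
    env = ArithmeticEnv()
    prompt = env.reset(seed=0)
    # In real training the action would come from the model being trained.
    reward, _ = env.step("42")
    print(prompt, "->", reward)
```

The point of the sketch is the contract, not the task: harder environments swap in richer observations and graders, but the reset/step/reward shape stays the same.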

The KickASK Podcast
TDC 077: The 4 Types of Driven Entrepreneurs (And Why You're Secretly Seeking Validation)

The KickASK Podcast

Play Episode Listen Later Dec 6, 2025 10:57 Transcription Available


TDC 077: The 4 Types of Driven Entrepreneurs (And Why You're Secretly Seeking Validation)
Four archetypes, one painful realization, and 14 words that could change everything.

Episode Summary:
In this episode of The Digital Contrarian, host Ryan Levesque shares a vulnerable story from Front Row Dads Live about a moment he's not proud of as a father.
You'll learn the difference between your True Self and Strategic Self, discover which of the four archetypal patterns drives your behavior, and uncover how childhood wounds show up in your business today.

Question of the Day

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

From building Medal into a 12M-user game clipping platform with 3.8B highlight moments to turning down a reported $500M offer from OpenAI (https://www.theinformation.com/articles/openai-offered-pay-500-million-startup-videogame-data) and raising a $134M seed from Khosla (https://techcrunch.com/2025/10/16/general-intuition-lands-134m-seed-to-teach-agents-spatial-reasoning-using-video-game-clips/) to spin out General Intuition, Pim is betting that world models trained on peak human gameplay are the next frontier after LLMs.

We sat down with Pim to dig into why game highlights are “episodic memory for simulation” (and how Medal's privacy-first action labels became a world-model goldmine: https://medal.tv/blog/posts/enabling-state-of-the-art-security-and-protections-on-medals-new-apm-and-controller-overlay-features), what it takes to build fully vision-based agents that just see frames and output actions in real time, how General Intuition transfers from games to real-world video and then into robotics, why world models and LLMs are complementary rather than rivals, what founders with proprietary datasets should know before selling or licensing to labs, and his bet that spatial-temporal foundation models will power 80% of future atoms-to-atoms interactions in both simulation and the real world.

We discuss:
How Medal's 3.8B action-labeled highlight clips became a privacy-preserving goldmine for world models
Building fully vision-based agents that only see frames and output actions yet play like (and sometimes better than) humans
Transferring from arcade-style games to realistic games to real-world video using the same perception–action recipe
Why world models need actions, memory, and partial observability (smoke, occlusion, camera shake) vs. “just” pretty video generation
Distilling giant policies into tiny real-time models that still navigate, hide, and peek corners like real players
Pim's path from RuneScape private servers, Tourette's, and reverse engineering to leading a frontier world-model lab
How data-rich founders should think about valuing their datasets, negotiating with big labs, and deciding when to go independent
GI's first customers: replacing brittle behavior trees in games, engines, and controller-based robots with a “frames in, actions out” API
Using Medal clips as “episodic memory of simulation” to move from imitation learning to RL via world models and negative events
The 2030 vision: spatial–temporal foundation models that power the majority of atoms-to-atoms interactions in simulation and the real world

Pim
X: https://x.com/PimDeWitte
LinkedIn: https://www.linkedin.com/in/pimdw/

Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/

Chapters
00:00:00 Introduction and Medal's Gaming Data Advantage
00:02:08 Exclusive Demo: Vision-Based Gaming Agents
00:06:17 Action Prediction and Real-World Video Transfer
00:08:41 World Models: Interactive Video Generation
00:13:42 From Runescape to AI: Pim's Founder Journey
00:16:45 The Research Foundations: Diamond, Genie, and SEMA
00:33:03 Vinod Khosla's Largest Seed Bet Since OpenAI
00:35:04 Data Moats and Why GI Stayed Independent
00:38:42 Self-Teaching AI Fundamentals: The Francois Fleuret Course
00:40:28 Defining World Models vs Video Generation
00:41:52 Why Simulation Complexity Favors World Models
00:43:30 World Labs, Yann LeCun, and the Spatial Intelligence Race
00:50:08 Business Model: APIs, Agents, and Game Developer Partnerships
00:58:57 From Imitation Learning to RL: Making Clips Playable
01:00:15 Open Research, Academic Partnerships, and Hiring
01:02:09 2030 Vision: 80 Percent of Atoms-to-Atoms AI Interactions
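The "frames in, actions out" idea from the episode reduces to a tight loop: capture the current frame, run a vision policy, emit a controller action, repeat. Below is a minimal, hypothetical sketch of that interface in Python; the class names, action space, and the random stand-in policy are assumptions, not General Intuition's actual API or model.

```python
# Minimal sketch of a "frames in, actions out" agent loop.
# Hypothetical: class names, the action space, and the random policy are
# placeholders, not General Intuition's actual API or model.
import random
from dataclasses import dataclass


@dataclass
class Action:
    move_x: float   # left/right stick, -1..1
    move_y: float   # forward/back stick, -1..1
    jump: bool


class VisionPolicy:
    """Maps a raw RGB frame to a controller action (random stand-in here)."""

    def act(self, frame: bytes) -> Action:
        # A real policy would run a neural network on the frame; we just sample.
        return Action(
            move_x=random.uniform(-1, 1),
            move_y=random.uniform(-1, 1),
            jump=random.random() < 0.05,
        )


def run_episode(policy: VisionPolicy, get_frame, apply_action, max_steps: int = 600):
    """Real-time control loop: the agent only ever sees frames."""
    for _ in range(max_steps):
        frame = get_frame()          # e.g. a screen capture from the game
        action = policy.act(frame)   # no game-state access, vision only
        apply_action(action)         # e.g. forwarded to a virtual controller


if __name__ == "__main__":
    # Stub I/O so the sketch runs standalone.
    run_episode(VisionPolicy(), get_frame=lambda: b"", apply_action=lambda a: None)
```

The design choice worth noticing is what is absent: no game-state API, no behavior tree, just pixels in and controller inputs out, which is what makes the same recipe transferable from games to real-world video.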

No Priors: Artificial Intelligence | Machine Learning | Technology | Startups
Scaling Legal AI and Building Next-Generation Law Firms with Harvey Co-Founder and President Gabe Pereyra

No Priors: Artificial Intelligence | Machine Learning | Technology | Startups

Play Episode Listen Later Dec 5, 2025 44:17


In just over three years, Harvey has not only scaled to nearly one thousand customers, including Walmart, PwC, and other giants of the Fortune 500, but fundamentally transformed how legal work is delivered. Sarah Guo and Elad Gil are joined by Harvey's co-founder and president Gabe Pereyra to discuss why the future of legal AI isn't only about individual productivity, but also about putting together complex client matters to make law firms more profitable. They also talk about how Harvey analyzes complex tasks like fund formation or M&A and deploys agents to handle research and drafting, the strategic reasoning behind enabling law firms rather than competing with them, and why AI won't replace partners but will change law firm leverage models and training for associates.

Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @gabepereyra | @Harvey

Chapters:
00:00 – Gabe Pereyra Introduction
00:09 – Introduction to Harvey
02:04 – Expanding Harvey's Reach
03:22 – Understanding Legal Workflows
06:20 – Agentic AI Applications in Law
09:06 – The Future Evolution of Law Firms
13:36 – RL in Law
19:46 – Deploying Harvey and Customization
23:46 – Adoption and Customer Success
25:28 – Why Harvey Isn't Building a Law Firm
27:25 – Challenges and Opportunities in Legal Tech
29:26 – Building a Company During the Rise of Gen AI
37:24 – Hiring at Harvey
40:19 – Future Predictions
44:17 – Conclusion

The KickASK Podcast
TDC 076: Worldviews from Viewers: Real Perspectives On How to Make Sense of this Post-AI World...

The KickASK Podcast

Play Episode Listen Later Nov 29, 2025 8:17 Transcription Available


TDC 076: Worldviews from Viewers: Real Perspectives On How to Make Sense of this Post-AI World...
Worldviews from readers reveal what's really shaping how thoughtful people navigate today's chaos.

Episode Summary
In this special episode of The Digital Contrarian, host Ryan Levesque shares thought-provoking reader responses to last week's worldview challenge.
You'll discover seven diverse principles shaping how people make sense of this moment in history, explore frameworks for navigating complexity, and hear perspectives that might challenge your own assumptions.

Question of the Day

The MAD Podcast with Matt Turck
What's Next for AI? OpenAI's Łukasz Kaiser (Transformer Co-Author)

The MAD Podcast with Matt Turck

Play Episode Listen Later Nov 26, 2025 65:25


We're told that AI progress is slowing down, that pre-training has hit a wall, that scaling laws are running out of road. Yet we're releasing this episode in the middle of a wild couple of weeks that saw GPT-5.1, GPT-5.1 Codex Max, fresh reasoning modes and long-running agents ship from OpenAI — on top of a flood of new frontier models elsewhere. To make sense of what's actually happening at the edge of the field, I sat down with someone who has literally helped define both of the major AI paradigms of our time.

Łukasz Kaiser is one of the co-authors of “Attention Is All You Need,” the paper that introduced the Transformer architecture behind modern LLMs, and is now a leading research scientist at OpenAI working on reasoning models like those behind GPT-5.1. In this conversation, he explains why AI progress still looks like a smooth exponential curve from inside the labs, why pre-training is very much alive even as reinforcement-learning-based reasoning models take over the spotlight, how chain-of-thought actually works under the hood, and what it really means to “train the thinking process” with RL on verifiable domains like math, code and science. We talk about the messy reality of low-hanging fruit in engineering and data, the economics of GPUs and distillation, interpretability work on circuits and sparsity, and why the best frontier models can still be stumped by a logic puzzle from his five-year-old's math book.

We also go deep into Łukasz's personal journey — from logic and games in Poland and France, to Ray Kurzweil's team, Google Brain and the inside story of the Transformer, to joining OpenAI and helping drive the shift from chatbots to genuine reasoning engines. Along the way we cover GPT-4 → GPT-5 → GPT-5.1, post-training and tone, GPT-5.1 Codex Max and long-running coding agents with compaction, alternative architectures beyond Transformers, whether foundation models will “eat” most agents and applications, what the translation industry can teach us about trust and human-in-the-loop, and why he thinks generalization, multimodal reasoning and robots in the home are where some of the most interesting challenges still lie.

OpenAI
Website - https://openai.com
X/Twitter - https://x.com/OpenAI

Łukasz Kaiser
LinkedIn - https://www.linkedin.com/in/lukaszkaiser/
X/Twitter - https://x.com/lukaszkaiser

FIRSTMARK
Website - https://firstmark.com
X/Twitter - https://twitter.com/FirstMarkCap

Matt Turck (Managing Director)
Blog - https://mattturck.com
LinkedIn - https://www.linkedin.com/in/turck/
X/Twitter - https://twitter.com/mattturck

(00:00) – Cold open and intro
(01:29) – “AI slowdown” vs a wild week of new frontier models
(08:03) – Low-hanging fruit: infra, RL training and better data
(11:39) – What is a reasoning model, in plain language?
(17:02) – Chain-of-thought and training the thinking process with RL
(21:39) – Łukasz's path: from logic and France to Google and Kurzweil
(24:20) – Inside the Transformer story and what “attention” really means
(28:42) – From Google Brain to OpenAI: culture, scale and GPUs
(32:49) – What's next for pre-training, GPUs and distillation
(37:29) – Can we still understand these models? Circuits, sparsity and black boxes
(39:42) – GPT-4 → GPT-5 → GPT-5.1: what actually changed
(42:40) – Post-training, safety and teaching GPT-5.1 different tones
(46:16) – How long should GPT-5.1 think? Reasoning tokens and jagged abilities
(47:43) – The five-year-old's dot puzzle that still breaks frontier models
(52:22) – Generalization, child-like learning and whether reasoning is enough
(53:48) – Beyond Transformers: ARC, LeCun's ideas and multimodal bottlenecks
(56:10) – GPT-5.1 Codex Max, long-running agents and compaction
(1:00:06) – Will foundation models eat most apps? The translation analogy and trust
(1:02:34) – What still needs to be solved, and where AI might go next
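"Training the thinking process" with RL on verifiable domains comes down to a simple contract: sample a chain of thought plus an answer, check the answer with something a program can verify, and reinforce the samples that pass. The sketch below shows that reward function and a naive best-of-n selection step in Python; the model call is mocked and the names are assumptions, not OpenAI's training code.

```python
# Minimal sketch of a verifiable reward plus best-of-n sampling, the core
# ingredients of RL on checkable domains like math. The model call is mocked;
# nothing here reflects OpenAI's actual training stack.
import random
import re


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Reward 1.0 if the last number in the completion matches the known answer."""
    numbers = re.findall(r"-?\d+", completion)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0


def mock_model(prompt: str) -> str:
    # Stand-in for sampling a chain of thought from a policy model.
    guess = random.choice(["27", "28", "29"])
    return f"Let me think step by step... the answer is {guess}"


def best_of_n(prompt: str, ground_truth: str, n: int = 8):
    """Sample n completions, score each with the verifier, keep the best.

    In actual RL training (e.g. policy-gradient methods), the scored samples
    would be used to update the model; here we only select among them.
    """
    samples = [mock_model(prompt) for _ in range(n)]
    scored = [(verifiable_reward(s, ground_truth), s) for s in samples]
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    reward, best = best_of_n("What is 13 + 15?", ground_truth="28")
    print(reward, best)
```

The verifier is the whole trick: math, code, and science admit cheap programmatic checks, which is why those domains are where RL-trained reasoning has moved fastest.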

The Lunar Society
Ilya Sutskever – We're moving from the age of scaling to the age of research

The Lunar Society

Play Episode Listen Later Nov 25, 2025 96:03


Ilya & I discuss SSI's strategy, the problems with pre-training, how to improve the generalization of AI models, and how to ensure AGI goes well.
Watch on YouTube; read the transcript.

Sponsors
* Gemini 3 is the first model I've used that can find connections I haven't anticipated. I recently wrote a blog post on RL's information efficiency, and Gemini 3 helped me think it all through. It also generated the relevant charts and ran toy ML experiments for me with zero bugs. Try Gemini 3 today at gemini.google
* Labelbox helped me create a tool to transcribe our episodes! I've struggled with transcription in the past because I don't just want verbatim transcripts, I want transcripts reworded to read like essays. Labelbox helped me generate the exact data I needed for this. If you want to learn how Labelbox can help you (or if you want to try out the transcriber tool yourself), go to labelbox.com/dwarkesh
* Sardine is an AI risk management platform that brings together thousands of device, behavior, and identity signals to help you assess a user's risk of fraud & abuse. Sardine also offers a suite of agents to automate investigations so that as fraudsters use AI to scale their attacks, you can use AI to scale your defenses. Learn more at sardine.ai/dwarkesh

To sponsor a future episode, visit dwarkesh.com/advertise.

Timestamps
(00:00:00) – Explaining model jaggedness
(00:09:39) – Emotions and value functions
(00:18:49) – What are we scaling?
(00:25:13) – Why humans generalize better than models
(00:35:45) – SSI's plan to straight-shot superintelligence
(00:46:47) – SSI's model will learn from deployment
(00:55:07) – How to think about powerful AGIs
(01:18:13) – “We are squarely an age of research company”
(01:20:23) – Self-play and multi-agent
(01:32:42) – Research taste

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Sirens of the Supernatural
Crossing Paths with the Devil

Sirens of the Supernatural

Play Episode Listen Later Nov 24, 2025 64:19


A young guitarist disappears for months—and returns playing like no human ever could. They say Robert Johnson met the Devil at a lonely Mississippi crossroads—trading his soul for the sound that birthed the blues. But what really happened that night? Was it a deal, a myth, or something darker still? Join us as we journey into the Delta, where music, magic, and the supernatural collide.

SOURCES (for show notes)
https://www.openculture.com/2020/10/the-legend-of-how-bluesman-robert-johnson-sold-his-soul-to-the-devil-at-the-crossroads.html
https://entertainment.howstuffworks.com/devil-and-robert-johnson.htm?utm_source=chatgpt.com
https://nashvilleghosts.com/the-crossroads-the-king-of-delta-blues-the-devil/?utm_source=chatgpt.com
https://www.thevintagenews.com/2018/04/05/crossroads/?utm_source=chatgpt.com
https://genius.com/artists/Robert-johnson
https://www.britannica.com/biography/Robert-Johnson-American-musician
https://blackpast.org/african-american-history/johnson-robert-1911-1938/
https://www.vialma.com/en/articles/266/Niccolo-Paganini-The-Devils-Violinist
https://www.gutenberg.org/files/14591/14591-h/14591-h.htm

Biographies and historical accounts
Up Jumped the Devil: The Real Life of Robert Johnson by Bruce Conforth and Gayle Dean Wardlow: A comprehensive look at the legendary bluesman's life.
Searching for Robert Johnson by Peter Guralnick: Explores the myth and reality of Johnson's life and career.
Escaping the Delta: Robert Johnson and the Invention of the Blues by Elijah Wald: Analyzes Johnson's music and its impact on the blues genre.
Biography of a Phantom: A Robert Johnson Blues Odyssey by Robert Mack McCormick: A biographical exploration of Johnson's life.
Robert Johnson: Lost and Found by Barry Lee Pearson: A scholarly account that delves into the details of Johnson's life.

Personal memoirs and graphic novels
Brother Robert: Growing Up with Robert Johnson by Annye C. Anderson: A firsthand account of Johnson's life from his niece's perspective.
Love in Vain: Robert Johnson, 1911–1938 by Mezzo and J.M. Dupont: A graphic novel that tells the story of Johnson's life through illustrations.
RL's Dream by Walter Mosley: A fictional novel inspired by the legend of Robert Johnson

The KickASK Podcast
TDC 075: How To Craft Your Worldview With AI

The KickASK Podcast

Play Episode Listen Later Nov 22, 2025 19:30 Transcription Available


TDC 075: How To Craft Your Worldview With AI
What I learned at Tony Robbins and Dean Graziosi's $250,000 private mastermind this week.

Episode Summary
In this episode of The Digital Contrarian, host Ryan Levesque dives into building a comprehensive worldview and why it's your hidden operating system.
You'll learn how to surface your existing beliefs, discover the three levels of reality that shape decisions, and explore a six-step AI-assisted process for crafting worldviews that drive real results.

Question of the Day

The MAD Podcast with Matt Turck
Open Source AI Strikes Back — Inside Ai2's OLMo 3 'Thinking'

The MAD Podcast with Matt Turck

Play Episode Listen Later Nov 20, 2025 88:10


In this special release episode, Matt sits down with Nathan Lambert and Luca Soldaini from Ai2 (the Allen Institute for AI) to break down one of the biggest open-source AI drops of the year: OLMo 3. At a moment when most labs are offering “open weights” and calling it a day, AI2 is doing the opposite — publishing the models, the data, the recipes, and every intermediate checkpoint that shows how the system was built. It's an unusually transparent look into the inner machinery of a modern frontier-class model.

Nathan and Luca walk us through the full pipeline — from pre-training and mid-training to long-context extension, SFT, preference tuning, and RLVR. They also explain what a thinking model actually is, why reasoning models have exploded in 2025, and how distillation from DeepSeek and Qwen reasoning models works in practice. If you've been trying to truly understand the “RL + reasoning” era of LLMs, this is the clearest explanation you'll hear.

We widen the lens to the global picture: why Meta's retreat from open source created a “vacuum of influence,” how Chinese labs like Qwen, DeepSeek, Kimi, and Moonshot surged into that gap, and why so many U.S. companies are quietly building on Chinese open models today. Nathan and Luca offer a grounded, insider view of whether America can mount an effective open-source response — and what that response needs to look like.

Finally, we talk about where AI is actually heading. Not the hype, not the doom — but the messy engineering reality behind modern model training, the complexity tax that slows progress, and why the transformation between now and 2030 may be dramatic without ever delivering a single “AGI moment.” If you care about the future of open models and the global AI landscape, this is an essential conversation.

Allen Institute for AI (AI2)
Website - https://allenai.org
X/Twitter - https://x.com/allen_ai

Nathan Lambert
Blog - https://www.interconnects.ai
LinkedIn - https://www.linkedin.com/in/natolambert/
X/Twitter - https://x.com/natolambert

Luca Soldaini
Blog - https://soldaini.net
LinkedIn - https://www.linkedin.com/in/soldni/
X/Twitter - https://x.com/soldni

FIRSTMARK
Website - https://firstmark.com
X/Twitter - https://twitter.com/FirstMarkCap

Matt Turck (Managing Director)
Blog - https://mattturck.com
LinkedIn - https://www.linkedin.com/in/turck/
X/Twitter - https://twitter.com/mattturck

(00:00) – Cold Open
(00:39) – Welcome & today's big announcement
(01:18) – Introducing the Olmo 3 model family
(02:07) – What “base models” really are (and why they matter)
(05:51) – Dolma 3: the data behind Olmo 3
(08:06) – Performance vs Qwen, Gemma, DeepSeek
(10:28) – What true open source means (and why it's rare)
(12:51) – Intermediate checkpoints, transparency, and why AI2 publishes everything
(16:37) – Why Qwen is everywhere (including U.S. startups)
(18:31) – Why Chinese labs go open source (and why U.S. labs don't)
(20:28) – Inside ATOM: the U.S. response to China's model surge
(22:13) – The rise of “thinking models” and inference-time scaling
(35:58) – The full Olmo pipeline, explained simply
(46:52) – Pre-training: data, scale, and avoiding catastrophic spikes
(50:27) – Mid-training (tail patching) and avoiding test leakage
(52:06) – Why long-context training matters
(55:28) – SFT: building the foundation for reasoning
(1:04:53) – Preference tuning & why DPO still works
(1:10:51) – The hard part: RLVR, long reasoning chains, and infrastructure pain
(1:13:59) – Why RL is so technically brutal
(1:18:17) – Complexity tax vs AGI hype
(1:21:58) – How everyone can contribute to the future of AI
(1:27:26) – Closing thoughts
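Since the pipeline walk-through leans on preference tuning with DPO, it helps to see the objective itself: push the policy's log-probability margin for the chosen response over the rejected one, relative to a frozen reference model. Below is a minimal PyTorch-style sketch of that loss under stated assumptions (per-response log-probs are already summed; tensor names are illustrative and this is not the Olmo 3 training code).

```python
# Minimal sketch of the DPO (Direct Preference Optimization) loss.
# Assumes you already have summed log-probs of chosen/rejected responses
# under the policy and a frozen reference model. Illustrative only; this is
# not the Olmo 3 training code.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # [batch] log p_policy(chosen | prompt)
    policy_rejected_logps: torch.Tensor,  # [batch] log p_policy(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # [batch] log p_ref(chosen | prompt)
    ref_rejected_logps: torch.Tensor,     # [batch] log p_ref(rejected | prompt)
    beta: float = 0.1,
) -> torch.Tensor:
    # How much more the policy prefers "chosen" than the reference does,
    # minus the same quantity for "rejected".
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    # Maximize the preference margin: standard negative log-sigmoid loss.
    return -F.logsigmoid(logits).mean()


if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(float(loss))
```

The appeal discussed in the episode is operational: unlike RLVR, DPO needs no reward model or rollout infrastructure, just pairs of preferred and rejected responses, which is why it remains a workhorse stage in open pipelines.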

The KickASK Podcast
TDC 074: The First Trillion Dollar Thought Leader: Being Known for How You Think, Not What You Consume

The KickASK Podcast

Play Episode Listen Later Nov 15, 2025 9:50 Transcription Available


TDC 074: The First Trillion Dollar Thought Leader: Being Known for How You Think, Not What You Consume
Why being known for how you think beats influence every time.

Episode Summary
In this episode of The Digital Contrarian, host Ryan Levesque dives into the critical distinction between influencers and thought leaders in the AI era.
You'll learn why chasing followers is the wrong game, how thought leadership transforms ideas into equity, and discover the unsexy immediate next step to start building your own trillion-dollar personal brand.

Question of the Day

a16z
Marc Andreessen and Amjad Masad: English As the New Programming Language

a16z

Play Episode Listen Later Oct 23, 2025 71:38


Amjad Masad, founder and CEO of Replit, joins a16z's Marc Andreessen and Erik Torenberg to discuss the new world of AI agents, the future of programming, and how software itself is beginning to build software.

They trace the history of computing to the rise of AI agents that can now plan, reason, and code for hours without breaking, and explore how Replit is making it possible for anyone to create complex applications in natural language. Amjad explains how RL unlocked reasoning for modern models, why verification loops changed everything, whether LLMs are hitting diminishing returns — and if “good enough” AI might actually block progress toward true general intelligence.

Resources:
Follow Amjad on X: https://x.com/amasad
Follow Marc on X: https://x.com/pmarca
Follow Erik on X: https://x.com/eriktorenberg

Stay Updated:
If you enjoyed this episode, be sure to like, subscribe, and share with your friends!
Find a16z on X: https://x.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
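The "verification loop" Amjad describes is, at its simplest, generate, check, retry: propose code, run it against a check, and feed failures back into the next attempt. Below is a hypothetical minimal loop in Python; the mocked generator and the toy single-assertion test stand in for a real code model and a real test suite.

```python
# Minimal sketch of a generate-verify-retry loop, the shape of the
# "verification loops" discussed in the episode. The generator is mocked;
# a real system would call a code model and run an actual test suite.
from typing import Callable


def verify(candidate_src: str) -> bool:
    """Toy verifier: exec the candidate and check one test case."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)      # run the generated code
        return namespace["add"](2, 3) == 5  # the "test suite"
    except Exception:
        return False


def generate(attempt: int) -> str:
    # Stand-in for a code model; the first attempt is deliberately wrong.
    if attempt == 0:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def generate_until_verified(gen: Callable[[int], str], max_attempts: int = 5) -> str:
    for attempt in range(max_attempts):
        candidate = gen(attempt)
        if verify(candidate):
            return candidate
        # In a real loop the failure signal would be fed back into the prompt.
    raise RuntimeError("no verified candidate within budget")


if __name__ == "__main__":
    print(generate_until_verified(generate))
```

The same loop is what lets agents run for hours without breaking: as long as the verifier is trustworthy, wrong attempts cost only compute, not correctness.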