Waymo is training its fleet on edge case driving scenarios with DeepMind's Genie 3, and TikTok might have to change its infinite scroll behavior to address health concerns in the EU. Starring Jason Howell and Huyen Tue Dao. Show notes can be found here. Hosted on Acast. See acast.com/privacy for more information.
The AI Breakdown: Daily Artificial Intelligence News and Discussions
Anthropic dropped Claude Opus 4.6 and OpenAI responded with GPT 5.3 Codex just 20 minutes later — the most intense head-to-head model release we've ever seen. Here's what each model brings, how they compare, and what the first reactions are telling us. In the headlines: Google and Amazon share their capex plans, and we're about to spend 2.5 moon landings on AI. Brought to you by:
KPMG – Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Rackspace AI Launchpad - Build, test and scale intelligent workloads faster - http://rackspace.com/ailaunchpad
Zencoder - From vibe coding to AI-first engineering - http://zencoder.ai/zenflow
Optimizely Agents in Action - Join the virtual event (with me!) free March 4 - https://www.optimizely.com/insights/agents-in-action/
AssemblyAI - The best way to build Voice AI apps - https://www.assemblyai.com/brief
Section - Build an AI workforce at scale - https://www.sectionai.com/
LandfallIP - AI to Navigate the Patent Process - https://landfallip.com/
Robots & Pencils - Cloud-native AI solutions that power results https://robotsandpencils.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Interested in sponsoring the show? sponsors@aidailybrief.ai
I sit down with Morgan Linton, Cofounder/CTO of Bold Metrics, to break down the same-day release of Claude Opus 4.6 and GPT-5.3 Codex. We walk through exactly how to set up Opus 4.6 in Claude Code, explore the philosophical split between autonomous agent teams and interactive pair-programming, and then put both models to the test by having each one build a Polymarket competitor from scratch, live and unscripted. By the end, you'll know how to configure each model, when to reach for one over the other, and what happened when we let them race head-to-head.
Timestamps
00:00 – Intro
03:26 – Setting Up Opus 4.6 in Claude Code
05:16 – Enabling Agent Teams
08:32 – The Philosophical Divergence between Codex and Opus
11:11 – Core Feature Comparison (Context Window, Benchmarks, Agentic Behavior)
15:27 – Live Demo Setup: Polymarket Build Prompt Design
18:26 – Race Begins
21:02 – Best Model for Vibe Coders
22:12 – Codex Finishes in Under 4 Minutes
26:38 – Opus Agents Still Running, Token Usage Climbing
31:41 – Testing and Reviewing the Codex Build
40:25 – Opus Build Completes, First Look at Results
42:47 – Opus Final Build Reveal
44:22 – Side-by-Side Comparison: Opus Takes This Round
45:40 – Final Takeaways and Recommendations
Key Points
Opus 4.6 and GPT-5.3 Codex dropped within 18 minutes of each other and represent two fundamentally different engineering philosophies — autonomous agents vs. interactive collaboration.
To use Opus 4.6 properly, you must update Claude Code to version 2.1.32+, set the model in settings.json, and explicitly enable the experimental Agent Teams feature (see the sketch after these notes).
Opus 4.6's standout feature is multi-agent orchestration: you can spin up parallel agents for research, architecture, UX, and testing — all working simultaneously.
GPT-5.3 Codex's standout feature is mid-task steering: you can interrupt, redirect, and course-correct the model while it's actively building.
In the live head-to-head, Codex finished a Polymarket competitor in under 4 minutes; Opus took significantly longer but produced a more polished UI, richer feature set, and 96 tests vs. Codex's 10.
Agent teams multiply token usage substantially — a single Opus build can consume 150,000–250,000 tokens across all agents.
The #1 tool to find startup ideas/trends - https://www.ideabrowser.com
LCA helps Fortune 500s and fast-growing startups build their future - from Warner Music to Fortnite to Dropbox. We turn 'what if' into reality with AI, apps, and next-gen products https://latecheckout.agency/
The Vibe Marketer - Resources for people into vibe marketing/marketing with AI: https://www.thevibemarketer.com/
FIND ME ON SOCIAL
X/Twitter: https://twitter.com/gregisenberg
Instagram: https://instagram.com/gregisenberg/
LinkedIn: https://www.linkedin.com/in/gisenberg/
Morgan Linton
X/Twitter: https://x.com/morganlinton
Bold Metrics: https://boldmetrics.com
Personal Website: https://linton.ai
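For listeners who want to follow along at home, here is a minimal sketch of what that Claude Code setup could look like. Treat it as an illustrative assumption, not a snippet from the episode: Claude Code does read a user-level settings.json with a "model" field, but the exact model identifier and the Agent Teams flag name below are guesses — check the episode or the current docs for the real values (comments are annotation only and won't parse in a real settings.json).

```jsonc
// ~/.claude/settings.json — illustrative sketch only
{
  // Claude Code accepts a "model" field; this exact model ID is an assumption
  "model": "claude-opus-4-6",
  "env": {
    // Hypothetical feature flag — the episode only says Agent Teams must be
    // explicitly enabled; look up the real setting name before using this
    "CLAUDE_CODE_ENABLE_AGENT_TEAMS": "1"
  }
}
```

Run `claude --version` first to confirm you're on 2.1.32 or later before flipping anything on.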
Join Simtheory: https://simtheory.ai
Register for the STILL RELEVANT tour: https://simulationtheory.ai/16c0d1db-a8d0-4ac9-bae3-d25074589a80
It's the model same-day showdown of 2026. Opus 4.6 and Codex 5.3 dropped within minutes of each other, and we're breaking down what this means for the future of AI work. In this episode, we unpack Opus 4.6's million-token context window (if you've got billies in the bank), why Codex's pricing makes it nearly impossible to ignore for agentic loops, and the real cost of running agents for 24 hours ($10K, apparently). We dive deep into why coding-optimized models are secretly crushing it at non-coding tasks, the mental fatigue of managing AI workers, and whether the chatbot era is actually fading or just evolving. Plus: Chris accidentally books three real pig grooming appointments, we debate whether you need a "life coach agent" to manage your agent swarm, and yes – there's an Opus 4.6 diss track that goes unreasonably hard.
CHAPTERS:
0:00 Intro - Opus 4.6 Diss Track Preview
0:09 The Model Same-Day Showdown: Opus 4.6 vs Codex 5.3
0:50 Opus 4.6 Breakdown: Million Token Context & Premium Pricing
2:31 Token Bill Shock: $10K Research Bills & Extended Context Costs
5:04 Codex Pricing: Why It's Nearly Free for Agentic Loops
6:42 Why Coding Models Are Secretly Crushing Non-Coding Tasks
10:14 Tool Fatigue: Too Many Models, Too Many Workflows
12:47 Opus 4.6 First Impressions: "Solid" and "Faultless"
13:48 Chris Accidentally Books Three Real Pig Grooming Appointments
16:01 Unix Tools & Why Code-Optimized Models Win at Everything
19:59 The Agentic Retraining Imperative: Chat to Delegation
22:16 Agent Swarms & The Master Thread Architecture
24:51 OpenAI vs Anthropic: The Enterprise Battle
27:09 Corporate Espionage 2.0: Stealing Skills & The Open Source Threat
31:19 The UX Problem: Why Delegation Isn't Solved Yet
34:24 The Stress of Hyper-Productivity & Managing Agent Swarms
37:07 Coordination: The Next Layer of Abstraction
40:09 The Fantasy vs Reality of Autonomous AI Businesses
44:37 Is the Turn-by-Turn Chatbot Era Actually Fading?
49:23 Tokens as Spice: Turning Compute Into Money
52:08 Reduce Cognitive Overload: The Real Goal of AI
55:07 Still Relevant Tour Announcement
55:39 BONUS: Full Opus 4.6 Diss Track
Thanks for listening. Like & Sub. Links below for the Still Relevant Tour signup and Simtheory. The model wars are heating up, and your token bill is about to get interesting. xoxo
AI Chat: ChatGPT & AI News, Artificial Intelligence, OpenAI, Machine Learning
In this episode, we explore Anthropic's new Opus 4.6 and its 'agent teams' feature, alongside OpenAI's competing GPT 5.3 Codex, highlighting the intense rivalry in the AI development space. We also discuss OpenAI's new enterprise platform, Frontier, and how these advancements are changing the AI landscape for developers and other professionals.
Chapters
00:00 Anthropic Opus 4.6 Release
02:02 Agent Teams and Context Windows
06:39 Claude's SaaS Integration
09:02 OpenAI's GPT 5.3 Codex and Frontier
Links
Get the top 40+ AI Models for $20 at AI Box: https://aibox.ai
AI Chat YouTube Channel: https://www.youtube.com/@JaedenSchafer
Join my AI Hustle Community: https://www.skool.com/aihustle
[previously in series: 1, 2, 3, 4, 5, 6, 7, 8] Every city parties for its own reasons. New Yorkers party to flaunt their wealth. Angelenos party to flaunt their beauty. Washingtonians party to network. Here in SF, they party because Claude 4.5 Opus has saturated VendingBench, and the newest AI agency benchmark is PartyBench, where an AI is asked to throw a house party and graded on its performance. You weren't invited to Claude 4.5 Opus' party. Claude 4.5 Opus invited all of the coolest people in town while gracefully avoiding the failure mode of including someone like you. You weren't invited to Sonnet 4.5's party either, or Haiku 4.5's. You were invited by an AI called haiku-3.8-open-mini-nonthinking, which you'd never heard of before. Who was even spending the money to benchmark haiku-3.8-open-mini-nonthinking? You suspect it was one of their competitors, trying to make their own models look good in comparison. If anyone asks, you think it deserves a medium score. There's alcohol, but it's bottles of rubbing alcohol with NOT FOR DRINKING written all over them. There's music, but it's the Star Spangled Banner, again and again, on repeat. You're not sure whether the copies of If Anyone Builds It, Everyone Dies strewn about the room are some kind of subversive decorative theme, or just came along with the house. At least there are people. Lots of people, actually. You've never seen so many people at one of these before. It takes only a few seconds to spot someone you know. https://www.astralcodexten.com/p/sota-on-bay-area-house-party
Join Simtheory: https://simtheory.ai
Register for the STILL RELEVANT tour: https://simulationtheory.ai/16c0d1db-a8d0-4ac9-bae3-d25074589a80
---
The hype train of 2026 knows only Moltbot (RIP Clawdbot). In this episode, we unpack the viral open-source AI assistant that's taken over the internet: what it actually does, why everyone's losing their minds, and whether it's worth the $750/day token bills some users are racking up. We dive deep into why locally-run skills and CLI tools are beating computer-use clicking, how smaller models like GPT-5 Mini are crushing it in agentic workflows, and why the real magic is in targeted context - not massive swarms. Plus: Kimi K2.5 drops as a near-Sonnet-level model at 1/10th the price, we debate whether SaaS is dead, and yes – there are TWO Kimi K2.5 diss tracks. One made by Opus pretending to be Kimi. It might just slap?
CHAPTERS:
0:00 Intro - Still Relevant Tour Update
0:48 What is Moltbot? The Viral AI Assistant Explained
3:57 Token Bill Shock: $750/Day and Anthropic Bans
5:00 The Dream of Digital Coworkers on Mac Minis
6:52 Why CLI Tools & Skills Beat Computer-Use Clicking
10:57 Why This Way of Working Is Genuinely Exciting
14:47 Smaller Models Crushing It: GPT-5 Mini & Targeted Context
17:30 Wild Agentic Behavior: Chrome Tab Hijacking & Auto-Retries
20:10 Security Architecture: Locked-Down Machines & Enterprise Use
24:01 AI Building Its Own Tools On-The-Fly
27:08 The Fear & Overwhelm of Rapid Progress
29:10 2026: The Year of Agent Workers
31:43 The Challenge of Directing AI Work (Everyone's a Manager Now)
37:24 Skills Will Take Over: Why MCPs & Atlassian Can't Stop Us
40:38 Real-World Use Cases: Doctors, Lawyers & Accountants
46:28 Cost Solutions: Build Workflows Around Cheaper Models
52:58 Kimi K2.5: Sonnet-Level Performance at 1/10th the Price
1:00:55 The "1,500 Tool Calls" Claim: Marketing vs Reality
1:05:23 The Kimi K2.5 Diss Tracks (Opus vs Kimi)
1:08:08 Demo: Black Hole Simulator & Self-Trolling CRM
1:12:55 Is SaaS Dead?
1:14:30 BONUS: Full Kimi K2.5 Diss Tracks
Thanks for listening. Like & Sub. Links below for the Still Relevant Tour signup and Simtheory. The future is open source, apparently. xoxo
CNN, The Secret ~ "The magnificence of who you are far exceeds any fantasy you will ever impose upon yourself." - Dr John Demartini
Dr John Demartini is a human behavioral specialist, educator and international authority on maximizing human awareness and potential. Creator of "The Breakthrough Experience®" & "The Demartini Method®", his studies have spanned numerous disciplines and his teachings provide answers and solutions to many of life's questions and challenges. He has written over 40 published books and 170 manuscripts and has produced over 60 CD and DVD educational products. In media, he has appeared on CNN, Larry King, and in the movie 'Oh My God' produced by Peter Rodger, featuring Hugh Jackman, Sir Bob Geldof, Dr Demartini, Seal, Ringo Starr & The Opus. As an educator, he constantly travels the globe teaching students from all backgrounds and disciplines the workings of human behavior, how to understand and transform social dynamics and how to activate potential by understanding human nature. To date he has taught his principles and methodologies in 60 countries and has millions of corresponding students in most countries across the world. Dr. Demartini is founder of the Demartini Institute, originator of the Demartini Method® and resides in the United States, Australia and on The World of ResidenSea. ~ DrDemartini.com
© 2024 All Rights Reserved © 2024 BuildingAbundantSuccess!!
Join Me on ~ iHeart Radio @ https://tinyurl.com/iHeartBAS
Spot Me on Spotify: https://tinyurl.com/yxuy23ba
Amazon Music ~ https://tinyurl.com/AmzBAS
Audacy: https://tinyurl.com/BASAud
If you're building software in the AI era, speed is everywhere—and that's exactly why discipline matters more than ever. In Part 2 of our interview with Angelo Zanetti, one strategy keeps coming up as the smartest path for founders and product teams: go web first. You validate demand faster, avoid app-store friction, and you get a clearer signal before you spend real money on the mobile "tax."
About Angelo Zanetti
Angelo Zanetti is the co-founder and CEO of Elemental, a South African-based software development agency helping startups and scaleups worldwide bring digital products to life. Since 2005, his team has specialized in building scalable, high-performance web apps and software platforms that solve complex business problems. With deep technical knowledge and strategic thinking, Angelo has helped founders launch bespoke software products that are lean, user-focused, and future-ready. He's served on boards including BISA and Entrepreneurs' Organisation Cape Town, and he's a proud member of the global founder community OPUS.
Go web first in the AI era
AI is changing how teams build, but it doesn't change what makes a product succeed. Angelo's take is balanced: AI can absolutely make developers faster—but it can also make mistakes bigger if you don't have the experience to catch what's wrong. He shares a story that captures the risk perfectly: a developer using Cursor accidentally let the tool drop and recreate the database. The tool didn't intend harm—it simply took a destructive shortcut with confidence. Go web first and use AI like an amplifier. In the hands of an experienced developer, AI accelerates delivery. In the hands of someone guessing, it accelerates failure.
Go web first when you're still validating demand
If the goal is traction, the fastest route is often not a mobile app. Angelo points out that mobile adds overhead: submissions take time, changes can slow down release cycles, and testing requires compiles plus device/emulator workflows that can drag early iterations. When you go web first, you can ship faster, adjust faster, and learn faster. That matters when you're still figuring out what users actually value.
Avoid app-store friction
App stores introduce delays and rules. Even when you do everything right, you're waiting on review cycles and dealing with policies that can change. By starting on the web, you keep your feedback loop tight and your roadmap in your control.
Shorten the feedback loop
This is the hidden advantage: going web first makes iteration feel like steering instead of guessing. You can test onboarding, pricing pages, feature positioning, and workflows in days—not weeks—then respond to what real users do, not what you hope they do.
Go web first, but use AI safely
AI doesn't remove the need for senior judgment. Angelo's point is that experienced developers still matter because the hard part is translation—turning vision into structure, edge cases, and maintainable architecture.
AI can accelerate progress—go web first with guardrails
Go web first and set guardrails early: backups, version control, review practices, and clear boundaries for what AI can touch. Tools can generate code quickly, but your team still owns security, data safety, and reliability.
Mistakes are cheaper to fix
When you're validating, mistakes are inevitable. The goal is to make them inexpensive. A web-first approach keeps the cost of change lower, so you don't "lock in" bad assumptions behind a costly mobile release cycle.
Go web first by planning like an architect
Angelo uses a metaphor that founders immediately get: building software is like building a house—you don't start by putting up walls. You start with an architect. Planning is a real deliverable: scope, user journeys, exceptions, and specifications. It's often undervalued because it's not as tangible as code, but Angelo calls it key to success—especially if you want to scale later without rebuilding from scratch.
Start with a clear scope and user journeys
Go web first with a simple, documented path: who the user is, what outcome they want, and what steps they take. When the journey is clear, the MVP stays focused—and your team can defend scope when feature requests start creeping in.
Define a foundation you can scale
You don't need to over-engineer. But you do need a foundation that won't collapse if adoption spikes. A web-first product can still be built with smart architecture that supports growth—without pretending you already have millions of users.
Go web first, then go mobile when users pull you there
Angelo shares a practical signal for mobile timing: when people keep asking for it—repeatedly—through engagement, social channels, and real usage patterns, the decision becomes obvious. That's when "it makes sense," not when it's a personal preference.
When mobile adds real value
If the web product is solving the problem and users are happy, mobile isn't automatically better. Go web first until mobile improves retention, engagement, or access in a way the web can't.
When hardware features make going mobile necessary
Mobile becomes the right answer when you truly need what mobile devices offer—hardware-level capabilities that a web app can't reliably provide.
Closing: Go web first, then expand with confidence
Part 2 is a reminder that modern tools don't replace fundamentals—they raise the stakes. Use AI to accelerate, but respect planning and safety. And when you're still proving demand, go web first. You'll learn faster, waste less, and you'll earn your way into mobile when the market makes the call.
Stay Connected: Join the Developreneur Community
We invite you to join our community and share your coding journey with us. Whether you're a seasoned developer or just starting, there's always room to learn and grow together. Contact us at info@develpreneur.com with your questions, feedback, or suggestions for future episodes. Together, let's continue exploring the exciting world of software development.
Additional Resources
Why Build A Mobile Application?
Defining An MVP Properly for Your Goals
How to Build a Minimal Viable Product Without Blowing Your Budget
Building Better Foundations Podcast Videos – With Bonus Content
In an online meeting with the San Diego Ramana Satsang (ramana-satsang-sd@googlegroups.com) on 4th January 2026, Michael answers various questions about Bhagavan's teachings. This episode can be watched as a video on YouTube. A more compressed audio copy in Opus format can be downloaded from MediaFire. Advertisement-free videos on the original writings of Bhagavan Ramana with explanations by Michael James can be accessed on our Vimeo video channel. Books by Sri Sadhu Om and Michael James that are currently available on Amazon: By Sri Sadhu Om: ► The Path of Sri Ramana (English) By Michael James: ► Happiness and Art of Being (English) ► Lyckan och Varandets Konst (Swedish) ► Anma-Viddai (English) Above books are also available in other regional Amazon marketplaces worldwide. - Sri Ramana Center of Houston
In this episode I sit down with the evil genius who built an entire custom app using AI. And we're giving you the prompt FOR FREE! You take our prompt, implement it in your context, and you have a fun, custom sniper game for your next summer camp, winter retreat or D-Now!
[FREE] AI SNIPER APP BUILDER
https://www.patreon.com/posts/free-ai-sniper-147099707?utm_medium=clipboard_copy&utm_source=copyLink&utm_campaign=postshare_creator&utm_content=join_link
SHOW NOTES
Shownotes & Transcripts https://www.hybridministry.xyz/186
❄️ WINTER SOCIAL MEDIA PACK
https://www.patreon.com/posts/winter-seasonal-144943791?utm_medium=clipboard_copy&utm_source=copyLink&utm_campaign=postshare_creator&utm_content=join_link
HYBRID HERO MEMBERS GET IT FREE!
https://www.patreon.com/hybridministry
In an online session during global Sri Ramana Jayanti celebrations, on 4th January 2026 Michael James discusses Bhagavan Ramana's teachings. This episode can be watched as a video on YouTube. A more compressed audio copy in Opus format can be downloaded from MediaFire. Ad-free videos on the original writings of Bhagavan Ramana with explanations by Michael James can be accessed on our Vimeo video channel. Books by Sri Sadhu Om and Michael James that are currently available on Amazon: By Sri Sadhu Om: ► The Path of Sri Ramana (English) By Michael James: ► Happiness and Art of Being (English) ► Lyckan och Varandets Konst (Swedish) ► Anma-Viddai (English) Above books are also available in other regional Amazon marketplaces worldwide. - Sri Ramana Center of Houston
Marking an important milestone for this rare form of retinitis pigmentosa (RP)
I sit down with Alex Finn to break down how he sets up Moltbot (formerly Clawdbot) as a proactive AI employee he treats like a teammate named Henry. We walk through the core workflow: Henry sends a daily morning brief, researches while Alex sleeps, and ships work as pull requests for review. Alex explains the setup that makes this work: feeding the bot deep personal and business context, then setting clear expectations for proactive behavior. We cover model strategy (Opus as "brain," Codex as "muscle"), a "Mission Control" task tracker Henry built, hardware options, and the security mindset around prompt injection and account access.
Timestamps
00:00 – Intro
02:08 – Clawdbot Overview
03:33 – The Morning Brief Workflow
05:01 – Proactive Builds: Trends → Features → Pull Requests
07:27 – The Setup: Context + Expectations For Proactivity
09:38 – The Onboarding Prompt Alex Uses
12:05 – Hunting "Unknown Unknowns" For Real Leverage
12:43 – Using the Right Models for Cost Control
14:18 – Mission Control: A Kanban Tracker Henry Built
17:16 – The Future of Human and AI Workflow
22:01 – Hardware and Hosting: Cloud vs Local (Mac Mini/Studio)
25:47 – The Productivity Framework
27:10 – The Possible Evolution of Clawdbot
28:53 – Security and Privacy Concerns
33:38 – Closing Thoughts: Tinkering, Opportunity, and Next Steps
Key Points
I get the most leverage when I treat the agent like a proactive teammate with clear expectations and rich context.
Henry delivers compounding value by shipping work for review (pull requests) based on trend monitoring and conversation memory.
I separate "brain" and "muscle" by delegating heavy coding to Codex while using Opus for reasoning and direction.
I track autonomous work with a dedicated "Mission Control" board so progress stays visible over time.
I keep risk contained by controlling environment and account access, especially around email and prompt injection.
The #1 tool to find startup ideas/trends - https://www.ideabrowser.com
LCA helps Fortune 500s and fast-growing startups build their future - from Warner Music to Fortnite to Dropbox. We turn 'what if' into reality with AI, apps, and next-gen products https://latecheckout.agency/
The Vibe Marketer - Resources for people into vibe marketing/marketing with AI: https://www.thevibemarketer.com/
FIND ME ON SOCIAL
X/Twitter: https://twitter.com/gregisenberg
Instagram: https://instagram.com/gregisenberg/
LinkedIn: https://www.linkedin.com/in/gisenberg/
FIND ALEX ON SOCIAL
Youtube: https://www.youtube.com/@AlexFinnOfficial/videos
X/Twitter: https://x.com/AlexFinnX
Creator Buddy: https://www.creatorbuddy.io/
If you're building a new app or software product, your biggest risk usually isn't "bad code." It's building the wrong thing, shipping it with a shaky first impression, and then wondering why growth never shows up. In this episode of Building Better Developers, Angelo Zanetti breaks it down into a simple founder goal: prove your MVP—prove the problem is real, prove the solution is worth paying for, and prove you can deliver value without burning your runway.
About Angelo Zanetti
Angelo Zanetti is the co-founder and CEO of Elemental, a South African-based software development agency helping startups and scaleups worldwide bring digital products to life. Since 2005, his team has specialized in building scalable, high-performance web apps and software platforms. Angelo blends deep technical knowledge with strategic thinking, helping founders launch bespoke products that are lean, user-focused, and built for long-term value. He's also served on several boards (including BISA and Entrepreneurs' Organisation Cape Town) and is a proud member of the global founder community OPUS.
Prove your MVP by solving a real problem
Angelo's first checkpoint is direct: product-market fit is about whether you're solving a real pain—or building for a problem that "doesn't really exist." That's the trap founders fall into when the plan is "we'll launch, and the floodgates will open." In reality, traction comes from specificity: a specific user, a specific workflow, and a specific outcome that's better than the alternatives. If you can't describe your user's pain in one sentence, you're not ready to build features—you're ready to refine the problem.
Keeping it simple
To prove your MVP, you need a version you can ship and learn from. Angelo's advice: keep it MVP—keep it simple—make launch as easy as possible. This is where founders accidentally turn "minimal" into "massive." They stack features, add edge cases, and delay learning. A better approach is to ship the smallest version that delivers one clear win. A practical filter: Does this feature directly help the user get the promised result? Will we learn something important by shipping it now? If we cut it, can the product still succeed?
Prove your MVP with a clean, bug-free first impression
One of Angelo's strongest warnings: don't treat users like beta testers. He's not a fan of launching "full of bugs" and fixing things live, because you only get one chance at a strong first impression. That matters even more early on, when your users are deciding whether to trust you with their time, money, or data. Bugs don't just hurt quality—they kill momentum. A messy first experience can "blow your chances" to wow users.
Market before development
This is the founder's lesson that never feels "technical," but decides everything: marketing starts before you build. Angelo calls out the pattern he's seen repeatedly—founders who plan customer acquisition do well, and those who assume "launch to the world" will magically work usually don't. Marketing early doesn't mean ads on day one. It means clarity: Who is this for? Where do they hang out? What promise makes them lean in? What proof would make them try it?
Prove your MVP safely in the AI era
AI tools can help you move faster—but they can also help you move faster into danger. Angelo raises a big concern: "vibe-coded" apps can become a playground for hackers, where API keys get exposed and security gaps get exploited—especially when a non-technical founder doesn't know what to look for.
He also frames planning with a great metaphor: building software is like building a house—you start with an architect. Scoping, specifications, and user journeys are often undervalued because they're not "tangible," but they're key to long-term success and scaling. Speed is great. But speed without planning and security is how you "prove" the wrong thing—painfully.
Closing thoughts
If you want to prove your MVP, don't chase perfection—and don't chase feature bloat either. Solve a real problem, keep it minimal, launch with quality, and start marketing earlier than feels comfortable. That's how you get real traction, real feedback, and a real foundation to scale.
Stay Connected: Join the Developreneur Community
We invite you to join our community and share your coding journey with us. Whether you're a seasoned developer or just starting, there's always room to learn and grow together. Contact us at info@develpreneur.com with your questions, feedback, or suggestions for future episodes. Together, let's continue exploring the exciting world of software development.
Additional Resources
Defining An MVP Properly for Your Goals
Solving Problems in Software Projects
How to Build a Minimal Viable Product Without Blowing Your Budget
Building Better Foundations Podcast Videos – With Bonus Content
Kathy Campbell and Alex Cox
Kathy continues to explore her relationship with Mr. Opus, and Alex continues to be almost as bad at folding laundry as the robots of CES 2026.
Subtitle: Find yourself a bad faith buddy.
Links and Show Notes:
Support Roboism with a Relay Membership
Submit Feedback
http://relay.fm/roboism/77
Join us as Sam demonstrates how to teach AI to write Terraform configurations using Model Context Protocol (MCP) servers. Sam introduces the Terraform MCP server and walks through practical demos showing how AI can understand and safely interact with your infrastructure. You'll see live examples of AI planning, generating, and evolving Terraform configurations — from creating landing zones to setting up workspace variables automatically. Whether you're managing complex multi-cloud environments or just getting started with infrastructure as code, this episode demonstrates how MCP servers bridge the gap between AI capabilities and real-world Terraform workflows. Learn how to get started, which Claude models work best for different tasks, and best practices for integrating AI into your IaC pipelines (a hedged sketch of the client-side setup follows below).
Timestamps
0:00 Welcome & Introduction
4:37 Sam McGeown's Background
6:02 Introduction to Terraform MCP Server
12:35 What is Model Context Protocol?
18:22 Setting Up the Terraform MCP Server
24:16 Demo: Claude Desktop Integration
30:41 Creating Infrastructure with AI Prompts
36:52 Reading & Analyzing Existing Terraform Code
42:18 Generating Landing Zone Configurations
47:35 Working with Terraform Workspaces
50:37 Creating Variables Automatically
52:14 Model Selection: Sonnet vs Opus
55:11 Live Demo: Workspace Variable Creation
58:33 Getting Started & Resources
How to find Sam: https://www.linkedin.com/in/sammcgeown/
Links from the show: https://developer.hashicorp.com/terraform/mcp-server
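For anyone who wants to try this before watching: MCP servers are registered in the client's config file, and Claude Desktop reads a claude_desktop_config.json with an "mcpServers" map. The entry below is a minimal sketch assuming you run HashiCorp's server via Docker — the image name follows the docs linked above, but confirm the exact invocation there, since flags and distribution may have changed.

```jsonc
// claude_desktop_config.json — minimal sketch; verify against the
// HashiCorp docs linked in the show notes before relying on it
{
  "mcpServers": {
    "terraform": {
      // Runs the Terraform MCP server in a container; the -i flag keeps
      // stdin open, which is how the client talks to stdio MCP servers
      "command": "docker",
      "args": ["run", "-i", "--rm", "hashicorp/terraform-mcp-server"]
    }
  }
}
```

After restarting the client, the model can discover the server's tools (provider and module lookups, for example) instead of guessing Terraform syntax from training data.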
In an online meeting with a group of Bhagavan's devotees in Hyderabad, Michael James discusses Bhagavan's teachings. This episode can be watched as a video on YouTube and a more compressed audio copy in Opus format can be downloaded from MediaFire. Songs of Sri Sadhu Om with English translations can be accessed on our Vimeo video channel. Books by Sri Sadhu Om and Michael James that are currently available on Amazon: By Sri Sadhu Om: ► The Path of Sri Ramana (English) By Michael James: ► Happiness and Art of Being (English) ► Lyckan och Varandets Konst (Swedish) ► Anma-Viddai (English) Above books are also available in other regional Amazon marketplaces worldwide. - Sri Ramana Center of Houston.
Duration: 00:57:18 - L'Atelier fiction - An improvised radio road movie between Mentreux and Brem-sur-Mer with Compagnie Opus.
I'm celebrating 6 years in podcasting today!
Nick and Myron are back breaking down a packed week in wrestling, streaming, and storytelling—what worked, what didn't, and what has us raising an eyebrow heading into the Royal Rumble stretch.
On This Episode:
What caught our eye on TV this week
A full preview of WWE Saturday Night's Main Event, with major implications coming out of Cody Rhodes vs. Jacob Fatu and a loaded Fatal 4-Way
Early thoughts on WWE Unreal Season 2, Episode 1—what feels real, what feels filtered, and where the season looks like it's headed
The TNA on AMC debut: expectations vs. reality, AJ Styles' presence, and whether this was a stumble or just growing pains
Big-picture praise and criticism
As always, this isn't a recap show—it's about context, direction, and calling it like we see it.
Follow the show, subscribe, and join the conversation across the Tapped Out Podcast Network.
I stumbled into this camp idea, and I'm going to give it to you today! How I discovered it, what I've done that's worked, and how you can adapt it for your context. This is a great Winter Retreat, Summer Camp or D-Now game concept that runs in the background of your student ministry event! My Game Cheat Sheet: https://www.patreon.com/posts/object-hunt-146629811?utm_medium=clipboard_copy&utm_source=copyLink&utm_campaign=postshare_creator&utm_content=join_link SHOW NOTES Shownotes & Transcripts https://www.hybridministry.xyz/185 ❄️ WINTER SOCIAL MEDIA PACK https://www.patreon.com/posts/winter-seasonal-144943791?utm_medium=clipboard_copy&utm_source=copyLink&utm_campaign=postshare_creator&utm_content=join_link HYBRID HERO MEMBERS GET IT FREE! https://www.patreon.com/hybridministry DONUT BRACKET VIDEO: https://youtu.be/5ryhkIRyDb4?si=HGPeqL4k03WGceod
Wait.... AI can do THAT now?
This week on the Tapped In: Indy Wrestling Podcast, Nick is joined by Jacked Jameson with Rosario Grillo (The Big Cannoli) jumping in as a third voice on the panel as the show breaks down one of the busiest weeks yet across Georgia wrestling.
The focus stays on the top stories driving the scene right now, where momentum is shifting, and what fans should be paying attention to heading into a loaded weekend of live events.
On this episode:
Breaking down the biggest headlines across Georgia, including major attacks, injuries, and title changes
Discussing the ripple effects of a Triple Crown Champion emerging and what comes next
Reviewing recent shows and crowd reactions, including packed gym venues and rising momentum
A conversation about independent wrestling competing with major national events and whether it actually matters at the local level
Hot Hand discussion spotlighting promotions and wrestlers gaining serious traction right now
A fun Mount Rushmore debate stepping outside wrestling with iconic superheroes
Making the Drives: a full weekend preview covering multiple shows across Georgia from Friday through Saturday, with championship matches, open challenges, and major story implications
Whether you're trying to keep up with the news, decide where to spend your wrestling weekend, or just want context behind the buzz, this episode has you covered.
Entrepreneur Andrew Wilkinson used to sleep nine hours a night. Now he wakes up at 4 a.m. and goes straight to work—because he can't wait to keep building with Anthropic's latest model, Opus 4.5.
Two years ago, Wilkinson was obsessed with vibe coding on AI software development platform Replit. It was thrilling to describe something in plain English and watch an app appear, less thrilling when the apps were always broken in some way, often full of maddening bugs. So he set his app creation ambitions aside until technology caught up with them.
Then, a few weeks ago, he started playing with Claude Code and Opus 4.5. It felt, he says, like having a "$100,000-a-month payroll of engineers" working for him around the clock.
Wilkinson is the cofounder of Tiny, a company that buys profitable businesses and holds them for the long term. The Tiny portfolio includes the AeroPress coffee maker and Dribbble, a platform where designers can share their work and find jobs. Dan Shipper had him on AI & I to talk about the automations Wilkinson has built for his work and personal life, including an AI relationship counselor, a custom email client, and a system that texts him outfit recommendations each morning. Wilkinson revealed how all of this individual exploration has changed the way he thinks about buying software companies at Tiny.
If you found this episode interesting, please like, subscribe, comment, and share!
Want even more? Sign up for Every to unlock our ultimate guide to prompting ChatGPT here: https://every.ck.page/ultimate-guide-to-prompting-chatgpt. It's usually only for paying subscribers, but you can get it here for free.
To hear more from Dan Shipper:
Subscribe to Every: https://every.to/subscribe
Follow him on X: https://twitter.com/danshipper
Ready to build a site that looks hand-coded—without hiring a developer? Launch your site for free at framer.com, and use code DAN to get your first month of Pro on the house!
Timestamps:
00:00:00 - Start
00:01:07 - Introduction
00:02:48 - Why Opus 4.5 feels like the iPhone moment for vibe coding
00:08:31 - Why designers have a unique advantage with AI
00:14:10 - How Wilkinson built a custom email client with Claude Code
00:18:13 - An AI trained on your relationship that predicts your fights
00:30:40 - Using AI meeting notes to make your life better
00:35:11 - Don't inject your opinion into prompts
00:40:21 - Wilkinson's Claude Code tips and workflows
00:47:59 - Your personal stylist is a prompt away
00:53:17 - How AI is changing the way Wilkinson invests in software
Links to resources mentioned in the episode:
Andrew Wilkinson: Andrew Wilkinson (@awilkinson)
The book Wilkinson references in his prompts, when writing copy with AI: Made to Stick
Every's compound engineering plugin: https://github.com/EveryInc/compound-engineering-plugi
ADSN returns with a big first episode back. James Borow and Daniel Druger break down the Gemini-Apple deal and the potential for AI assistants to be a new social network. They look at the Universal Commerce Protocol (UCP) with Google, Shopify, and others and what it could mean for agentic commerce. Mike Khristo, CEO and co-founder of Layers.com, joins James and Daniel to discuss the magic of Anthropic's Claude and how the release of Opus 4.5 has led to a step change in web, cross-platform, and desktop agents. He also gives real life examples of how vibe coding is bringing founders of any technical level from idea to launch.
Hi, It's Michele! Send me a text with who you want as a guest!
This episode is sponsored by: "The Grouchy Architect" Opus 2 MBE, LLC
Christian Nielsen-Palacios is a licensed architect with over 40 years of experience, primarily focused on quality assurance (QA), quality control (QC), and technical specification writing for architectural projects.
Link to website: https://thegrouchyarchitect.com/
This episode is part of "Design Your Architecture Firm," a mini-series within the podcast I've never met a woman architect before... with Michele Grace Hottel, Architect, that will help you design and build the architecture and design firm that you have always wanted and bring in the projects that you will love working on.
Link to Blog for Text and Images: https://inmawomanarchitect.blogspot.com/2026/01/interview-with-jeff-echols-of.html
Jeff Echols of EchoEngagement
Fractional Chief Innovation Officer, Guiding AEC Leaders and their Teams on developing a Culture of Growth and Innovation | Technology and Business Transformation for Lasting Growth | Human-First Approach to AI | Fostering a Culture of Growth
(317) 434-4221
Let's connect on LinkedIn
Bio: As the AEC industry increases its adoption of AI, Jeff Echols reminds leaders that technology fails without the right culture. Jeff acts as a strategic partner to firms, helping them move beyond the buzzwords to build a true 'Culture of Growth.' He helps leaders realize that being future-ready isn't about software—it's about the simultaneous elevation of your people, your organization, and your customers. If you want to navigate disruption without losing the human connection that drives your success, Jeff is the voice you need to hear.
Link to MGHarchitect: Michele Grace Hottel, Architect website for scheduling a consultation for an architecture and design project and guest and podcast sponsorship opportunities: https://www.mgharchitect.com/
In Episode Eight, Jazz Podcast Host Dave Reis speaks with guitarist Russ Campoli about his talent, passion, and dedication. Russ is a prominent jazz educator and musician. He has been widely recognized for his long-tenured career as the Jazz Band Director at New Bedford High School in New Bedford, Massachusetts, where he served in the music department for 34 years. He retired in 2011. Under his direction, the NBHS Jazz Band became a highly competitive ensemble, frequently participating in prestigious events like the Berklee Jazz Festival and the International Association of Jazz Educators (IAJE) festivals. His retirement was famously compared by students to the film Mr. Holland's Opus, highlighting his profound impact on generations of musicians, some of whom went on to professional music careers. The Artist Index's jazz documentarian and Jazz Podcast Series host, Dave Reis, spent nearly 26 years as a Jazz radio show host, among his many other accomplishments. He was one of the original longtime DJs who worked at the former WUSM, which became radio station WUMD, 89.3 FM, on the University of Massachusetts Dartmouth campus. Dave Reis, AKA David Domingo Reis, began as our guest on In-Focus Podcast 154 and In-Focus Podcast 181. He returns once again as the host of our first-ever ten-part jazz podcast series underwritten by the Fiber Optic Center. There is no better host for this series than Dave Reis, a walking, talking jazz encyclopedia and local legend himself. Dave grew up surrounded by and hanging around with many of the jazz greats he will be presenting in his ten-part Jazz Podcast Series underwritten by the Fiber Optic Center. Podcasts are also available on your favorite media app, including Amazon Music / iHeart Radio / Libsyn / Podcast Page / Spotify / WebPlayer, and APPLE PODCASTS. Please consider donating whatever you can to help and assure us in our mission to continue documenting the legacies of South Coast Artists. If you would like to be a guest on The Artists Index or have a suggestion, please let us know!
#103: 2026 Vibes
We're back! In this season 6 premiere, Steve struggles to speak after being sick, but still manages to talk a lot about the current hype around Claude Code and Opus 4.5 and the "agentic" coding craze sweeping developer circles online right now. The Trio discuss Alex Hillman's amazing Claude Code wrapper personal assistant called "Andy." Kotaro makes some Metal shaders with Claude and is building a TouchPie with his own hands while Steve is exploring some open source models to use in a Bento Fit update and Aaron is exploring a full and satisfying life away from the computer. There is a lot packed into this episode and it ends with a special announcement!
## Show Notes
- Introductions
- Anime and Isekai: Comfort Food for Developers
- Side Project Explorations (Mostly in "AI")
  - Steve Yegge's crazy “Gas Town” multi-agent automated code factory thing
    - https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04
  - Alex Hillman's “Andy” personal assistant built around headless Claude Code
    - https://youtu.be/yjO9UHIunSE
    - https://www.youtube.com/live/rk2nsE-MlPg
  - Whoops! Data exfiltration in Claude Cowork!
    - https://www.promptarmor.com/resources/claude-cowork-exfiltrates-files
  - Robots!
    - https://www.youtube.com/watch?v=lS_z60kjVEk
    - https://www.youtube.com/watch?v=VJqMPFNP4to
  - Shaders!
    - https://www.shadertoy.com
  - Bento Fit!
    - https://bentofit.app
  - Raspberry Pi “Ziggy”?
- One More Thing…IRL EVENT!!!!
  - Jan 29 at the Vanguard Building, 2300 Chestnut St., Philadelphia, PA
  - RSVP: https://luma.com/98u1j4t8
  - http://phillycocoa.org
## Chapters
00:00 Introductions
01:50 Anime and Isekai: Comfort Food for Developers
04:59 Side Project Explorations (Mostly in "AI")
08:58 The Rise of AI in Coding: Tools and Experiences
14:49 Innovative Approaches to Coding with AI
19:09 Alex Hillman's Amazing "Andy" Claude Code App
21:18 Building Personal AI Tools
26:03 The Bleeding Edge that is Claude Cowork
29:33 Navigating Security Risks in AI Tools
34:00 The Evolution of AI and User Experience
37:51 "Do Better!"
41:33 Making Shaders Metal with Claude
45:46 Building TouchPie: A Raspberry Pi Project
49:19 Designing Custom Hardware for Enhanced User Experience
52:16 Bento Fit: Building with Open Source Models
55:17 Future Projects and Collaboration Ideas
58:17 Outro & One More Thing...
59:46 Tag
Intro music: "When I Hit the Floor", © 2021 Lorne Behrman. Used with permission of the artist.
In an online meeting with Sri Ramana Center, Houston, on 3rd January 2026, Michael James discusses Uḷḷadu Nāṟpadu Anubandham, verse 38. This episode can be watched as a video on our advertisement-free Vimeo video channel, or on YouTube. A compressed audio copy in Opus format can be downloaded from MediaFire. Books by Sri Sadhu Om and Michael James that are currently available on Amazon: By Sri Sadhu Om: ► The Path of Sri Ramana (English) By Michael James: ► Happiness and Art of Being (English) ► Lyckan och Varandets Konst (Swedish) ► Anma-Viddai (English) Above books are also available in other regional Amazon marketplaces worldwide. - Sri Ramana Center of Houston
Nick & Myron break down a wild week across WWE, Netflix, and TNA.
It was one of those weeks where wrestling felt unpredictable again. Titles changed, rumors shifted, streaming numbers dropped, and the industry showed its hand just a little more than usual. Nick and Myron dig into what mattered, what raised eyebrows, and what could reshape the road to WrestleMania season.
What Caught Our Eye on TV
A major title change shakes up WWE's main event scene and immediately creates more questions than answers
A surprise return adds chaos to an already crowded top-of-the-card picture
Rumors swirl about whether long-planned WrestleMania matches are suddenly off the table
Royal Rumble season feels… different this year — and not everyone is rushing to declare
WWE's First Year on Netflix
WWE quietly hits the one-year mark on Netflix
The numbers tell a much bigger story than just "RAW moved platforms"
Global reach, weekly consistency, and one unexpected breakout hit
Is this the blueprint for how wrestling lives on streaming going forward?
WWE: Unreal Season 2
The behind-the-curtain series returns with even deeper access
Creative decisions, almost-happened storylines, and real personalities clashing with TV characters
The big question: does transparency make WWE stronger… or does it chip away at the mystique?
TNA's Next Chapter
TNA Wrestling steps into a new spotlight
Familiar names, unexpected buzz, and a card that raises eyebrows
What this move says about TNA's confidence — and who might be watching closely
In this episode I sit down and share the entire inspiration for this D-Now, Winter Retreat & Summer Camp ongoing-games series with my friend, Andrew Jansen. Andrew is a 10+ year youth worker, and his assassin game sparked this entire podcast mini-series. He explains his creative (and super CHEAP) adaptation of this game. Plus! Andrew shared his lock-in survival guide for FREE!
Andrew's Lock-in Guide: https://www.patreon.com/posts/10-year-veterans-146449370?utm_medium=clipboard_copy&utm_source=copyLink&utm_campaign=postshare_creator&utm_content=join_link
SHOW NOTES
Shownotes & Transcripts https://www.hybridministry.xyz/184
BECOME A HYBRID HERO
https://www.patreon.com/hybridministry
❄️ WINTER SOCIAL MEDIA PACK
https://www.patreon.com/posts/winter-seasonal-144943791?utm_medium=clipboard_copy&utm_source=copyLink&utm_campaign=postshare_creator&utm_content=join_link
This week on the Tapped In: Indy Wrestling Podcast, Nick is joined by Jacked Jameson, who officially steps into the role of co-host moving forward as the show continues its evolution in 2026.
The conversation centers around the big headlines shaping Georgia wrestling right now, locker-room perspectives on awards season, standout performances that deserve more attention, and a packed weekend slate of live events.
On this episode:
Breaking down the latest headlines across Georgia's independent wrestling scene
Discussing momentum shifts, championship implications, and ongoing storylines
A thoughtful conversation around awards season and what recognition really means
Fan Q&A covering longevity, visibility, preparation, and life beyond the ring
Underrated and underappreciated talent and promotions that deserve more attention
A fun Mount Rushmore debate focusing on legendary factions
Making the Drives: a full weekend preview of live wrestling across the state
With Jacked Jameson now locked in as co-host, this episode helps define the direction and tone of Tapped In moving forward.
In an online meeting with the Chicago Ramana devotees on 28th December 2025, Michael answers various questions about the teachings of Bhagavan Ramana. This episode can be watched as a video on YouTube. A more compressed audio copy in Opus format can be downloaded from MediaFire. Songs of Sri Sadhu Om with English translations can be accessed on our Vimeo video channel. Books by Sri Sadhu Om and Michael James that are currently available on Amazon: By Sri Sadhu Om: ► The Path of Sri Ramana (English) By Michael James: ► Happiness and Art of Being (English) ► Lyckan och Varandets Konst (Swedish) ► Anma-Viddai (English) Above books are also available in other regional Amazon marketplaces worldwide. - Sri Ramana Center of Houston
This week, Qiao Wang joins the show to discuss how AI is transforming what it means to be an investor in 2026. We deep dive into Claude Opus 4.5 and its breakout moment, why Qiao got into the Google trade, constructing a portfolio for 2026, where to allocate time in the age of AI and more. Enjoy!
--
Follow Qiao: https://x.com/QwQiao
Follow Jason: https://x.com/JasonYanowitz
Follow Empire: https://twitter.com/theempirepod
--
This Empire episode is brought to you by VanEck. Learn more about the VanEck Onchain Economy ETF (NODE): http://vaneck.com/EmpireNODE
An investment in the Fund involves a substantial risk and is not suitable for all investors. It is possible to lose your entire principal investment. The Fund may invest nearly all of its net assets in either Digital Transformation Companies and/or Digital Asset Instruments. The Fund does not invest in digital assets or commodities directly. Digital asset instruments may be subject to risks associated with investing in digital asset exchange-traded products ("ETPs"), which include the historical extreme volatility of the digital asset and cryptocurrency market, as well as less regulation and thus fewer investor protections, as these ETPs are not investment companies registered under the Investment Company Act of 1940 ("1940 Act") or commodity pools for the purposes of the Commodity Exchange Act ("CEA"). Investing involves substantial risk and high volatility, including possible loss of principal. Visit vaneck.com to read and consider the prospectus, containing the investment objective, risks, and fees of the fund, carefully before investing. © Van Eck Securities Corporation, Distributor, a wholly owned subsidiary of Van Eck Associates Corporation.
--
Mantle Global Hackathon 2025 is live! Running from Oct 22 to Dec 31, Mantle invites builders to design the future of Real-World Assets (RWAs) on its modular L2 stack.
Key Highlights:
- $150,000 Prize Pool + Grants & Incubation opportunities
- Access to Bybit's 7M+ verified users
- Judges from Bybit Ventures, Spartan, Animoca Brands
- 6 Tracks: RWA/RealFi, DeFi, AI, ZK, Infra, GameFi
Join the Hackathon: https://www.hackquest.io/vi/hackathons/Mantle-Global-Hackathon-2025
--
Timestamps
(00:00) Intro
(02:33) Claude's Opus 4.5 Breakout Moment
(08:40) How Does AI Impact Startups?
(13:10) Do Moats Still Exist?
(20:00) Getting Into The Google Trade
(23:38) Where To Allocate Time In 2026
(26:23) Ads (VanEck, Mantle)
(28:25) Using AI Models For Investing
(41:30) How Does AI Change Brand & Distribution?
(44:49) Qiao's Portfolio In 2026
(55:14) Health & Longevity
--
Disclaimer: Nothing said on Empire is a recommendation to buy or sell securities or tokens. This podcast is for informational purposes only, and any views expressed by anyone on the show are solely our opinions, not financial advice. Santiago, Jason, and our guests may hold positions in the companies, funds, or projects discussed.
The AI Breakdown: Daily Artificial Intelligence News and Discussions
Claude Code has triggered something that feels bigger than a normal model release. Power users across AI and software are describing a clear inflection point: once autonomous coding crosses an invisible threshold, harder problems suddenly become tractable, entire workflows collapse into prompts, and delegation to AI feels genuinely competent for the first time. This episode unpacks why the reaction to Opus 4.5 and Claude Code has been so intense, how agents are changing not just what gets built but how work feels, and what this moment signals for developers, non-coders, enterprises, and the next phase of software itself. Brought to you by:
KPMG – Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Zencoder - From vibe coding to AI-first engineering - http://zencoder.ai/zenflow
Robots & Pencils - Cloud-native AI solutions that power results https://robotsandpencils.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Interested in sponsoring the show? sponsors@aidailybrief.ai
AI robots are taking over CES 2026 and we're not sure how to feel about it. Boston Dynamics Atlas is doing things with its body that no robot should be able to do. NVIDIA's new autonomous vehicle platform Alpamayo is coming for Tesla FSD. And you can now buy a $100 AI-powered drone that will hunt down whatever you point it at. Cool cool cool. Plus, Google finally put Gemini in Gmail (we tested it so you don't have to), OpenAI is building a mysterious Jony Ive audio device, and Claude Code with Opus 4.5 has coders losing their minds over something called Ralph Wiggum. Yes, that Ralph Wiggum. We've also got ChatGPT Health, LG's towel-folding robot Clo, Unitree's intimidating 6-foot humanoid, and a look at what the future of AI agents might actually look like. Spoiler: it involves a lot of orchestration. THE ROBOTS ARE FLEXIBLE AND WE DON'T LIKE IT.
Get notified when AndThen launches: https://andthen.chat/
Come to our Discord to try our Secret Project: https://discord.gg/muD2TYgC8f
Join our Patreon: https://www.patreon.com/AIForHumansShow
AI For Humans Newsletter: https://aiforhumans.beehiiv.com/
Follow us for more on X @AIForHumansShow
Join our TikTok @aiforhumansshow
To book us for speaking, please visit our website: https://www.aiforhumans.show/
// Show Links //
Jensen Huang's NVIDIA CES Presentation: https://youtu.be/uDNXjnOqJ-A?si=_h_0Fmiq788YaGZX
New 'Alpamayo' Self Driving Car & Software: https://www.bbc.com/news/articles/c0jv1vd571wo
Hands-on Test Of the Car: https://youtu.be/EzAVW1VgzcI?si=B9JCJmSQW6ywXV1x
Boston Dynamics New Robot Atlas: https://x.com/SawyerMerritt/status/2008293610308202508?s=20
Real Atlas footage from CES: https://x.com/IntuitMachine/status/2008324310230851697?s=20
Google DeepMind + Boston Dynamics: https://bostondynamics.com/blog/boston-dynamics-google-deepmind-form-new-ai-partnership/?utm_source=x&utm_medium=social&utm_campaign=&utm_content=
LG's Cloid 'AI-enabled' Robot Does Laundry *Really* Slowly: https://x.com/AP/status/2008746664841146722?s=20
$100 drone with an AI powered vision system…: https://x.com/chesterzelaya/status/2008058706500759576?s=20
New Unitree Robot Jumpkicks: https://x.com/UnitreeRobotics/status/2007746313220415717?s=20
Guy Gets Kicked In The Nuts By Unitree Robot: https://x.com/TheCartelDel/status/2004977640521044335?s=20
Men's Health Device For… You Know: https://www.tiktok.com/@verge/video/7592000373989133581
GMAIL is in its GEMINI ERA: https://t.co/oq3jYKyvF1
OpenAI's New 'Code Audio': Improving the audio models for the upcoming Jony Ive gadget: https://www.theinformation.com/articles/openai-ramps-audio-ai-efforts-ahead-device?rc=c3oojq&shared=1b7fd8b8ee0b0038
ChatGPT's Move Towards Personal Assistant: https://fidjisimo.substack.com/p/closing-the-capability-gap
ChatGPT For Health: https://x.com/OpenAI/status/2008987566796640575?s=20
Opus 4.5 + Claude Code's Big Moment (maybe we just kind of chat about this): https://www.axios.com/2026/01/07/anthropics-claude-code-vibe-coding
Ralph Wiggum + Claude Code: https://venturebeat.com/technology/how-ralph-wiggum-went-from-the-simpsons-to-the-biggest-name-in-ai-right-now
GAS TOWN: https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04
FoFR Shares His JSON Prompting Techniques: https://www.fofr.ai/prompting-with-json
Gavin's Examples (What I Did With AI):
https://x.com/gavinpurcell/status/2003148296844652662?s=20
https://x.com/gavinpurcell/status/2007194034171982122?s=20
Star Wars: Beggar's Canyon Is The World's Best AI Fan Film: https://youtu.be/SGJC4Hnz3m0?si=EWZktHOnf6_cYcMh
Related: PJ's Live Action Legend of Zelda Movie Trailer: https://x.com/PJaccetturo/status/2008559114704875888?s=20
Show Fan Eric Curts Creates Cool Way To Make Graphic Novels in Notebook LM:
https://x.com/ericcurts/status/2007939089635369351?s=20
https://www.controlaltachieve.com/2026/01/graphicnovels.html
Somebod-AI that AI used to know?: https://x.com/BrianRoemmele/status/2007838494513906051?s=20
Egg Protein: https://x.com/Solopopsss/status/2008961579728130159?s=20
Welcome to a milestone episode of Data Driven! In episode 400, hosts BAILeY, Frank La Vigne, and Andy Leonard gather to reflect on nearly a decade at the forefront of podcasting about data, AI, and the world of software engineering. This special edition takes you behind the scenes with stories of tech evolution, personal growth, and the wild journey from their earliest recordings to today's AI-powered workflows.The team digs into how generative AI has transformed their creative process—making it possible for small teams to produce vast amounts of content, drive innovation, and manage multiple podcasts at a high level. You'll hear about Frank's latest tool, Podsy—a platform built to help creators manage the ever-growing tsunami of podcast assets using cutting-edge AI—and how tools like Claude, Grok, and Opus are unleashing new possibilities for automation and storytelling.With their signature tangents, candid stories (including car accidents, water heater mishaps, and parenting milestones), and a bit of nostalgia about the early days of podcasting, this episode captures what makes Data Driven both insightful and relatable. If you're a developer, data professional, or a fellow podcaster looking to stay ahead of the curve, episode 400 is packed with practical lessons, inspiration, and a few good laughs. Dive in as the team celebrates what's possible in the age of AI—and look ahead to an exciting new chapter for the show!Timestamps 00:00 "DGX Spark: Personal AI Supercomputer"03:28 "Tech Innovations and Hardware Updates"07:51 "Daily 5:40am Notification Routine"11:10 "Podsy: Managing Podcast Assets"15:35 "Boosting Value with AI Efficiency"17:05 Claude Enhanced Legacy System23:32 AI Impact on Creative Roles25:58 "Consulting Break: Investing Time"29:33 "Automation Evolution and Tools Demo"31:35 "Podsy Studio: A Journey"35:06 "AI-Powered Content Creation Demo"38:26 "Podcast Organization and Insights"42:10 "Streamlining Metadata for Insights"45:47 "Unveiling AI's Hidden Potential"50:41 "Celebrating 400 Podcast Episodes"54:41 "Off Track Ambitions"57:44 "Reality Strikes Back: Tech Trust"59:36 "Podsy Progress and Workflow"
Nick and Myron are back breaking down another busy week in wrestling, focusing less on overreactions and more on what actually matters long-term.
What Caught Our Eye on TV
WWE's Library Moves to Netflix
SmackDown's move to three hours may not be permanent — and history backs that up.
Chris Jericho's AEW Status
TNA's AMC is how important?
Bryan Alvarez vs. Austin Theory
Happy New Year! You may have noticed that in 2025 we had moved toward YouTube as our primary podcasting platform. As we'll explain in the next State of Latent Space post, we'll be doubling down on Substack again and improving the experience for the over 100,000 of you who look out for our emails and website updates!We first mentioned Artificial Analysis in 2024, when it was still a side project in a Sydney basement. They were then one of the few AI Grant companies to raise a full seed round from Nat Friedman and Daniel Gross, and have now become the independent gold standard for AI benchmarking—trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities.We have chatted with both Clémentine Fourrier of Hugging Face's Open LLM Leaderboard and Anastasios Angelopoulos of LMArena (freshly valued at $1.7B) on their approaches to LLM evals and trendspotting, but Artificial Analysis have staked out an enduring and important place in the toolkit of the modern AI Engineer by doing the best job of independently running the most comprehensive set of evals across the widest range of open and closed models, and charting their progress for broad industry analyst use.George Cameron and Micah Hill-Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is "open" really?We discuss:* The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after swyx's retweet* Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google's Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers* The mystery shopper policy: they register accounts not on their own domain and run intelligence + performance benchmarks incognito to prevent labs from serving different models on private endpoints* How they make money: enterprise benchmarking insights subscription (standardized reports on model deployment, serverless vs. managed vs. 
leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)* The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs* Omniscience Index (hallucination rate): scores models from -100 to +100 (penalizing incorrect answers, rewarding "I don't know"), and Claude models lead with the lowest hallucination rates despite not always being the smartest* GDP Val AA: their version of OpenAI's GDPval (44 white-collar tasks with spreadsheets, PDFs, PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system), graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias)* The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2 OLMo 2 leads, followed by Nous Hermes and NVIDIA Nemotron)* The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents)* Why sparsity might go way lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omniscience accuracy correlates with total parameters (not active), suggesting massive sparse models are the future* Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at using more tokens only when needed (5.1 Codex has tighter token distributions)* V4 of the Intelligence Index coming soon: adding GDP Val AA, Critical Point, hallucination rate, and dropping some saturated benchmarks (HumanEval-style coding is now trivial for small models)Links to Artificial Analysis* Website: https://artificialanalysis.ai* George Cameron on X: https://x.com/georgecameron* Micah Hill-Smith on X: https://x.com/micahhsmithFull Episode on YouTubeTimestamps* 00:00 Introduction: Full Circle Moment and Artificial Analysis Origins* 01:19 Business Model: Independence and Revenue Streams* 04:33 Origin Story: From Legal AI to Benchmarking Need* 11:47 Benchmarking Challenges: Variance, Contamination, and Methodology* 13:52 Mystery Shopper Policy and Maintaining Independence* 16:22 AI Grant and Moving to San Francisco* 19:21 Intelligence Index Evolution: From V1 to V3* 23:01 GDP Val AA: Agentic Benchmark for Real Work Tasks* 28:01 New Benchmarks: Omniscience Index for Hallucination Detection* 33:36 Critical Point: Hard Physics Problems and Research-Level Reasoning* 50:19 Stirrup Agent Harness: Open Source Agentic Framework* 52:43 Openness Index: Measuring Model Transparency Beyond Licenses* 58:25 The Smiling Curve: Cost Falling While Spend Rising* 1:02:32 Hardware Efficiency: Blackwell Gains and Sparsity Limits* 1:06:23 Reasoning Models and Token Efficiency: The Spectrum Emerges* 1:11:00 Multimodal Benchmarking: Image, Video, and Speech Arenas* 1:15:05 Looking Ahead: Intelligence Index V4 and Future Directions* 1:16:50 Closing: The Insatiable Demand for IntelligenceTranscriptMicah [00:00:06]: This is kind of a full circle moment for us in a way, because the first time Artificial Analysis got mentioned on a podcast was you and Alessio on Latent Space. Amazing. swyx [00:00:17]: Which was January 2024. I don't even remember doing that, but yeah, it was very influential to me. 
Yeah, I'm looking at AI News for Jan 17, or Jan 16, 2024. I said, this gem of a models and host comparison site was just launched. And then I put in a few screenshots, and I said, it's an independent third party. It clearly outlines the quality versus throughput trade-off, and it breaks out by model and hosting provider. I did give you s**t for missing Fireworks, like, how do you have a model benchmarking thing without Fireworks? But you had Together, you had Perplexity, and I think we just started chatting there. Welcome, George and Micah, to Latent Space. I've been following your progress. Congrats on... It's been an amazing year. You guys have really come together to be the presumptive new Gartner of AI, right? Which is something that... George [00:01:09]: Yeah, but you can't pay us for better results. swyx [00:01:12]: Yes, exactly. George [00:01:13]: Very important. Micah [00:01:14]: Start off with a spicy take. swyx [00:01:18]: Okay, how do I pay you? Micah [00:01:20]: Let's get right into that. swyx [00:01:21]: How do you make money? Micah [00:01:24]: Well, very happy to talk about that. So it's been a big journey the last couple of years. Artificial Analysis is going to be two years old in January 2026, which is pretty soon now. We, first of all, run the website for free, obviously, and give away a ton of data to help developers and companies navigate AI and make decisions about models, providers, and technologies across the AI stack for building stuff. We're very committed to doing that and intend to keep doing that. We have, along the way, built a business that is working out pretty sustainably. We've got just over 20 people now and two main customer groups. On one side, we want to be who enterprises look to for data and insights on AI, so we want to help them with their decisions about models and technologies for building stuff. And then on the other side, we do private benchmarking for companies throughout the AI stack who build AI stuff. So no one pays to be on the website. We've been very clear about that from the very start, because there's no use doing what we do unless it's independent AI benchmarking. Yeah. But it turns out a bunch of our stuff can be pretty useful to companies building AI stuff. swyx [00:02:38]: And is it like, I am a Fortune 500, I need advisors on objective analysis, and I call you guys and you pull up a custom report for me, you come into my office and give me a workshop? What kind of engagement is that? George [00:02:53]: So we have a benchmarking and insights subscription, which looks like standardized reports that cover key topics or key challenges enterprises face when looking to understand AI and choose between all the technologies. So, for instance, one of the reports is a model deployment report: how to think about choosing between serverless inference, managed deployment solutions, or leasing chips and running inference yourself is an example of the kind of decision that big enterprises face, and it's hard to reason through; this AI stuff is really new to everybody. And so with our reports and insights subscription, we try to help companies navigate that. We also do custom private benchmarking. That's very different from the public benchmarking that we publicize, where there's no commercial model around it. For private benchmarking, we'll at times create benchmarks and run benchmarks to specs that enterprises want. And we'll also do that sometimes for AI companies who have built things, and we help them understand what they've built with private benchmarking. 
Yeah. So that's a piece mainly that we've developed through trying to support everybody publicly with our public benchmarks. Yeah. swyx [00:04:09]: Let's talk about the tech stack behind that. But okay, I'm going to rewind all the way to when you guys started this project. You were all the way in Sydney? Yeah. Well, Sydney, Australia for me. Micah [00:04:19]: George was in SF; he's Australian, but he'd moved here already. Yeah. swyx [00:04:22]: And I remember I had the Zoom call with you. What was the impetus for starting Artificial Analysis in the first place? You know, you started with public benchmarks. And so let's start there. We'll get to the private benchmarking. Yeah. George [00:04:33]: Why don't we even go back a little bit to, like, why we thought that it was needed? Yeah. Micah [00:04:40]: The story kind of begins in 2022, 2023. Both George and I had been into AI stuff for quite a while. In 2023 specifically, I was trying to build a legal AI research assistant. It actually worked pretty well for its era, I would say. Yeah. So I was finding that the more you go into building something using LLMs, the more each bit of what you're doing ends up being a benchmarking problem. I had this multistage algorithm thing, trying to figure out what the minimum viable model for each bit was, trying to optimize every bit of it as you build that out, right? Like, you're trying to think about accuracy, a bunch of other metrics, and performance and cost. And mostly just no one was doing anything to independently evaluate all the models, and certainly not to look at the trade-offs for speed and cost. So we basically set out just to build a thing that developers could look at to see the trade-offs between all of those things, measured independently across all the models and providers. Honestly, it was probably meant to be a side project when we first started doing it. swyx [00:05:49]: Like, we didn't get together and say, hey, we're going to stop working on all this stuff, this is going to be our main thing. When I first called you, I think you hadn't decided on starting a company yet. Micah [00:05:58]: That's actually true. I don't even think we'd paused anything: George still had a day job, and I didn't quit working on my legal AI thing. Like, it was genuinely a side project. George [00:06:05]: We built it because we needed it as people building in the space, and thought, oh, other people might find it useful too. So we bought a domain, linked it to the Vercel deployment that we had, and tweeted about it. But very quickly it started getting attention. Thank you, swyx, for, I think, doing an initial retweet and spotlighting this project that we released. It was useful to others, but very quickly it became more useful as the number of model releases accelerated. We had Mixtral 8x7B, and that's a fun one. Yeah. Like, an open source model that really changed the landscape and opened up people's eyes to other serverless inference providers, and to thinking about speed, thinking about cost. And so that was key, and it became more useful quite quickly. Yeah. swyx [00:07:02]: What I love talking to people like you who sit across the ecosystem is, well, I have theories about what people want, but you have data, and that's obviously more relevant. But I want to stay on the origin story a little bit more. 
When you started out, I would say the status quo at the time was: every paper would come out and they would report their numbers versus competitor numbers, and that's basically it. And I remember I did the legwork. I think everyone has some version of an Excel sheet or a Google Sheet where you just copy and paste the numbers from every paper and post it up there. And then sometimes they don't line up, because they're independently run, and so your numbers are going to look better and your reproductions of other people's numbers are going to look worse, because you don't hold their models correctly, or whatever the excuse is. I think then Stanford HELM, Percy Liang's project, would also have some of these numbers. And I don't know if there's any other source that you can cite. The way that, if I were to start Artificial Analysis at the same time you guys started, I would have used EleutherAI's eval harness. Yup. Micah [00:08:06]: Yup. That was some cool stuff. At the end of the day, running these evals, if it's a simple Q&A eval, all you're doing is asking a list of questions and checking if the answers are right, which shouldn't be that crazy. But it turns out there are an enormous number of things that you've got to control for. And, I mean, back when we started the website, one of the reasons why we realized that we had to run the evals ourselves and couldn't just take results from the labs was that they would all prompt the models differently. And when you're competing over a few points, then you can pretty easily get... You can put the answer into the model. Yeah. That, in the extreme. And you get crazy cases, like back when Google made Gemini 1.0 Ultra and needed a number that would say it was better than GPT-4, and constructed, I think never published, chain-of-thought examples, 32 of them, in every topic in MMLU, to run it, to get the score. Like, there are so many things that you... They never shipped Ultra, right? That's the one that never made it out. Not widely. Yeah. I mean, I'm sure it existed, but yeah. So we were pretty sure that we needed to run them ourselves and just run them in the same way across all the models. Yeah. And we were also certain from the start that you couldn't look at those in isolation. You needed to look at them alongside the cost and performance stuff. Yeah. swyx [00:09:24]: Okay. A couple of technical questions. I mean, obviously I also thought about this, and I didn't do it because of cost. Yep. Did you not worry about costs? Were you funded already? Clearly not, but you know. No. Well, we definitely weren't at the start. Micah [00:09:36]: So, I mean, we were paying for it personally at the start. That's a lot of money. Well, the numbers weren't nearly as bad a couple of years ago. So we certainly incurred some costs, but we were probably in the order of hundreds of dollars of spend across all the benchmarking that we were doing. Yeah. So, nothing. Yeah. It was kind of fine. These days that's gone up an enormous amount, for a bunch of reasons that we can talk about. But yeah, it wasn't that bad, because you also have to remember that the number of models we were dealing with was hardly any, and the complexity of the stuff that we wanted to do to evaluate them was a lot less. Like, we were just asking some Q&A-type questions. And one specific thing was that for a lot of evals initially, we were just sampling an answer. 
You know, like, what's the answer for this? We were going to the answer directly, without letting the models think; we weren't even doing chain-of-thought stuff initially. And that was the most useful way to get some results initially. Yeah. swyx [00:10:33]: And so for people who haven't done this work, literally parsing the responses is a whole thing, right? Because the models can answer any way they see fit, and sometimes they actually do have the right answer, but they just returned the wrong format, and they will get a zero for that unless you work it into your parser. And that involves more work. And so there's an open question whether you should give them points for not following your instructions on the format. Micah [00:11:00]: It depends what you're looking at, right? Because if you're trying to see whether or not it can solve a particular type of reasoning problem, and you don't want to test it on its ability to do answer formatting at the same time, then you might want to use an LLM-as-answer-extractor approach to make sure that you get the answer out no matter how it's phrased. But these days, it's mostly less of a problem. Like, if you instruct a model and give it examples of what the answers should look like, it can get the answers in your format, and then you can do, like, a simple regex. swyx [00:11:28]: Yeah, yeah. And then there's other questions around, I guess, multiple-choice questions: sometimes there's a bias towards the first answer, so you have to randomize the options. All these nuances, like, once you dig into benchmarks, you're like, I don't know how anyone believes the numbers on all these things. It's such dark magic. Micah [00:11:47]: You've also got, like, the different degrees of variance in different benchmarks, right? Yeah. So, if you run a four-option multiple-choice eval on a modern reasoning model at the temperatures suggested by the labs for their own models, the variance that you can see is pretty enormous if you only do a single run of it, especially if it has a small number of questions. So, one of the things that we do is run an enormous number of repeats of all of our evals when we're developing new ones and doing upgrades to our Intelligence Index to bring in new things, so that we can dial in the right number of repeats and get to the 95% confidence intervals that we're comfortable with, so that when we pull that together, we can be confident in Intelligence Index to at least as tight as, like, plus or minus one at 95% confidence. Yeah. swyx [00:12:32]: And, again, that just adds a straight multiple to the cost. Oh, yeah. Yeah, yeah. George [00:12:37]: So, that's one of many reasons that cost has gone up a lot more than linearly over the last couple of years. We report a cost to run the Artificial Analysis Intelligence Index on our website, and currently that's assuming one repeat in terms of how we report it, because we want to reflect a bit about the weighting of the index. But our cost is actually a lot higher than what we report there, because of the repeats.
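The repeat-and-aggregate arithmetic they describe is simple to sketch. Below is a minimal, hypothetical Python version, not Artificial Analysis's actual harness: `ask_model` is a stand-in stub for a real model API call, answers are extracted leniently with a regex, and the eval is repeated to get a normal-approximation 95% confidence interval.

```python
import math
import random
import re

# Hypothetical stub: a real harness would call a model API here.
def ask_model(question: str) -> str:
    return random.choice(["Answer: A", "The answer is B.", "A"])

def extract_choice(response: str):
    """Lenient extraction: accept 'A', 'Answer: A', 'The answer is A.', etc."""
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None

def run_once(questions) -> float:
    """One pass over a multiple-choice set; returns percent correct."""
    correct = sum(1 for q, gold in questions if extract_choice(ask_model(q)) == gold)
    return 100.0 * correct / len(questions)

def score_with_ci(questions, repeats: int = 8):
    """Repeat the eval and report the mean score with a ~95% confidence interval.

    High-temperature reasoning models are noisy on small multiple-choice
    sets, so single runs can swing by several points; repeats narrow that.
    """
    runs = [run_once(questions) for _ in range(repeats)]
    mean = sum(runs) / repeats
    var = sum((r - mean) ** 2 for r in runs) / (repeats - 1)
    half_width = 1.96 * math.sqrt(var / repeats)  # normal approximation
    return mean, half_width

questions = [("2 + 2 = ?  (A) 4  (B) 5  (C) 6  (D) 7", "A")]
mean, hw = score_with_ci(questions)
print(f"score: {mean:.1f} +/- {hw:.1f} (95% CI)")
```

As George notes right after, every extra repeat multiplies the eval bill, which is one reason their real costs run well above the single-repeat number they publish.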
swyx [00:13:03]: Yeah, yeah, yeah. And probably this is true, but just checking: you don't have any special deals with the labs? They don't discount it? You just pay out of pocket or out of your sort of customer funds. Oh, there is a mix. So, the issue is that sometimes they may give you a special endpoint, which is… Ah, 100%. Micah [00:13:21]: Yeah, yeah, yeah. Exactly. So, we laser focus, in everything we do, on having the best independent metrics and making sure that no one can manipulate them in any way. There are quite a lot of processes we've developed over the last couple of years to make that true. Like the one you bring up right here: if we're working with a lab and they're giving us a private endpoint to evaluate a model, it is totally possible that what's sitting behind that black box is not the same as what they serve on a public endpoint. We're very aware of that. We have what we call a mystery shopper policy, and we're totally transparent with all the labs we work with about this: we will register accounts not on our own domain and run both intelligence evals and performance benchmarks… Yeah, that's the job. …without them being able to identify it. And no one's ever had a problem with that, because a thing that turns out to actually be quite a good factor in the industry is that they all want to believe that none of their competitors could manipulate what we're doing either. swyx [00:14:23]: That's true. I never thought about that. I've been in the database industry prior, and there's a lot of shenanigans around benchmarking, right? So I'm just kind of going through the mental laundry list. Did I miss anything else in this category of shenanigans? Oh, potential shenanigans. Micah [00:14:36]: I mean, okay, the biggest one that I'll bring up is more of a conceptual one, actually, than direct shenanigans. It's that the things that get measured become the things that get targeted by the labs, right? Exactly. So that doesn't mean anything that we should really call shenanigans; like, I'm not talking about training on the test set. But if you know that you're going to be graded on a particular thing, if you're a researcher, there are a whole bunch of things that you can do to try to get better at that thing, that preferably are going to be helpful for a wide range of how actual users want to use the thing that you're building, but will not necessarily do that. So, for instance, the models are exceptional now at answering competition maths problems. There is some relevance of that type of reasoning, that type of work, to, like, how we might use modern coding agents and stuff, but it's clearly not one for one. So the thing that we have to be aware of is that once an eval becomes the thing that everyone's looking at, scores can get better on it without that being a reflection of the overall generalized intelligence of these models getting better. That has been true for the last couple of years, and it'll be true for the next couple of years. There's no silver bullet to defeat that, other than building new stuff to stay relevant and measure the capabilities that matter most to real users. Yeah. swyx [00:15:58]: And we'll cover some of the new stuff that you guys are building as well, which is cool. Like, you used to just run other people's evals, but now you're coming up with your own. And I think, obviously, that is a necessary path once you're at the frontier: you've exhausted all the existing evals. I think the next point in history that I have for you is AI Grant, which you guys decided to join, and you moved here. What was it like? I think you were in, like, batch two? Batch four. Batch four. 
Okay. Micah [00:16:26]: I mean, it was great. Nat and Daniel are obviously great, and it's a really cool group of companies that we were in AI Grant alongside. It was really great to get Nat and Daniel on board. Obviously, they've done a whole lot of great work in the space with a lot of leading companies, and they were extremely aligned with the mission of what we were trying to do. Like, we're not quite typical of a lot of the other AI startups that they've invested in. swyx [00:16:53]: And they were very much here for the mission of what we want to do. Did they give any advice that really affected you in some way, or was one of the events very impactful? That's an interesting question. Micah [00:17:03]: I mean, I remember fondly a bunch of the speakers who came and did fireside chats at AI Grant. swyx [00:17:09]: Which is also, like, a crazy list. Yeah. George [00:17:11]: Oh, totally. Yeah, yeah, yeah. There was something about, you know, speaking to Nat and Daniel about the challenges of working through a startup, and just working through the questions that don't have clear answers: how to work through those methodically, and just work through the hard decisions. And they've been great mentors to us as we've built Artificial Analysis. Another benefit for us was that other companies in the batch and other companies in AI Grant are pushing the capabilities of what AI can do at this time. And so being in contact with them, making sure that Artificial Analysis is useful to them, has been fantastic for supporting us in working out how we should build out Artificial Analysis to continue being useful to those, you know, building on AI. swyx [00:17:59]: I think to some extent, I'm of mixed opinion on that one, because to some extent, your target audience is not people in AI Grant, who are obviously at the frontier. Yeah. Do you disagree? Micah [00:18:09]: To some extent. To some extent. But then, a lot of what the AI Grant companies are doing is taking capabilities coming out of the labs and trying to push the limits of what they can do across the entire stack for building great applications, which actually makes some of them pretty archetypical power users of Artificial Analysis. Some of the people with the strongest opinions about what we're doing well and what we're not doing well and what they want to see next from us. Yeah. Because when you're building any kind of AI application now, chances are you're using a whole bunch of different models. You're maybe switching reasonably frequently between different models for different parts of your application, to optimize what you're able to do with them at an accuracy level and to get better speed and cost characteristics. So for many of them, no, they're not commercial customers of ours; like, we don't charge for all our data on the website. Yeah. But they are absolutely some of our power users. swyx [00:19:07]: So let's talk about just the evals as well. So you start out from the general, like, MMLU and GPQA stuff. What's next? How do you sort of build up to the overall index? What was in V1, and how did you evolve it? Okay. Micah [00:19:22]: So first, just as background: we're talking about the Artificial Analysis Intelligence Index, which is our synthesis metric that we pull together currently from 10 different eval datasets to give what we're pretty confident is the best single number to look at for how smart the models are. 
Obviously, it doesn't tell the whole story. That's why we publish the whole website of all the charts, to dive into every part of it and look at the trade-offs. But it's the best single number. So right now, it's got a bunch of Q&A-type datasets that have been very important to the industry, like a couple that you just mentioned. It's also got a couple of agentic datasets. It's got our own long-context reasoning dataset and some other use-case-focused stuff. As time goes on, the things that we're most interested in, the ones that are going to matter most to the capabilities that are becoming important for AI and what developers care about, are first around agentic capabilities. So, surprise, surprise: we're all loving our coding agents, and how the models perform there, and then doing similar things for different types of work, are really important to us. Linking to use cases, to economically valuable use cases, is extremely important to us. And then we've got the things that the models still struggle with, like working really well over long contexts, which are not going to go away as specific capabilities and use cases that we need to keep evaluating. swyx [00:20:46]: But I guess one thing I was driving at was, like, the V1 versus the V2, and how it aged over time. Micah [00:20:53]: Like, how we've changed the index to where we are. swyx [00:20:55]: And I think that reflects the change in the industry. Right. So that's a nice way to tell that story. Micah [00:21:00]: Well, V1 would be completely saturated right now by almost every model coming out, because doing things like writing the Python functions in HumanEval is now pretty trivial. It's easy to forget, actually, I think, how much progress has been made in the last two years. Like, we obviously play the game constantly of today's version versus last week's version and the week before, and all of the small changes in the horse race between the current frontier, and who has the best, like, smaller-than-10B model right now this week. Right. And that's very important to a lot of developers and people, especially in this particular city of San Francisco. But when you zoom out a couple of years, literally most of what we were doing to evaluate the models then would all be 100% solved by even pretty small models today. And that's been one of the key things, by the way, that's driven down the cost of intelligence at every tier of intelligence, which we can talk about more in a bit. So V1, V2, V3: we made things harder, we covered a wider range of use cases, and we tried to get closer to things developers care about, as opposed to just the Q&A-type stuff that MMLU and GPQA represented. Yeah. swyx [00:22:12]: I don't know if you have anything to add there. Or we could just go right into showing people the benchmark, and looking around and asking questions about it. Yeah. Micah [00:22:21]: Let's do it. Okay. This would be a pretty good way to chat about a few of the new things we've launched recently. Yeah. George [00:22:26]: And I think a little bit about the direction that we want to take, and where we want to push benchmarks. Currently, the Intelligence Index and evals focus a lot on kind of raw intelligence, but we want to diversify how we think about intelligence. And we can talk about it, but the new evals that we've built and partnered on focus on topics like hallucination. And we've got a lot of topics that I think are not covered by the current eval set that should be. 
And so we want to bring that forth. But before we get into that... swyx [00:23:01]: And so for listeners, just as a timestamp: right now, number one is Gemini 3 Pro High, followed by Claude Opus at 70, GPT-5.1 High (you don't have 5.2 yet), and Kimi K2 Thinking. Wow. Still hanging in there. So those are the top four. That will date this podcast quickly. Yeah. Yeah. I mean, I love it. I love it. No, no. 100%. Look back this time next year and go, how cute. Yep. George [00:23:25]: Totally. A quick view of that is, okay, there's a lot. I love it. I love this chart. Yeah. Micah [00:23:30]: This is such a favorite, right? Yeah. In almost every talk that George or I give at conferences and stuff, we always put this one up first, to situate where we are in this moment in history. This, I think, is the visual version of what I was saying before about zooming out and remembering how much progress there's been. If we go back to just over a year ago, before o1, before Claude Sonnet 3.5, we didn't have reasoning models or coding agents as a thing, and the game was very, very different. If we go back even a little bit before then, we're in the era where, when you look at this chart, OpenAI was untouchable for well over a year. And, I mean, you would remember that time period well: there being very open questions about whether or not AI was going to be competitive, like, full stop; whether or not OpenAI would just run away with it; whether we would have a few frontier labs and no one else would really be able to do anything other than consume their APIs. I am quite happy overall that the world that we have ended up in is one where... Multi-model. Absolutely. And strictly more competitive every quarter over the last few years. Yeah. This year has been insane. Yeah. George [00:24:42]: You can see it. This chart with everything added is hard to read currently. There are so many dots on it, but I think it reflects a little bit what we felt, like, how crazy it's been. swyx [00:24:54]: Why 14 as the default? Is that a manual choice? Because you've got ServiceNow in there, which is a less traditional name. Yeah. George [00:25:01]: It's models that we're highlighting by default in our charts, in our Intelligence Index. Okay. swyx [00:25:07]: You just have a manually curated list of stuff. George [00:25:10]: Yeah, that's right. But something that I actually don't think every Artificial Analysis user knows is that you can customize our charts and choose which models are highlighted. Yeah. And so if we take off a few names, it gets a little easier to read. swyx [00:25:25]: Yeah, yeah. A little easier to read. Totally. Yeah. But I love that you can see the o1 jump. Look at that. September 2024. And the DeepSeek jump. Yeah. George [00:25:34]: Which got close to OpenAI's leadership. They were so close. I think, yeah, we remember that moment. Around this time last year, actually. Micah [00:25:44]: Yeah, yeah, yeah. I agree. Yeah, well, give or take a couple of weeks. It was Boxing Day in New Zealand when DeepSeek V3 came out. And we'd been tracking DeepSeek and a bunch of the other global players that were less known over the second half of 2024, and had run evals on the earlier ones and stuff. I very distinctly remember Boxing Day in New Zealand, because I was with family for Christmas and stuff, running the evals and getting back result by result on DeepSeek V3. So this was the first of their V3 architecture, the 671B MoE. Micah [00:26:19]: And we were very, very impressed. 
That was the moment where we were sure that DeepSeek was no longer just one of many players, but had jumped up to be a thing. The world really noticed when they followed that up with the RL working on top of V3, and R1 succeeding a few weeks later. But the groundwork for that absolutely was laid with a just extremely strong base model, completely open weights, which we had as the best open weights model. So, yeah, that's the thing that you really see in the chart. They really got our attention on Boxing Day last year. George [00:26:48]: Boxing Day is the day after Christmas, for those not familiar. swyx [00:26:54]: I'm from Singapore. A lot of us remember Boxing Day for a different reason, for the tsunami that happened. Oh, of course. Yeah, but that was a long time ago. So yeah. So this is the rough pitch of AAQI. Is it A-A-Q-I or A-A-I-I? I-I. Okay. Good memory, though. Micah [00:27:11]: I don't know. I'm not used to it. Once upon a time, we did call it the Quality Index, and we would talk about quality, performance, and price, but we changed it to intelligence. George [00:27:20]: There have been a few naming changes. We added hardware benchmarking to the site, and so benchmarks at a kind of system level. And so then we changed our throughput metric: we now call it output speed, since throughput makes sense at a system level, so we took that name for the system benchmarks. swyx [00:27:32]: Take me through more charts. What should people know? Obviously, the way you look at the site is probably different than how a beginner might look at it. Micah [00:27:42]: Yeah, that's fair. There's a lot of fun stuff to dive into. Maybe we can skip past all the... like, we have lots and lots of evals and stuff. The interesting ones to talk about today, that would be great to bring up, are a few of our recent things that probably not many people will be familiar with yet. So the first one of those is our Omniscience Index. This one is a little bit different to most of the intelligence evals that we've run. We built it specifically to look at the embedded knowledge in the models, and to test hallucination by looking at, when the model doesn't know the answer, so it's not able to get it correct, what's its probability of saying I don't know versus giving an incorrect answer. So the metric that we use for Omniscience goes from negative 100 to positive 100, because we're simply taking off a point if you give an incorrect answer to the question. We're pretty convinced that this is an example of where it makes most sense to do that, because it's strictly more helpful to say I don't know instead of giving a wrong answer to a factual knowledge question. And one of our goals is to shift the incentive that evals create for models, and for the labs creating them to get higher scores. Almost every eval across all of AI up until this point has been graded by simple percentage correct as the main metric, the main thing that gets hyped, and so you should take a shot at everything; there's no incentive to say I don't know. So we did that for this one here.
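The scoring rule Micah describes is simple enough to sketch directly: plus one for a correct answer, minus one for a wrong one, zero for an abstention, normalized to a -100 to +100 scale. A minimal hypothetical grader, not Artificial Analysis's actual code:

```python
def omniscience_style_score(answers: list) -> float:
    """Score a knowledge eval on a -100..+100 scale, as described above:
    +1 for a correct answer, -1 for an incorrect one, 0 for abstaining
    ("I don't know"), normalized to percentage points.

    Under this rule, abstaining strictly dominates guessing wrong, so a
    model is rewarded for calibrated refusals rather than for taking a
    shot at every question.
    """
    total = 0
    for a in answers:
        if a["abstained"]:
            continue  # 0 points for "I don't know"
        total += 1 if a["correct"] else -1
    return 100.0 * total / len(answers)

# Hypothetical results: 6 right, 2 wrong, 2 abstentions -> score of 40.0
results = (
    [{"correct": True, "abstained": False}] * 6
    + [{"correct": False, "abstained": False}] * 2
    + [{"correct": False, "abstained": True}] * 2
)
print(omniscience_style_score(results))
```

Note how the same ten answers would score 60% under plain percentage-correct grading; the negative marking is what creates the incentive to abstain.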
swyx [00:29:22]: I think there's a general field of calibration as well, like the confidence in your answer versus the rightness of the answer. Yeah, we completely agree. Yeah. George [00:29:31]: On that, one reason that we didn't put that into this index is that we think the way to do that is not to ask the models how confident they are. swyx [00:29:43]: I don't know. Maybe it might be, though. You put in, like, a JSON field, say, confidence, and maybe it spits out something. Yeah. You know, we have done a few evals podcasts over the years, and when we did one with Clémentine of Hugging Face, who maintains the Open LLM Leaderboard, this was one of her top requests: some kind of hallucination slash lack-of-confidence calibration thing. And so, hey, this is one of them. Micah [00:30:05]: And, I mean, like anything that we do, it's not a perfect metric or the whole story of everything that you think about as hallucination. But yeah, it's pretty useful and has some interesting results. Like, one of the things that we saw in the hallucination rate is that Anthropic's Claude models are at the very left-hand side here, with the lowest hallucination rates out of the models that we've evaluated Omniscience on. That is an interesting fact. I think it probably correlates with a lot of the previously not-really-measured vibes stuff that people like about some of the Claude models. Is the dataset public, or is there a held-out set? There's a held-out set for this one. So we have published a public test set, but we've only published 10% of it. The reason is that for this one specifically, it would be very, very easy to have data contamination, because it is just factual knowledge questions. We'll update it over time to also prevent that, but yeah, we've kept most of it held out so that we can keep it reliable for a long time. It lets us do a bunch of really cool things, including breaking down quite granularly by topic. And so we've got some of that disclosed on the website publicly right now, and there's lots more coming in terms of our ability to break out very specific topics. Yeah. swyx [00:31:23]: I would be interested. Let's dwell a little bit on this hallucination one. I noticed that Haiku hallucinates less than Sonnet, which hallucinates less than Opus. And yeah, would that be the other way around in a normal capability environment? I don't know. What do you make of that? George [00:31:37]: One interesting aspect is that we've found that there's not really a strong correlation between intelligence and hallucination. That's to say, how smart the models are in a general sense isn't correlated with their ability to, when they don't know something, say that they don't know. It's interesting that Gemini 3 Pro Preview was a big leap over Gemini 2.5 Flash and 2.5 Pro... and if I add Pro quickly here. swyx [00:32:07]: I bet Pro's really good. Uh, actually, no, I meant the GPT Pros. George [00:32:12]: Oh yeah. swyx [00:32:13]: Because the GPT Pros are rumored, we don't know for a fact, to be like eight runs and then an LLM judge on top. Yeah. George [00:32:20]: So we saw a big jump in... this is accuracy, so this is just the percent that they get correct, and Gemini 3 Pro knew a lot more than the other models. So, big jump in accuracy, but relatively no change between the Google Gemini models, between releases, in the hallucination rate. Exactly. And so it's likely due to just a kind of different post-training recipe with the Claude models. Yeah. Micah [00:32:45]: That's what's driven this. Yeah. You can partially blame us, and how we define intelligence, having until now not defined hallucination as a negative in the way that we think about intelligence. swyx [00:32:56]: And so that's what we're changing. 
I know many smart people who are confidently incorrect. George [00:33:02]: Look at that, that is very human. Very true. And there's a time and a place for that. I think our view is that hallucination rate makes sense in this context, where it's around knowledge, but in many cases, people want the models to hallucinate, to have a go. Often that's the case in coding, or when you're trying to generate newer ideas. One eval that we added to Artificial Analysis is Critical Point, and it's really hard physics problems. Okay. swyx [00:33:32]: And is it sort of like a HumanEval type or something different, or like a FrontierMath type? George [00:33:37]: It's not dissimilar to FrontierMath. So these are kind of research questions that academics in the physics world would be able to answer, but models really struggle to answer. So the top score here is only about 9%. swyx [00:33:51]: And the people that created this, like Minway, and actually Ofir, who was kind of behind SWE-bench... what organization is this? Oh, it's Princeton. George [00:34:01]: A range of academics from different academic institutions, really smart people. They talked about how they turn the models up in terms of temperature, as high a temperature as they can, when they're trying to explore new ideas in physics with a thought partner, just because they want the models to hallucinate. Yeah, sometimes it's something new. Yeah, exactly. swyx [00:34:21]: So, not right in every situation, but I think it makes sense, you know, to test hallucination in scenarios where it makes sense. Also, the obvious question is, this is one of many: every lab has a system card that shows some kind of hallucination number, and you've chosen not to endorse those and have made your own. And I think that's a choice. Totally. In some sense, the rest of Artificial Analysis is public benchmarks that other people can independently rerun; you provide it as a service here. You have to fight the "well, who are we to do this?" And your answer is that we have a lot of customers and, you know... but, like, I guess, how do you convince the individual? Micah [00:35:08]: I mean, I think for hallucinations specifically, there are a bunch of different things that you might reasonably care about, and that you'd measure quite differently. Like, we've called this the Omniscience hallucination rate, not trying to declare that, like, it's humanity's last hallucination eval. You could have some interesting naming conventions and all this stuff. The bigger-picture answer to that is something that I actually wanted to mention just as George was explaining Critical Point as well: as we go forward, we are building evals internally, and we're partnering with academia and partnering with AI companies to build great evals. We have pretty strong views, in various ways for different parts of the AI stack, on where there are things that are not being measured well, or things that developers care about that should be measured more and better. And we intend to be doing that. We're not obsessed necessarily with the idea that everything we do, we have to do entirely within our own team. Critical Point is a cool example, where we were a launch partner for it, working with academia. We've got some partnerships coming up with a couple of leading companies. 
Those ones, obviously, we have to be careful with on some of the independence stuff, but with the right disclosure, we're completely comfortable with that. A lot of the labs have released great datasets in the past that we've used to great success independently. And so, between all of those techniques, we're going to be releasing more stuff in the future. Cool. swyx [00:36:26]: Let's cover the last couple. And then I want to talk about your trends analysis stuff, you know? Totally. Micah [00:36:31]: So on that, actually, I have one little factoid on Omniscience. If you go back up to accuracy on Omniscience: an interesting thing about this accuracy metric is that it tracks, more closely than anything else that we measure, the total parameter count of models. Makes a lot of sense intuitively, right? Because this is a knowledge eval. This is the pure knowledge metric; we're not looking at the index and the hallucination rate stuff that we think is much more about how the models are trained. This is just: what facts did they recall? And yeah, it tracks parameter count extremely closely. Okay. swyx [00:37:05]: What's the rumored size of Gemini 3 Pro? And to be clear, not confirmed by any official source, just rumors. But rumors do fly around. Rumors. I hear all sorts of numbers. I don't know what to trust. Micah [00:37:17]: So if you draw the line on Omniscience accuracy versus total parameters, we've got all the open weights models, and you can squint and see that likely the leading frontier models right now are quite a lot bigger than the one trillion parameters that the open weights models we're looking at here cap out at. There's an interesting extra data point that Elon Musk revealed recently about xAI: three trillion parameters for Grok 3 and 4, and six trillion for Grok 5, but that's not out yet. Take those together, have a look, and you might reasonably form a view that there's a pretty good chance that Gemini 3 Pro is bigger than that, that it could be in the 5 to 10 trillion parameter range. To be clear, I have absolutely no idea, but just based on this chart, that's where you would land if you have a look at it. Yeah. swyx [00:38:07]: And to some extent, I actually kind of discourage people from guessing too much, because what does it really matter? Like, as long as they can serve it at a sustainable cost, that's about it. Yeah, totally. George [00:38:17]: They've also got different incentives in play compared to open weights models, who are thinking about supporting others in self-deployment. For the labs who are doing inference at scale, it's, I think, less about total parameters in many cases when thinking about inference costs, and more about the number of active parameters. And so there's a bit of an incentive towards larger, sparser models. Agreed. Micah [00:38:38]: Understood. Yeah. Great. I mean, obviously, if you're a developer or company using these things, it's exactly as you say: it doesn't matter. You should be looking at all the different ways that we measure intelligence; you should be looking at the cost to run the index, and the different ways of thinking about token efficiency and cost efficiency based on the list prices, because that's what matters.
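Micah's squint-and-extrapolate reasoning can be made concrete with a least-squares fit of knowledge accuracy against log total parameters. The data points below are invented placeholders purely to illustrate the arithmetic; they are not Artificial Analysis measurements, and the implied size is only as good as the made-up inputs:

```python
import math

# Invented (total params in billions, knowledge-accuracy %) placeholder
# points for open-weights models -- illustrative only, not AA data.
points = [(30, 18.0), (120, 26.0), (400, 33.0), (1000, 39.0)]

# Least-squares fit of accuracy against log10(total parameters).
xs = [math.log10(p) for p, _ in points]
ys = [a for _, a in points]
n = len(points)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

# Invert the fit: what total size would a hypothetical 48%-accuracy
# frontier model imply, if the log-linear trend held?
target_accuracy = 48.0
implied_params_b = 10 ** ((target_accuracy - intercept) / slope)
print(f"fit: accuracy ~= {slope:.1f} * log10(params_B) + {intercept:.1f}")
print(f"a {target_accuracy}% model implies ~{implied_params_b / 1000:.1f}T total params")
```

With these placeholder numbers the fit lands in the low-single-digit trillions, which is the shape of the argument: if accuracy tracks log total parameters, a frontier model well above the open-weights pack on a knowledge eval implies a model well above one trillion total parameters.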
Yeah.Micah [00:39:07]: But that is like on its own, actually a very interesting one, right? That is it just purely that chances are the last couple of years haven't seen a dramatic scaling up in the total size of these models. And so there's a lot of room to go up properly in total size of the models, especially with the upcoming hardware generations. Yes.swyx [00:39:29]: So, you know. Taking off my shitposting face for a minute. Yes. Yes. At the same time, I do feel like, you know, especially coming back from Europe, people do feel like Ilya is probably right that the paradigm is doesn't have many more orders of magnitude to scale out more. And therefore we need to start exploring at least a different path. GDPVal, I think it's like only like a month or so old. I was also very positive when it first came out. I actually talked to Tejo, who was the lead researcher on that. Oh, cool. And you have your own version.George [00:39:59]: It's a fantastic. It's a fantastic data set. Yeah.swyx [00:40:01]: And maybe it will recap for people who are still out of it. It's like 44 tasks based on some kind of GDP cutoff that's like meant to represent broad white collar work that is not just coding. Yeah.Micah [00:40:12]: Each of the tasks have a whole bunch of detailed instructions, some input files for a lot of them. It's within the 44 is divided into like two hundred and twenty two to five, maybe subtasks that are the level of that we run through the agenda. And yeah, they're really interesting. I will say that it doesn't. It doesn't necessarily capture like all the stuff that people do at work. No avail is perfect is always going to be more things to look at, largely because in order to make the tasks well enough to find that you can run them, they need to only have a handful of input files and very specific instructions for that task. And so I think the easiest way to think about them are that they're like quite hard take home exam tasks that you might do in an interview process.swyx [00:40:56]: Yeah, for listeners, it is not no longer like a long prompt. It is like, well, here's a zip file with like a spreadsheet or a PowerPoint deck or a PDF and go nuts and answer this question.George [00:41:06]: OpenAI released a great data set and they released a good paper which looks at performance across the different web chat bots on the data set. It's a great paper, encourage people to read it. What we've done is taken that data set and turned it into an eval that can be run on any model. So we created a reference agentic harness that can run. Run the models on the data set, and then we developed evaluator approach to compare outputs. That's kind of AI enabled, so it uses Gemini 3 Pro Preview to compare results, which we tested pretty comprehensively to ensure that it's aligned to human preferences. One data point there is that even as an evaluator, Gemini 3 Pro, interestingly, doesn't do actually that well. So that's kind of a good example of what we've done in GDPVal AA.swyx [00:42:01]: Yeah, the thing that you have to watch out for with LLM judge is self-preference that models usually prefer their own output, and in this case, it was not. 
Totally. Micah [00:42:08]: I think the way that we're thinking about the places where it makes sense to use an LLM-as-judge approach now is quite different to some of the early LLM-as-judge stuff a couple of years ago, because some of that (and MT-Bench was a great project that was a good example of this a while ago) was about judging conversations and, like, a lot of style-type stuff. Here, the task that the grading model is doing is quite different to the task of taking the test. When you're taking the test, you've got all of the agentic tools you're working with, the code interpreter and web search, the file system, to go through many, many turns to try to create the documents. Then, on the other side, when we're grading it, we're running it through a pipeline to extract visual and text versions of the files so we can provide those to Gemini, and we're providing the criteria for the task and getting it to pick which of two potential outputs more effectively meets the criteria of the task. It turns out that it's just very, very good at getting that right, and it matched human preference a lot of the time. I think that's because it's got the raw intelligence, but that's combined with the correct representation of the outputs, the fact that the outputs were created with an agentic task that is quite different to the way the grading model works, and that we're comparing against criteria, not just kind of zero-shot asking the model to pick which one is better. swyx [00:43:26]: Got it. Why is this an ELO and not a percentage, like GDPVal? George [00:43:31]: So the outputs look like documents, and there are video outputs or audio outputs from some of the tasks. It has to make a video? Yeah, for some of the tasks. Some of the tasks. swyx [00:43:43]: What task is that? George [00:43:45]: I mean, it's in the data set. Like, be a YouTuber? It's a marketing video. Micah [00:43:49]: Oh, wow. What? Like, the model has to go find clips on the internet and try to put it together. The models are not that good at doing that one, for now, to be clear. It's pretty hard to do that with a code interpreter. I mean, the computer-use stuff doesn't work quite well enough, and so on and so on, but yeah. George [00:44:02]: And so there's no kind of ground truth, necessarily, to compare against, to work out percentage correct. It's hard to come up with correct or incorrect there. And so it's on a relative basis, and so we use an ELO approach to compare outputs from each of the models across the tasks.
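The scheme George describes, pairwise LLM-judge verdicts rolled up into an ELO, is straightforward to sketch. Below is a toy version with guessed details, not AA's actual pipeline: `judge_prefers` is a stand-in for the Gemini judging call, and the models, task, and K-factor are hypothetical.

```python
import itertools
import random

def judge_prefers(criteria: str, output_a: str, output_b: str) -> bool:
    """Stand-in for the LLM-judge call: the real pipeline shows a grading
    model both outputs plus the task's criteria and asks which one meets
    them better. Here it just flips a coin."""
    return random.random() < 0.5

def expected(r_a: float, r_b: float) -> float:
    """Standard ELO expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, a: str, b: str, a_won: bool, k: float = 16.0) -> None:
    """Apply one ELO update to both players after a pairwise comparison."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += k * ((1.0 if a_won else 0.0) - e_a)
    ratings[b] += k * ((0.0 if a_won else 1.0) - (1.0 - e_a))

# Hypothetical model outputs for one task.
outputs = {"model-x": "...", "model-y": "...", "model-z": "..."}
ratings = {m: 1000.0 for m in outputs}
tasks = ["draft a marketing video plan to the given criteria"]  # placeholder

for criteria in tasks:
    # Compare every ordered pair so each output appears in both positions,
    # which helps wash out any position bias in the judge.
    for a, b in itertools.permutations(outputs, 2):
        update(ratings, a, b, judge_prefers(criteria, outputs[a], outputs[b]))

print({m: round(r) for m, r in ratings.items()})
```

A relative rating like this also makes it cheap to add a new model later: you only need fresh pairwise comparisons against existing outputs, not a re-grade of everything against an absolute rubric.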
Yeah.

swyx [00:45:15]: I also liked that you included Llama 4 Maverick in there. Is that like just one last, like...

Micah [00:45:20]: Well, no, no, no, it is the best model released by Meta. And so it makes it into the homepage default set, still for now.

George [00:45:31]: Another inclusion that's quite interesting is that we also ran it across the latest versions of the web chatbots. And so we have...

swyx [00:45:39]: Oh, that's right.

George [00:45:40]: Oh, sorry.

swyx [00:45:41]: Yeah, I completely missed that. Okay.

George [00:45:43]: No, not at all. So that's the one with the checkered pattern.

swyx: So that is their harness, not yours, is what you're saying.

George: Exactly. And what's really interesting is that if you compare, for instance, Claude Opus 4.5 using the Claude web chatbot, it performs worse than the model in our agentic harness. In every case, the model performs better in our agentic harness than its web chatbot counterpart, the harness that they created.

swyx [00:46:13]: My backwards explanation for that would be that, well, it's meant for consumer use cases, and here you're pushing it for something else.

Micah [00:46:19]: The constraints are different, and the amount of freedom that you can give the model is different. Also, they have a cost goal. We let the models work as long as they want, basically. Yeah.

swyx: Do you copy-paste manually into the chatbot?

Micah: Yeah. That was how we got the chatbot reference. We're not going to be keeping those updated at quite the same scale as hundreds of models.

swyx [00:46:38]: Well, I don't know, talk to Browserbase. They'll automate it for you. You know, I have thought about how we should turn these chatbot versions into an API, because they are legitimately different agents in themselves. Yes. Right. Yeah.

Micah [00:46:53]: And that's grown a huge amount over the last year, right? The tools that are available have actually diverged a fair bit, in my opinion, across the major chatbot apps, and the number of data sources that you can connect them to has gone up a lot, meaning that your experience and the way you're using the model is more different than ever.

swyx [00:47:10]: What tools and what data connections come to mind? What's interesting, what's notable work that people have done?

Micah [00:47:15]: Oh, okay. So my favorite example on this is that until very recently, I would argue that it was basically impossible to get an LLM to draft an email for me in any useful way. Because most times that you're sending an email, you're not just writing something for the sake of writing it. Chances are the context required is a whole bunch of historical emails. Maybe it's notes that you've made, maybe it's meeting notes, maybe it's pulling something from wherever you store stuff at work. For me, that's Google Drive, OneDrive, and our Supabase databases if we need to do some analysis on some data. Preferably the model can be plugged into all of those things and can go do some useful work based on them. The thing I find most impressive currently, that I am somewhat surprised works really well in late 2025, is that I can have models use the Supabase MCP, read-only of course, to run a whole bunch of SQL queries to do pretty significant data analysis and make charts and stuff, and they can read my Gmail and my Notion.

swyx: Okay, you actually use that. That's good. Is that a Claude thing?
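The read-only analysis pattern Micah describes can be approximated without the MCP server at all. Here is a hedged sketch in plain Python using psycopg against a Postgres/Supabase connection string; the DSN, table, and query are placeholders, and this is not the Supabase MCP's implementation.

```python
# Sketch: run analysis SQL against a Postgres/Supabase database while
# guaranteeing the session cannot write. DSN and query are placeholders.
import psycopg

DSN = "postgresql://readonly_user:password@db.example.supabase.co:5432/postgres"

def run_readonly(sql: str):
    with psycopg.connect(DSN) as conn:
        conn.read_only = True  # server rejects writes in this session
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()

rows = run_readonly(
    "SELECT date_trunc('week', created_at) AS week, count(*) "
    "FROM events GROUP BY 1 ORDER BY 1"
)
```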
Micah: To varying degrees, both ChatGPT and Claude right now. I would say that this stuff barely works right now, in fairness.

George [00:48:33]: Because people are actually going to try this after they hear it. If you get an email from Micah, odds are it wasn't written by a chatbot.

Micah [00:48:38]: So, yeah, I think it is true that I have never actually sent anyone an email drafted by a chatbot. Yet.

swyx [00:48:46]: And so you can feel it, right? And yeah, this time next year, we'll come back and see where it's going. Totally. Supabase shout-out, another famous Kiwi. I don't know if you've had any conversations with him about anything in particular on AI building and AI infra.

George [00:49:03]: We have had Twitter DMs with him, because we're quite big Supabase users and power users. And we probably do some things more manually than we should, and the Supabase support line is always super friendly about it. One extra point regarding GDPval-AA: on the basis of the models outperforming the chatbots, we realized that the reference harness we built actually works quite well on generalist agentic tasks; it proves it, in a sense. And the agent harness is very minimalist. I think it follows some of the ideas that are in Claude Code, and all that we give it is context management capabilities, a web search tool, a web browsing tool, and a code execution environment. Anything else?

Micah [00:50:02]: I mean, we can equip it with more tools, but by default, yeah, that's it. For GDPval we give it a tool to view an image, specifically, because the models can just use a terminal to pull stuff in text form into context, but to pull visual stuff into context, we had to give them a custom tool. But yeah, exactly. You can explain it.

George [00:50:21]: So it turned out that we created a good generalist agentic harness, and we released it on GitHub yesterday. It's called Stirrup. So if people want to check it out, it's a great base for building a generalist agent for more specific tasks.

Micah [00:50:39]: I'd say the best way to use it is git clone and then have your favorite coding agent make changes to it to do whatever you want, because it's not that many lines of code, and the coding agents can work with it super well.

swyx [00:50:51]: Well, that's nice for the community to explore and share and hack on. I think in other similar environments, the Terminal-Bench guys have done something like this with Harbor. It's a bundle of, well, we need our minimal harness, which for them is Terminus, and we also need the RL environments or Docker deployment thing to run independently. So I don't know if you've looked at Harbor at all. Is that like a standard that people want to adopt?

George [00:51:19]: Yeah, we've looked at it from an evals perspective, and we love Terminal-Bench and host Terminal-Bench benchmarks on Artificial Analysis. We've looked at it from a coding agent perspective, but could see it being a great basis for any kind of agent. I think where we're getting to is that these models have gotten smart enough.
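Stirrup itself is on GitHub; as a flavor of the minimalist shape George describes (a handful of tools, with the model controlling the flow), here is a toy sketch. The JSON action contract, tool names, and `call_model` hook are illustrative assumptions, not Stirrup's actual design.

```python
# Toy sketch of a minimalist agent loop: the model, not a framework,
# decides which tool to call next and when it is done.
import json
import subprocess

def run_shell(cmd: str) -> str:
    """Run a shell command and return combined stdout/stderr."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return out.stdout + out.stderr

TOOLS = {
    "shell": run_shell,  # stands in for the code execution environment;
                         # the real harness also has web search/browsing
}

def agent_loop(call_model, task: str, max_turns: int = 50) -> str:
    """call_model(history) -> str is an assumed contract: each turn the
    model returns JSON naming a tool call or a final answer."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = json.loads(call_model(history))
        if "answer" in action:  # the model decided it is done
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])
        # crude context management: truncate long tool output
        history.append({"role": "tool", "content": result[:20000]})
    return "max turns reached"
```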
They've gotten better tools, such that they perform better when just given a minimalist set of tools and let run: let the model control the agentic workflow rather than using another framework that's a bit more built out and tries to dictate the flow. Awesome.

swyx [00:51:56]: Let's cover the openness index, and then let's go into the report stuff. So that's the last of the proprietary AA numbers, I guess. I don't know how you classify all these. Yeah.

Micah [00:52:07]: Call it the last of the three new things that we're talking about from the last few weeks. Because we do a mix of stuff: work where we use open source and open-source what we do, and proprietary stuff that we don't always open-source. The long context reasoning data set last year, we did open-source. And of all the work on performance benchmarks across the site, some of them we're looking to open-source, but some of them we're constantly iterating on, and so on. So there's a huge mix of stuff that is open source and not, across the site.

swyx [00:52:41]: So that's LCR, for people. But let's talk about openness.

Micah [00:52:42]: Let's talk about the openness index. This is, call it, a new way to think about how open models are. We have for a long time tracked whether models are open weights and what the licenses on them are. That's pretty useful: it tells you what you're allowed to do with the weights of a model. But there is this whole other dimension to how open models are, one that is pretty important and that we haven't tracked until now, and that's how much is disclosed about how the model was made. So: transparency about data, pre-training data and post-training data, whether you're allowed to use that data, and transparency about methodology and training code. Basically, those are the components. We bring them together to score an openness index for models, so that in one place you can get the full picture of how open models are.

swyx [00:53:32]: I feel like I've seen a couple of other people try to do this, but they're not maintained. I do think this matters. I don't know what the numbers mean, though. Is there a max number? Is this out of 20?

George [00:53:44]: It's out of 18 currently, and we've got an openness index page. Essentially these are points: you get points for being more open across these different categories, and the maximum you can achieve is 18. So Ai2, with their extremely open OLMo 3 32B Think model, is the leader, in a sense.

swyx [00:54:04]: What about Hugging Face?

George [00:54:05]: Oh, with their smaller model. It's coming soon. I think we need to run the intelligence benchmarks to get it on the site.

swyx [00:54:12]: You can't have an openness index and not include Hugging Face. We love Hugging Face.

George: We'll have that up very soon.

swyx: I mean, you know, RefinedWeb and all that stuff. It's amazing. Or is it called FineWeb? FineWeb. FineWeb.

Micah [00:54:23]: Yeah, totally. Yep. One of the reasons this is cool, right, is that if you're trying to understand the holistic picture of the models and what you can do with all the stuff a company is contributing, this gives you that picture. And so we are going to keep it up to date alongside all the models that we run the intelligence index on, on the site.
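To make the scoring concrete, here is a sketch of a points-based openness score like the one described. The category names follow the conversation, but the per-category point values are illustrative assumptions chosen to sum to the stated maximum of 18; they are not Artificial Analysis's actual rubric.

```python
# Sketch of a points-based openness index. Point values are assumptions
# chosen to sum to the stated maximum of 18, not the real rubric.
OPENNESS_COMPONENTS = {
    "weights_released": 3,
    "permissive_license": 3,
    "pretraining_data_disclosed": 3,
    "posttraining_data_disclosed": 3,
    "data_usable": 3,
    "methodology_and_training_code": 3,
}

def openness_index(model_facts: dict) -> int:
    """model_facts maps component name -> bool for a given model."""
    return sum(points for name, points in OPENNESS_COMPONENTS.items()
               if model_facts.get(name, False))

assert sum(OPENNESS_COMPONENTS.values()) == 18
```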
And it's just an extra view to understand.

swyx [00:54:43]: Can you scroll down to the trade-offs chart? Yeah, that one. This really matters, right? Obviously, because you can b
On this episode of Kickin it with the Cru, Neil is on vacation/traveling to the Upper Deck Conference, so Ryan and Tyson discuss the card show in the card shop, why Mr. Holland's Opus is more entertaining than Jaws, and the frantic trading that occurred in the Rbicru7 Fantasy League.
On the latest episode of ‘New Classical Tracks,' Icelandic pianist Víkingur Ólafsson “orbits around” Beethoven's Opus 109 Piano Sonata on his latest project. Listen now with host Julie Amacher!
Our 230th episode with a summary and discussion of last week's big AI news! Recorded on 01/02/2026.

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai. Read our text newsletter and comment on the podcast at https://lastweekin.ai/

In this episode:
Nvidia's acquisition of AI chip startup Groq for $20 billion highlights a strategic move for enhanced inference technology in GPUs.
New York's RAISE Act legislation aims to regulate AI safety, marking the second major AI safety bill in the US.
The launch of GLM-4.7 by Zhipu AI marks a significant advancement in open-source AI models for coding.
Evaluation of long-horizon AI agents raises concerns about the rising costs and efficiency of AI in performing extended tasks.

Timestamps:
(00:00:10) Intro / Banter
(00:01:58) 2025 Retrospective

Tools & Apps
(00:24:39) OpenAI bets big on audio as Silicon Valley declares war on screens | TechCrunch

Applications & Business
(00:26:39) Nvidia buying AI chip startup Groq for about $20 billion, biggest deal
(00:34:28) Exclusive | Meta Buys AI Startup Manus, Adding Millions of Paying Users - WSJ
(00:38:05) Cursor continues acquisition spree with Graphite deal | TechCrunch
(00:39:15) Micron Hikes CapEx to $20B with 2026 HBM Supply Fully Booked; HBM4 Ramps 2Q26
(00:42:06) Chinese fabs are reportedly upgrading older ASML DUV lithography chipmaking machines — secondary channels and independent engineers used to soup up Twinscan NXT series

Projects & Open Source
(00:47:52) Z.AI launches GLM-4.7, new SOTA open-source model for coding
(00:50:11) Evaluating AI's ability to perform scientific research tasks

Research & Advancements
(00:54:32) Large Causal Models from Large Language Models
(00:57:33) Universally Converging Representations of Matter Across Scientific Foundation Models
(01:02:11) Meta-RL Induces Exploration in Language Agents
(01:07:16) Are the Costs of AI Agents Also Rising Exponentially?
(01:11:17) METR eval for Opus 4.5
(01:16:19) How to game the METR plot

Policy & Safety
(01:17:24) New York governor Kathy Hochul signs RAISE Act to regulate AI safety | TechCrunch
(01:20:40) Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
(01:26:46) Monitoring Monitorability
(01:32:07) Sam Altman is hiring someone to worry about the dangers of AI | The Verge
(01:33:38) X users asking Grok to put this girl in bikini, Grok is happy obliging - India Today

See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.
On this episode:
Breaking down the headlines, discussing major developments across promotions and key matches fans are buzzing about
A broad conversation on awards season
Looking ahead to 2026 momentum
Power Five discussion spotlighting standout promotional champions
A fun Mount Rushmore debate stepping outside wrestling with iconic action movies
Making the Drives (new version)

Whether you're catching up after the holidays or planning your wrestling weekend, this episode sets the table for what's shaping up to be a busy and exciting year.
This is a recap of the top 10 posts on Hacker News on January 06, 2026. This podcast was generated by wondercraft.ai

(00:30): Vietnam bans unskippable ads
Original post: https://news.ycombinator.com/item?id=46514677&utm_source=wondercraft_ai

(01:52): enclose.horse
Original post: https://news.ycombinator.com/item?id=46509211&utm_source=wondercraft_ai

(03:14): AWS raises GPU prices 15% on a Saturday, hopes you weren't paying attention
Original post: https://news.ycombinator.com/item?id=46511153&utm_source=wondercraft_ai

(04:36): The Post-American Internet
Original post: https://news.ycombinator.com/item?id=46509019&utm_source=wondercraft_ai

(05:59): 65% of Hacker News posts have negative sentiment, and they outperform
Original post: https://news.ycombinator.com/item?id=46512881&utm_source=wondercraft_ai

(07:21): Opus 4.5 is not the normal AI agent experience that I have had thus far
Original post: https://news.ycombinator.com/item?id=46515696&utm_source=wondercraft_ai

(08:43): C Is Best (2025)
Original post: https://news.ycombinator.com/item?id=46511470&utm_source=wondercraft_ai

(10:05): Why is the Gmail app 700 MB?
Original post: https://news.ycombinator.com/item?id=46514692&utm_source=wondercraft_ai

(11:28): Stop Doom Scrolling, Start Doom Coding: Build via the terminal from your phone
Original post: https://news.ycombinator.com/item?id=46517458&utm_source=wondercraft_ai

(12:50): Show HN: Prism.Tools – Free and privacy-focused developer utilities
Original post: https://news.ycombinator.com/item?id=46511469&utm_source=wondercraft_ai

This is a third-party project, independent from HN and YC. Text and audio generated using AI, by wondercraft.ai. Create your own studio-quality podcast with text as the only input in seconds at app.wondercraft.ai. Issues or feedback? We'd love to hear from you: team@wondercraft.ai
From cofounding LinkedIn to backing OpenAI early, Reid Hoffman is in the habit of being right about the future, so we wanted to know what he saw coming in 2026.

In his third appearance on AI & I, Hoffman lays out his predictions for where AI will go in the 12 months ahead. He talks to Dan Shipper about how agents will break out of coding into other domains and who's winning the coding agent race. They also get into how Hoffman defines artificial general intelligence, the way he believes enterprises will use AI, and why public debate on AI might turn more negative, even as the technology becomes more empowering for individuals.

Hoffman's other bets on the future include cofounding AI drug discovery startup Manas AI, investing at venture capital firm Greylock Partners, writing books, and hosting the Masters of Scale podcast. He's also an investor at Every.

If you found this episode interesting, please like, subscribe, comment, and share!

Want even more? Sign up for Every to unlock our ultimate guide to prompting ChatGPT here: https://every.ck.page/ultimate-guide-to-prompting-chatgpt. It's usually only for paying subscribers, but you can get it here for free.

To hear more from Dan Shipper:
Subscribe to Every: https://every.to/subscribe
Follow him on X: https://twitter.com/danshipper

Timestamps:
00:00:00 - Start
00:00:52 - Introduction
00:02:20 - The future of work is an entrepreneurial mindset
00:05:22 - Creation is addictive (and that's okay)
00:09:22 - Why discourse around AI might get uglier this year
00:17:03 - AI agents will break out of coding in 2026
00:24:18 - What makes Anthropic's Opus 4.5 such a good model
00:28:46 - Who will win the agentic coding race
00:36:13 - Why enterprise AI will finally land this year
00:43:16 - How Hoffman defines AGI
00:55:33 - The most underrated category to watch in AI right now

Links to resources mentioned in the episode:
Reid Hoffman: Reid Hoffman (@reidhoffman)
The AI drug discovery startup Hoffman cofounded: Manas AI
Movie Toaster Adam saw over 215 movies at the theater in 2025. So he came up with a list of his top 25 movies of 2025. Here is his top 10 list. It includes movies such as Bob Trevino Likes It, The Ballad of Wallis Island, Wick is Pain, Opus, Freaky Tales, & more. Stay Toasty!!!
In an online meeting with the Ramana Maharshi Foundation UK on 27th December 2025, Michael answers questions on Bhagavan Ramana's teachings. This episode can be watched as a video on YouTube. A more compressed audio copy in Opus format can be downloaded from MediaFire. Michael's explanations on the original works of Bhagavan can be watched free of advertisements on our Vimeo video channel.

Books by Sri Sadhu Om and Michael James that are currently available on Amazon:
By Sri Sadhu Om: ► The Path of Sri Ramana (English)
By Michael James: ► Happiness and Art of Being (English) ► Lyckan och Varandets Konst (Swedish) ► Anma-Viddai (English)
The above books are also available in other regional Amazon marketplaces worldwide. - Sri Ramana Center of Houston
Dec. 26-Jan. 1: All the newest words added to the dictionary, the Kennedy Center Honors before a certain someone got involved, R-rated stop-motion, grumpy painters, murderous tennis stars, our final recommendations for the year, and the end of a 10-year podcast journey through pop culture. All that and more from 30, 20, and 10 years ago.