Mushrooms. So cute. So tasty. So very toxic. Humans have loved and feared mushrooms since before written history. In this episode, Hallie discusses the tumultuous history of this relationship, and a few of the deadliest mushrooms in the world. For patrons of the MCP, this episode is ad-free and extended. Join the MCP Patreon community.
We discuss the evolution of 4 Dice Attacks in MCP, where they fall short, and where they are exceptional.
Lil AI productivity secret: we've become the duct tape for AI.
In this episode, the discussion revolves around Breez's innovative SDK and its nodeless implementation, which simplifies the integration of Bitcoin and Lightning into applications. The guests share their experiences from the 'Time to Build' challenge, highlighting the ease of use and the potential for new applications in the Bitcoin ecosystem. Brianna discusses her social events platform, Evento, and how it leverages the Breez SDK to facilitate peer-to-peer value exchange. Aljaz shares insights on developing a BTC Pay plugin that enhances payment processing without the need for a full Lightning node. The conversation also touches on user experience design, the role of vibe coding in development, and the growing excitement around Bitcoin and Layer 2 solutions.
Scott and Wes sit down with Kent C. Dodds to break down MCP, context engineering, and what it really takes to build effective AI-powered tools. They dig into practical examples, UI patterns, performance tradeoffs, and whether the future of the web lives in chat or the browser.
Show Notes
00:00 Welcome to Syntax!
00:44 Introduction to Kent C. Dodds
02:44 What is MCP?
03:28 Context Engineering in AI
04:49 Practical Examples of MCP
06:33 Challenges with Context Bloat
08:08 Brought to you by Sentry.io
09:37 Why not give AI API access directly?
12:28 How is an MCP different from Skills
14:58 MCP optimizations and efficiency levers
16:24 MCP UI and Its Importance
19:18 Where are we at today with MCP
24:06 What is the development flow for building MCP servers?
27:17 Building out an MCP UI
29:29 Returning HTML, when to render
36:17 Calling tools from your UI
37:25 What is Goose?
38:42 Are browsers cooked? Is everything via chat?
43:25 Remix3
47:21 Sick Picks & Shameless Plugs
Sick Picks
Kent: OneWheel
Shameless Plugs
Kent: http://EpicAI.pro, http://EpicWeb.dev, http://EpicReact.dev
Hit us up on Socials!
Syntax: X Instagram Tiktok LinkedIn Threads
Wes: X Instagram Tiktok LinkedIn Threads
Scott: X Instagram Tiktok LinkedIn Threads
Randy: X Instagram YouTube Threads
TR is joined by a team of MCP staff to debrief after visits with over 40 NYC schools to talk about the incredible work that's happening in modern classrooms throughout the city.
Show Notes
MCP Podcast Episode 156: Progress Vs. Grades (Paul's previous appearance on the podcast; here's a shortcast version)
Contact us, follow us online, and learn more:
Email us questions and feedback at: podcast@modernclassrooms.org
Listen to this podcast on Youtube
Modern Classrooms: @modernclassproj on Twitter and facebook.com/modernclassproj
Kareem: @kareemfarah23 on Twitter
Toni Rose: @classroomflex on Twitter and Instagram
The Modern Classroom Project
Modern Classrooms Online Course
Take our free online course, or sign up for our mentorship program to receive personalized guidance from a Modern Classrooms mentor as you implement your own modern classroom!
The Modern Classrooms Podcast is edited by Zach Diamond: Learning to Teach.
Special Guests: Kimberly Nichols, Lauren Lutz-Coleman, Paul O'Donoghue, and Tony Xiao.
Join us as Sam demonstrates how to teach AI to write Terraform configurations using Model Context Protocol (MCP) servers. Sam introduces the Terraform MCP server and walks through practical demos showing how AI can understand and safely interact with your infrastructure. You'll see live examples of AI planning, generating, and evolving Terraform configurations, from creating landing zones to setting up workspace variables automatically. Whether you're managing complex multi-cloud environments or just getting started with infrastructure as code, this episode demonstrates how MCP servers bridge the gap between AI capabilities and real-world Terraform workflows. Learn how to get started, which Claude models work best for different tasks, and best practices for integrating AI into your IaC pipelines.
Timestamps
0:00 Welcome & Introduction
4:37 Sam McGeown's Background
6:02 Introduction to Terraform MCP Server
12:35 What is Model Context Protocol?
18:22 Setting Up the Terraform MCP Server
24:16 Demo: Claude Desktop Integration
30:41 Creating Infrastructure with AI Prompts
36:52 Reading & Analyzing Existing Terraform Code
42:18 Generating Landing Zone Configurations
47:35 Working with Terraform Workspaces
50:37 Creating Variables Automatically
52:14 Model Selection: Sonnet vs Opus
55:11 Live Demo: Workspace Variable Creation
58:33 Getting Started & Resources
How to find Sam: https://www.linkedin.com/in/sammcgeown/
Links from the show: https://developer.hashicorp.com/terraform/mcp-server
Podcasting 2.0 January 23rd 2026 Episode 248: "Thingamajigger" Adam & Dave talk re-decentralization, MCPs and Slop
Show Notes
We are LIT
Vibe Coding the playout system for Rodecaster
Rodecaster integration
The Playout System
PI-Monitor
Congrats to Sam on closing his funding round
James' podcast comments
Transcript Search
What is Value4Value? - Read all about it at Value4Value.info
V4V Stats
Last Modified 01/23/2026 14:11:02 by Freedom Controller
Join us as Gautam breaks down the evolution of tool use in generative AI and dives deep into MCP. Gautam walks through the progression from simple prompt engineering to function calling, structured outputs, and now MCP—explaining why MCP matters and how it's changing the way AI systems interact with external tools and data. You'll learn about the differences between MCP and traditional API integrations, how to build your first MCP server, best practices for implementation, and where the ecosystem is heading. Whether you're building AI-powered applications, integrating AI into your infrastructure workflows, or just trying to keep up with the latest developments, this episode provides the practical knowledge you need. Gautam also shares real-world examples and discusses the competitive landscape between various AI workflow approaches. Subscribe to vBrownBag for weekly tech education covering AI, cloud, DevOps, and more!
Timestamps
0:00 Introduction & Welcome
7:28 Gautam's Background & Journey to AI Product Management
12:45 The Evolution of Tool Use in AI
18:32 What is Model Context Protocol (MCP)?
24:16 MCP vs Traditional API Integrations
30:41 Building Your First MCP Server
36:52 MCP Server Discovery & Architecture
42:18 Real-World Use Cases & Examples
47:35 Best Practices & Implementation Tips
51:12 The Competitive Landscape: Skills, Extensions, & More
52:14 Q&A: AI Agents & Infrastructure Predictions
55:09 Closing & Giveaway
How to find Gautam: https://gautambaghel.com/ https://www.linkedin.com/in/gautambaghel/
Links from the show: https://www.hashicorp.com/en/blog/build-secure-ai-driven-workflows-with-new-terraform-and-vault-mcp-servers
Presentation from HashiConf: https://youtu.be/eamE18_WrW0?si=9AJ9HUBOy7-HlQOK
Kiro Powers: https://www.hashicorp.com/en/blog/hashicorp-is-a-kiro-powers-launch-partner
Slides: https://docs.google.com/presentation/d/11dZZUO2w7ObjwYtf1At4WnL-ZPW1QyaWnNjzSQKQEe0/edit?usp=sharing
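For a concrete picture of the "Building Your First MCP Server" segment above, here's a minimal hedged sketch using the official Python SDK's FastMCP helper. The server name and the add tool are invented for illustration, not something Gautam builds in the episode:

```python
# Minimal MCP server sketch (assumes the official `mcp` Python SDK,
# installable via: pip install "mcp[cli]"). Names here are illustrative.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")  # server name shown to connecting clients

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers and return the sum."""
    return a + b

if __name__ == "__main__":
    # Runs over stdio by default, the transport most local MCP clients use.
    mcp.run()
```

Once a server like this is registered with an MCP-capable client, the model can discover and call `add` like any other tool, which is the step up from hand-rolled function calling that the episode describes.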
The Model Context Protocol is your USB-C for AI API tool calling. Ed and Clay talk about MCP, introduced by Anthropic, the rapid adoption of the protocol across AI vendors, the opportunities it creates, as well as the headaches it induces for IT.
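The USB-C analogy is worth making concrete: because the protocol is standardized, one generic client can plug into any MCP server and discover its tools with the same calls. A hedged sketch using the official Python SDK's client API (`server.py` is a placeholder for any stdio MCP server):

```python
# Generic MCP client sketch (assumes the official `mcp` Python SDK).
# The same handshake and discovery calls work against any compliant server.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Placeholder: launch any stdio MCP server as a subprocess.
params = StdioServerParameters(command="python", args=["server.py"])

async def main() -> None:
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()          # standard MCP handshake
            tools = await session.list_tools()  # identical for every server
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```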
The January 22 edition of the AgNet News Hour tackled a growing concern many California growers know all too well—foreign competition flooding the market during peak domestic seasons. Hosts Nick Papagni and Josh McGill focused the conversation on California pears, featuring an interview with Chris Zanobini, Executive Director of the California Pear Advisory Board, who laid out why the state's pear industry is fighting for survival.

Zanobini explained that California's pear industry is relatively small, with only about 60 growers remaining—many of them fifth- and sixth-generation family farmers. Pear orchards can remain productive for decades, meaning these farms represent long-term investment and deep roots in rural communities. But now, he says, the industry's short and carefully managed marketing window is being disrupted by imported pears arriving at the worst possible time. Harvest of California pears typically begins in early July, and growers aim to finish shipping by late October to avoid competing with other domestic pear-growing regions like Oregon and Washington. The problem, Zanobini said, is that Argentine pears are coming into the U.S. in heavy volume during June, July, and even into September, right when California is trying to sell its crop. The result is a market that starts the season already flooded, with retailers delaying California programs by weeks.

One major concern Zanobini highlighted is a product commonly used overseas called 1-MCP, a ripening inhibitor that allows pears to store for an extremely long time, but often prevents them from ripening properly. That can lead to a poor consumer experience—hard, disappointing fruit that hurts pear demand overall. California, he noted, made a commitment years ago not to use 1-MCP because of its impact on eating quality.

The competitive imbalance comes down to cost. Zanobini said California growers face the highest production standards in the world—labor, chemical restrictions, water requirements, and environmental compliance—yet they aren't paid extra for meeting those standards. Imported pears, meanwhile, can arrive cheaper by $5 to $10 per box, making them attractive to retailers focused on price and margins. Zanobini also shared a jaw-dropping stat: more than 1.3 million boxes of Argentine pears entered the U.S., exceeding California's production of Bartlett pears this year—California's primary variety. He said the industry can't tolerate that trajectory much longer, and without change, more multi-generation pear farms could disappear.

Papagni and McGill pointed out that this isn't just a pear problem—it's a California agriculture problem, impacting everything from citrus to tomatoes to raisins. Their message to listeners was simple: California growers need a fair playing field, and consumers can help by asking for domestic fruit and supporting local farmers when it's in season.
One of the biggest mistakes in AI? Thinking that your company's AI use is noteworthy. Or, even a competitive advantage. It's not. We break it down in Volume 3 of our 'Start Here Series.' AI as an Operating System: LLMs Are the Internet Now -- An Everyday AI Chat with Jordan Wilson
Newsletter: Sign up for our free daily newsletter
More on this Episode: Episode Page
Join the discussion on LinkedIn: Thoughts on this? Join the convo on LinkedIn and connect with other AI leaders.
Upcoming Episodes: Check out the upcoming Everyday AI Livestream lineup
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Connect with Jordan on LinkedIn
Topics Covered in This Episode:
AI As An Operating System Explained
Large Language Models Replace Traditional Apps
AI Integration in Knowledge Work Platforms
Choosing the Right AI Operating System
Microsoft Copilot vs. Google Gemini vs. Claude vs. ChatGPT
Agentic Browsers Powering Autonomous Workflows
Model Context Protocol (MCP) for AI Agents
Orchestration Layer and Agent Collaboration
ChatGPT Apps Merging AI and Internet
Enterprise Data Integration with AI Tools
Context Switching Reduction Through AI Agents
Strategic AI Adoption and Platform Redundancy
Timestamps:
00:00 "AI: A New Operating System"
03:58 "AI Transforming Work Interfaces"
06:41 "Collaborating in AI-Native Workspaces"
12:25 Anthropic's Innovations in AI Tools
13:46 "OpenAI's Strategy and Market Focus"
18:02 "Cognitive Evolution Through AI"
20:57 "Agentic Browsers: Key 2025 Advancement"
25:12 Improving Content Through Data Insights
26:42 "Anthropic's MCP: The AI Connector"
32:19 "AI Tools for Productivity Integration"
34:20 "AI: Unlocking Context and Efficiency"
36:32 AI Governance and System Portability
39:35 "AI Operating System Insights"
Keywords: AI operating system, large language models, LLMs, AI as infrastructure, enterprise AI, AI adoption, agentic workflows, AI agents, orchestration layer, Copilot, Microsoft 365 Copilot, Google Gemini, Gemini business, Gemini enterprise, Anthropic Claude, Claude cowork, MCP, model context protocol, OpenAI, ChatGPT, ChatGPT apps, ChatGPT business, ChatGPT enterprise, AI native, dynamic data integration, productivity with AI, collaboration tools, agentic browsers, autonomous AI agents, context window, memory and personalization, expert-driven loops, app hop tax, context switching, AI integration in business, AI tools for teams, AI platform selection, data governance, modular AI workflows, permissions and audit logs, backup and redundancy in AI, competitive advantage with AI
Send Everyday AI and Jordan a text message. (We can't reply back unless you leave contact info)
Ready for ROI on GenAI? Go to youreverydayai.com/partner
Carla the Ogre, extensions, Crashfix, Gemini, ChatGPT Health, Dark AI, MCP, Joshua Marpet, and More on the Security Weekly News. Visit https://www.securityweekly.com/swn for all the latest episodes! Show Notes: https://securityweekly.com/swn-548
Topics covered in this episode:
Better Django management commands with django-click and django-typer
PSF Lands a $1.5 million sponsorship from Anthropic
How uv got so fast
PyView Web Framework
Extras
Joke
Watch on YouTube
About the show
Sponsored by us! Support our work through:
Our courses at Talk Python Training
The Complete pytest Course
Patreon Supporters
Connect with the hosts
Michael: @mkennedy@fosstodon.org / @mkennedy.codes (bsky)
Brian: @brianokken@fosstodon.org / @brianokken.bsky.social
Show: @pythonbytes@fosstodon.org / @pythonbytes.fm (bsky)
Join us on YouTube at pythonbytes.fm/live to be part of the audience. Usually Monday at 11am PT. Older video versions available there too. Finally, if you want an artisanal, hand-crafted digest of every week of the show notes in email form, add your name and email to our friends of the show list; we'll never share it.
Brian #1: Better Django management commands with django-click and django-typer
Lacy Henschel
Extend Django manage.py commands for your own project, for things like data operations, API integrations, complex data transformations, and development and debugging. Extending is built into Django, but it looks easier, takes less code, and is more fun with either django-click or django-typer, two projects supported through Django Commons.
Michael #2: PSF Lands a $1.5 million sponsorship from Anthropic
Anthropic is partnering with the Python Software Foundation in a landmark funding commitment to support both security initiatives and the PSF's core work. The funds will enable new automated tools for proactively reviewing all packages uploaded to PyPI, moving beyond the current reactive-only review process. The PSF plans to build a new dataset of known malware for capability analysis. The investment will sustain programs like the Developer in Residence initiative, community grants, and infrastructure like PyPI.
Brian #3: How uv got so fast
Andrew Nesbitt
It's not just because "it's written in Rust". Recent-ish standards, PEPs 518 (2016), 517 (2017), 621 (2020), and 658 (2022), made many uv design decisions possible. And uv drops many backwards-compatibility behaviors that pip keeps. Dropping functionality speeds things up. "Speed comes from elimination. Every code path you don't have is a code path you don't wait for." Some of what uv does could be implemented in pip. Some cannot. Andrew discusses different speedups, why they could be done in Python also, or why they cannot. I read this article out of interest. But it gives me lots of ideas for tools that could be written faster just with Python by making design and support decisions that eliminate whole workflows.
Michael #4: PyView Web Framework
PyView brings the Phoenix LiveView paradigm to Python. Recently interviewed Larry on Talk Python. Build dynamic, real-time web applications using server-rendered HTML. Check out the examples. See the Maps demo for some real magic. How does this possibly work? See the LiveView Lifecycle.
Extras
Brian: Upgrade Django has a great discussion of how to upgrade version by version, why you might want to do that instead of just jumping ahead to the latest version, and who might want to save time by leapfrogging. It also has all the versions and dates of release and end of support. The Lean TDD book 1st draft is done. Now available through both pythontest and LeanPub. I set it as 80% done because of future drafts planned. I'm working through a few submitted suggestions. Not much feedback, so the 2nd pass might be fast and mostly my own modifications. It's possible.
I'm re-reading it myself and already am disappointed with page 1 of the introduction. I gotta make it pop more. I'll work on that. Trying to decide how many suggestions around using AI I should include. It's not mentioned in the book yet, but I think I need to incorporate some discussion around it.
Michael: Python: What's Coming in 2026
Python Bytes rewritten in Quart + async (very similar to Talk Python's journey)
Added a proper MCP server at Talk Python To Me (you don't need a formal MCP framework btw; see the sketch below)
Example one: latest-episodes-mcp.png
Example two: which-episodes-mcp.webp
Implemented /llms.txt for Talk Python To Me (see talkpython.fm/llms.txt)
Joke: Reverse Superman
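To unpack that "no formal framework needed" parenthetical: under the hood, MCP is just JSON-RPC 2.0 messages over a transport such as stdio, so a server can be a small hand-rolled loop. Here's a hedged toy sketch in Python, not Talk Python's actual implementation; the tool name is invented, and a real server must also answer the initialize handshake and tools/call requests:

```python
# Toy MCP-without-a-framework sketch (illustrative only).
# MCP's stdio transport carries newline-delimited JSON-RPC 2.0 messages,
# so answering a client's tools/list request takes a few lines of stdlib code.
import json
import sys

TOOLS = [{
    "name": "latest_episodes",  # hypothetical tool name for illustration
    "description": "Return the most recent podcast episodes.",
    "inputSchema": {"type": "object", "properties": {}},
}]

for line in sys.stdin:
    request = json.loads(line)
    if request.get("method") == "tools/list":
        response = {
            "jsonrpc": "2.0",
            "id": request["id"],
            "result": {"tools": TOOLS},
        }
        sys.stdout.write(json.dumps(response) + "\n")
        sys.stdout.flush()
    # A production server would also handle "initialize", "tools/call",
    # notifications, and error responses per the MCP spec.
```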
In this episode, I sit down with Professor Ras Mic for a beginner-friendly crash course on using Claude Code (and AI coding agents in general) without feeling overwhelmed by the terminal. We break down why your output is only as good as your inputs and how thinking in features + tests turns "vague app ideas" into real, shippable products. Ras walks me through a better planning workflow using Claude Code's Ask User Question Tool, which forces clarity on UI/UX decisions, trade-offs, and technical constraints before you build. We also talk about when not to use "Ralph" automation, why context windows matter, and how taste + audacity are the real differentiators in 2026 software.
Timestamps
00:00 – Intro
01:22 – Claude Code Best Practices
05:31 – Claude Code Plan Mode
09:30 – The Ask User Question Tool
14:52 – Don't start with Ralph automation (get reps first)
16:36 – What are "Ralph loops" and why plans and documentation matter most
18:41 – Ras's Ralph setup: progress tracking + tests + linting
23:48 – Tips & tricks: don't obsess over MCP/skills/plugins
27:44 – Scroll-stopping software wins
Key Points
Your results improve fast when you treat AI agents like junior engineers: clear inputs → clean outputs.
The biggest unlock is planning in features + tests, not broad product descriptions.
Claude Code's Ask User Question Tool forces real clarity on workflow, UI/UX, costs, and technical decisions.
If you haven't shipped anything, don't hide behind automation—build manually before using "Ralph."
Context management matters: long sessions can degrade quality, so restart earlier than you think.
Numbered Section Summaries
The Real Reason People Get "AI Slop": I frame the episode around a simple idea: if you feed agents sloppy instructions, you'll get sloppy output. Ras explains that models are now good enough that the failure mode is usually unclear inputs, not model quality.
How To Think Like A Product Builder (Features First): Ras pushes a practical mindset: don't describe "the product," describe the features that make the product real. If you can list the core features clearly, you can actually direct an agent to build them correctly.
The Missing Piece: Tests Between Features: We talk about the shift from "generate code" to "build something serious." The move is writing and running tests after each feature, so you don't stack feature two on top of a broken feature one.
Why Default Planning Mode Isn't Enough: Ras shows the standard flow: open plan mode, ask Claude to write a PRD, and get a basic roadmap. The issue is it leaves too many assumptions—especially around UI/UX and workflow details.
The Ask User Question Tool (The Planning Upgrade): This is the big unlock. Ras demonstrates how the Ask User Question Tool interrogates you with increasingly specific questions (workflow, cost handling, database/hosting, UI style, storage, etc.) so the plan becomes dramatically more precise.
Spend Time Upfront Or Pay For It Later: We connect the dots: better planning reduces back-and-forth, reduces token burn, and prevents "I built the app but it's not what I wanted." The interview-style planning forces trade-offs early instead of late.
Don't Use Ralph Until You've Built Without It: Ras makes a strong case for reps: if you can't ship something end-to-end yet, automation won't save you—it'll just move faster in the wrong direction. Build feature-by-feature manually first, then graduate to loops.
Practical Tips: Context Discipline + Taste Wins: Ras shares a few operational habits: don't obsess over tools like MCP/plugins, keep context usage under control, and restart sessions before quality degrades. We wrap on a bigger point: in 2026, "audacity + taste" is what makes software stand out.
The #1 tool to find startup ideas/trends - https://www.ideabrowser.com
LCA helps Fortune 500s and fast-growing startups build their future - from Warner Music to Fortnite to Dropbox. We turn 'what if' into reality with AI, apps, and next-gen products https://latecheckout.agency/
The Vibe Marketer - Resources for people into vibe marketing/marketing with AI: https://www.thevibemarketer.com/
FIND ME ON SOCIAL
X/Twitter: https://twitter.com/gregisenberg
Instagram: https://instagram.com/gregisenberg/
LinkedIn: https://www.linkedin.com/in/gisenberg/
FIND MIC ON SOCIAL
X/Twitter: https://x.com/Rasmic
Youtube: https://www.youtube.com/@rasmic
Join Simtheory: https://simtheory.ai
Join the most average AI LinkedIn group: https://www.linkedin.com/groups/16562039/
It's 2026 and everyone's having an existential crisis. In this episode, we unpack the two camps dominating AI X/Twitter: hype boys claiming "Claude Code can do my washing" vs. software developers doom-scrolling themselves into career panic. We put the agentic hype to the test and discover that no, you can't actually run 8 agents recreating your local business ecosystem while you sleep. Plus, we reflect on why MCP is exhausting, why Gemini 3 Pro is somehow worse than Gemini 2.5 Pro, and why Geoffrey Hinton would rather write his book than answer questions in Tasmania. Also featuring: the $200,000/month enterprise AI problem, why SaaS isn't dead (but it's scared), and our prediction that AI workspaces will become the everything app.
CHAPTERS:
00:00 Intro - Unpacking the 2026 AI Vibes
02:21 Putting Claude Code and Agentic Hype to the Test
05:57 Why Twitter AI Demos Never Show the Receipts
07:03 Honest Assessment of Where Frontier Models Are At
11:19 Building the Everything App with Email, Calendar and Files
16:47 Collaborative Mode vs Agentic Delegation in Practice
21:29 The Real Cost of Enterprise AI at Scale
24:32 Why Cheaper Models Like Haiku and Gemini Flash Matter
29:25 Is SaaS Actually Dead or Just Disrupted
38:11 The Future of AI Platforms, SDKs and App Stores
43:35 The Untapped Opportunity in Paid Proprietary MCPs
51:21 Geoffrey Hinton Refuses to Take Questions in Tasmania
55:05 2026 Plans and the Still Relevant Tour Announcement
Thanks for listening. Like & Sub. xoxox
AI doesn't break security, it exposes where it was already fragile. When automation starts making decisions faster than humans can audit, AppSec becomes the only thing standing between scale and catastrophe. In this episode, Ron sits down with Joshua Bregler, Senior Security Manager at McKinsey's QuantumBlack, to dissect how AI agents, pipelines, and dynamic permissions are reshaping application security. From prompt chaining attacks and MCP server sprawl to why static IAM is officially obsolete, this conversation gets brutally honest about what works, what doesn't, and where security teams are fooling themselves.
Impactful Moments
00:00 – Introduction
02:15 – AI agents create identity chaos
04:00 – Static permissions officially dead
07:05 – AI security is still AppSec
09:30 – Prompt chaining becomes invisible attack
12:23 – Solving problems vs solving AI
15:03 – Ethics becomes an AI blind spot
17:47 – Identity is the next security failure
20:07 – Frameworks no longer enough alone
26:38 – AI fixing insecure code in real time
32:15 – Secure pipelines before production
Connect with our Guest
Joshua Bregler on LinkedIn: https://www.linkedin.com/in/breglercissp/
Our Links
Check out our upcoming events: https://www.hackervalley.com/livestreams
Join our creative mastermind and stand out as a cybersecurity professional: https://www.patreon.com/hackervalleystudio
Love Hacker Valley Studio? Pick up some swag: https://store.hackervalley.com
Continue the conversation by joining our Discord: https://hackervalley.com/discord
Become a sponsor of the show to amplify your brand: https://hackervalley.com/work-with-us/
The AI gap will kill companies. What is it? It's the large divide between AI's crazy impressive capabilities and what most companies are actually using them for. And one of the biggest reasons for the AI gap? Talking. Like... no one understands how to talk about AI because the technology changes faster than Usain Bolt in Beijing. You wanna talk to your AI team about LLMs? PFT. They're running Ralph Wiggum loops in Claude Code and just kinda reading the code before it hits production. Yeah, the divide is WIIIIIDE. So we're gonna tackle it together on the second volume of our Starter Series: AI Without the Jargon: The AI Language Every Business Leader Needs to Live By in 2026 -- An Everyday AI Chat with Jordan Wilson
Newsletter: Sign up for our free daily newsletter
More on this Episode: Episode Page
Join the discussion on LinkedIn: Thoughts on this? Join the convo on LinkedIn and connect with other AI leaders.
Upcoming Episodes: Check out the upcoming Everyday AI Livestream lineup
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Connect with Jordan on LinkedIn
Topics Covered in This Episode:
AI Jargon Barrier for Business Leaders
Generative AI Basics and Lingo Breakdown
ChatGPT, Claude, and Gemini Model Comparison
Understanding Tokens and Context Windows
Large Language Models: Prompt to Outcome
Parameters, Model Power, and Cost Implications
Retrieval Augmented Generation (RAG) Explained
Embeddings, Vector Databases, and Chunking
Agentic Models vs. Transformer Models
AI Risks: Hallucinations, Prompt Injection, Guardrails
Model Context Protocol (MCP) and Connectors
Scaffolding for Complex AI Workflows
AI Success: ROI, Risk, and Implementation Strategies
Timestamps:
00:00 "Join Start Here Series Community"
03:21 Bridging AI and Business Leaders
09:35 Partnering for Generative AI Success
12:26 "AI Models Operate Using Tokens"
15:41 "Shift to Smaller AI Models"
18:56 "Understanding RAG and Its Impact"
22:38 "AI Tools Connecting via MCP"
24:56 "Minimizing AI Hallucinations Effectively"
28:45 "Fast, Careful AI Implementation"
30:53 "AI Guide for Business Leaders"
Keywords: AI language, AI jargon, artificial intelligence terminology, AI lingo, large language model, generative AI, prompt engineering, context engineering, context window, tokens, tokenization, model architecture, model parameters, neural network connections
Send Everyday AI and Jordan a text message. (We can't reply back unless you leave contact info)
Ready for ROI on GenAI? Go to youreverydayai.com/partner
Who
Jimmy Ackerson, General Manager of Corralco, Chile
Recorded on
July 24, 2025
About Corralco
Click here for a mountain stats overview
Located in: Curacautín, Araucanía, Chile
Year founded: 2003, by Enrique Bascur
Pass affiliations: Indy Pass, Indy+ Pass – 2 days, no blackouts
Base elevation: 4,724 feet (1,440 meters)
Summit elevation: 7,874 feet (2,400 meters) top of lifts; 9,400 feet (2,865 meters) hike-to
Vertical drop: 3,150 feet (960 meters) lift-served; 4,676 feet (1,425 meters) hike-to
Skiable acres: 2,475 acres lift served; 4,448 acres (1,800 hectares), including hike-to terrain
Average annual snowfall: 354 inches (899 cm)
Trail count: 34
Lift count: 7 (1 high-speed quad, 1 double, 5 J-bars)
Why I interviewed him
The Andes run the length of South America, 4,300 miles from the southern tip of Argentina north to Venezuela. It is the longest continental mountain range on Earth, nearly six times the length of the Alps and 1,300 miles longer than the Rockies. It is the highest mountain range outside of Asia, topping out at 22,841 feet on Mount Aconcagua, more than a mile higher than the tallest point in the Rockies (14,439-foot Mount Elbert) or Alps (15,772-foot Mont Blanc).

So this ought to be one hell of a ski region, right? If the Alps house more than 500 ski areas and the Rockies several hundred, then the Andes ought to at least be in the triple digits?

Surprisingly, no. Of the seven nations transected by the Andes, only Argentina and Chile host outdoor, lift-served ski areas. Between the two countries, I'm only able to assemble a list of 37 ski areas, 33 of which skiresort.info categorizes as "temporarily closed" – a designation the site typically reserves for outfits that have not operated over the past several seasons.

For skiers hoping to live eternal winter by commuting to the Upside Down each May through October, this roster may be a bit of a record scratch. There just aren't that many ski areas in the Southern Hemisphere. Outside of South America, the balance – another few dozen total – sit in Australia and New Zealand, with scattered novelties such as Afriski lodged at the top of Lesotho. There are probably more ski areas in New England than there are south of the equator.

That explains why the U.S.-based multimountain ski passes have been slow to move into the Southern Hemisphere – there isn't much there to move into. Ikon and Mountain Collective each have just one destination on the continent, and it's the same destination: Valle Nevado. Epic offers absolutely nothing in South America.

Even with few options, Vail moved south a decade ago with its purchase of Perisher, Australia's largest ski area. That English-speaking nation was a logical first pass frontier, but the five Kangaroo resorts claimed by the Epic and Ikon passes are by far the five largest in the country, and they're a long-haul flight from America. New Zealand is similarly remote, with more but generally less-developed ski areas, and Ikon has established a small presence there.

But South America remains mostly wide open, despite its obvious appeal to North Americans: the majesty of the Andes, the novelty of summer skiing, and direct flights with no major timezone hopping required. Mountain Capital Partners has dropped anchor in Chile, purchasing Valle Nevado in 2023, neighboring La Parva the following year, and bidding for also-neighboring El Colorado in 2025 (that sale is pending regulatory review).

But perhaps it's time for a broader invasion.
Last March, Indy Pass added Corralco as its first South American – and first Southern Hemisphere – ski area. That, as Ackerson and I discuss in the podcast, could be just the start of Indy's ambitions for a continent-spanning (or at least, Argentina- and Chile-spanning) resort network.

So this is a good time to start getting to know Chilean skiing. And Ackerson, longtime head of the Chilean Ski Areas Association, former leader of Chilean giants Portillo and Valle Nevado, and a Connecticut-born transplant who has been living the upside-down life for more than 50 years, is probably better suited than anyone on the planet to give us that intro.

What we talked about
Reverse ski seasons; why Corralco draws (and retains) so much more snow than any other ski area in Chile; no snowmaking; Corralco as training ground for national ski teams; the logistics of moving a high-speed quad from Holiday Valley, New York to the Chilean Andes; rebuilding a lift as a longer machine; how that lift transformed Corralco; new lift, new alignment; the business impact of replacing a double chair with a high-speed quad; how a dude who grew up in Connecticut with non-skiing parents ended up running a ski area in South America; Chile's allure; Portillo; Chilean skiing past and present; Corralco's founding and evolution; shrinking South American ski areas; Mountain Capital Partners (MCP) buying four more ski areas in Chile after purchasing Valle Nevado in 2023 and La Parva in 2024; the Americans are coming; why La Parva, Valle Nevado, and El Colorado "have to be consolidated" for the benefit of future skiing in Chile; MCP's impact on Chilean skiing so far; "the culture is very different here" both on the hill and off; MCP's challenges as they settle into Chilean skiing; why Corralco joined Indy Pass; a potential Indy Pass network in South America; and getting to Corralco from the U.S., from airplane to access road – "we have no switchbacks."

What I got wrong
* In the intro, I said that it was the "heart of ski season in South America." This was true when we recorded this conversation in July 2025. It's not true in January 2026, when the Chilean ski season is long over.
* I said the highest peak in Chile only received a few inches of snow per year and didn't retain it, but I couldn't remember the name of the peak – it is 22,615-foot Ojos del Salado.
* I gave new stats for Corralco's high-speed quad, but did not mention where those stats came from – my source was skiresort.info, which catalogues a 4,921-foot length and 1,148-foot vertical drop for the lift, both substantially longer than the 4,230-foot length and 688-foot vertical rise that Lift Blog documents for the antecedent Mardi Gras lift at Holiday Valley, New York. We discuss the logistics and mechanics of moving this machine from North to South America and extending it in the pod. Here are a few pics of this machine I took in New York in January 2022:

Podcast Notes
On Corralco's evolving footprint
Corralco is a new-ish ski area, at least insofar as public access goes. The 2008 trailmap shows a modest vertical drop served by surface lifts:
But growth has been rapid, and by 2022, the ski area resembled modern Corralco, which is now an international training center for athletes:
On Camp Jewell, Connecticut
Ackerson learned to ski on a two-tow bump called Camp Jewell, a YMCA center in Connecticut.
NELSAP has some fun info on this defunct ski area, including photos of what's left of the lifts.
On Sigi Grottendorfer
Ackerson's conduit to South American skiing came in the form of Austrian-born Sigi Grottendorfer, who led the ski schools at both Sugarbush, Vermont and Portillo, Chile. He passed away in 2023 – The Valley Reporter ran an obituary with more info on Grottendorfer's expansive and colorful life.
On Chile "five years after the coup had occurred"
We reference past political instability in Chile, referring to the 1973 coup that launched the military dictatorship of the notorious Augusto Pinochet. The nation transitioned back to democracy in 1990 and is considered safe and stable for tourists by the U.S. State Department.
On Portillo
We discuss Portillo, a Chilean ski area whose capacity limits and weeklong ski-and-stay packages result in Windham-is-private-style (it's not) confusion. Skiers can visit Portillo on a day pass. Lift tickets are all of $68. Still, the hotel experience is, by all accounts, pretty rad. Here's the bump:
On previous podcasts
We mention a few previous podcast guests who had parallels to Ackerson's story. Bogus Basin GM Brad Wilson also left skiing for several years to run a non-ski resort:
Longtime Valle Nevado GM Ricardo Margolis appeared on this podcast in 2023:
On the shrinking of Volcán Osorno and Pillán
I won't reset the entire history here, but I broke down the slow shrinkage of Volcán Osorno and Pillán ski areas when Mountain Capital Partners bid to purchase them last year:
On Kamori Kankō buying Heavenly
For a brief period, Japanese company Kamori Kankō owned Steamboat and Heavenly. The company sold both to American Skiing Company in 1997, and they eventually split owners, with Heavenly joining Vail's roster in 2002, and Steamboat now part of Alterra by way of Intrawest. Today, Kamori Kankō appears to operate five ski areas in Japan, all in Hokkaido, most notably Epic Pass partner Rusutsu:
On MCP's free season passes for kids 12 and under
One pretty cool thing that Mountain Capital Partners has brought to Chile from its U.S. HQ is free season passes for kids 12 and under. It's pretty incredible:
On Sugarbush
Ackerson worked for a long time at Sugarbush, an Alterra staple and one of the best overall ski areas in New England. It's a fully modern resort, with the exception of the knockout Castle Rock terrain, which still spins a double chair on all-natural snow:
On skiing El Colorado
We discuss the insane, switchbacking access road up to El Colorado/La Parva/Valle Nevado from Santiago:
The route up to Corralco is far more suited to mortals:
The Storm explores the world of lift-served skiing year-round. Join us. Get full access to The Storm Skiing Journal and Podcast at www.stormskiing.com/subscribe
In this Episode, Will is joined by Rightmad (Ben) for a look at the recently released Spectrum and Blue Marvel! The guys start by looking at Monica Rambeau and all she brings to the tabletop, including a new Leadership for Marvel: Crisis Protocol in the form of The Mighty Avengers. Then the crew takes a look at the one and only Blue Marvel and discusses whether this might just be the most perfectly balanced character in MCP. Lastly, Will announces a MASSIVE #giveaway so make sure to check that out! Enjoy!
Baron of Dice - HouseParty for 5% off!
Patreon and Merch and more!
Krydrufi Hobby Station Thing USE CODE: KRYDRUFI-HPP
Connect with us on Facebook @housepartyprotocol
HPP on Youtube
Discord - HPP_Will
Email us - housepartyprotocolpod@gmail.com
BattleKiwi - PARTYKIWI
The Gamer's Guild
Mike & Tommy explore Microsoft's new MCP servers for Power BI and Fabric, questioning whether teams should immediately enable these AI agent integrations or take a more measured, governed approach. They break down the security risks, governance implications, and practical rollout strategies for letting AI agents interact with semantic models.
https://blog.fabric.microsoft.com/en-US/blog/introducing-fabric-mcp-public-preview/
https://learn.microsoft.com/en-us/power-bi/developer/mcp/mcp-servers-overview
Get in touch:
Send in your questions or topics you want us to discuss by tweeting to @PowerBITips with the hashtag #empMailbag or submit on the PowerBI.tips Podcast Page.
Visit PowerBI.tips: https://powerbi.tips/
Watch the episodes live every Tuesday and Thursday morning at 730am CST on YouTube: https://www.youtube.com/powerbitips
Subscribe on Spotify: https://open.spotify.com/show/230fp78XmHHRXTiYICRLVv
Subscribe on Apple: https://podcasts.apple.com/us/podcast/explicit-measures-podcast/id1568944083
Check Out Community Jam: https://jam.powerbi.tips
Follow Mike: https://www.linkedin.com/in/michaelcarlo/
Follow Tommy: https://www.linkedin.com/in/tommypuglia/
Amir (Co-Founder at Humblytics) shares how he builds an "AI-native" company by focusing less on shiny tools and more on change management: assessing AI fluency across roles, setting the right success metrics, and creating shared context so AI can reliably ship work. The big theme is convergence—engineering, product, and design are collapsing into tighter loops thanks to tools like Cursor, MCP connectors, and Figma Make. Amir demos workflows like: AI-generated context files + auto-updated documentation, scraping customer domains to infer ICPs, turning screenshots into layered Figma designs, then converting Figma to working React code in minutes, and even running an "AI co-founder" Slack bot that files Linear tickets and can hand work to agents.
Timestamps
0:00 Introduction
0:06 Amir's stance: "no AI experts" — it's constant learning in a fast-changing field.
1:59 Cursor as the unlock: not just coding, but PM/strategy/design work via MCPs.
4:17 The real problem: AI adoption is mostly change management + fluency assessment.
5:18 The AI fluency rubric (helper → automator → augmentor → agentic) and why it matters.
8:13 Cursor analytics: measuring AI-generated code and usage across the team.
9:24 "New code is ~99% AI-generated" + how they keep quality via tight review + incremental changes.
10:58 Docs workflow: GitBook connected to repo → AI edits docs and pushes live fast.
14:02 ICP building: export Stripe customers → scrape domains with Firecrawl → cluster personas.
17:45 Hallucination in the wild: AI misclassifies a company; human correction loop matters.
34:43 Wild move: they often design in code and use an AI-generated style guide to stay consistent.
38:10 Best demo: screenshot → Figma Make → layered design → Figma MCP → React code in minutes.
45:29 "AI co-founder" Slack bot (Pixel): turns a bug report into a Linear ticket and can hand off to agents.
48:46 Amir's wish list: we "solved dev"; now we need Cursor for marketing/sales → path to $1M ARR.
Tools & technologies mentioned
Cursor — AI-first IDE used for coding and product/design/strategy workflows; includes team analytics.
MCP (Model Context Protocol) — "connector" layer (Anthropic-origin) that lets LLMs interface with external tools/services.
ChatGPT — used as a common baseline tool; discussed in the context of prompting practices and workflows.
Microsoft Copilot — referenced via the law firm incentive story; used as an example of "usage metrics" gone wrong.
Anthropic (AI fluency framework) — inspiration source for the helper/automator/augmentor/agentic rubric.
GitBook — documentation platform connected to the repo so docs can be updated and published quickly.
Firecrawl (MCP) — agentic web scraper used to analyze customer domains and infer ICP/personas.
Stripe — source of customer export data (domains) to build ICP clustering.
Figma — design collaboration tool; used here with Make + MCP to move from design → code.
Figma Make — feature to recreate UI from an image/screenshot into editable, layered designs.
Figma MCP — connector that allows Cursor/LLMs to pull Figma components/designs and generate code.
React — front-end framework used in the demo for generating functional UI components.
Supabase — mentioned as part of a sample stack when generating a PRD.
React Router — mentioned as part of the sample stack in PRD generation.
Slack — where Amir runs internal agents (including the "AI co-founder" bot).
Linear — project management tool used for creating tickets from Slack/agent workflows.
CI/CD — their deployment/review pipeline; emphasized as the human accountability layer.
Subscribe at thisnewway.com to get the step-by-step playbooks, tools, and workflows.
SHOW: 992
SHOW TRANSCRIPT: The Cloudcast #992 Transcript
SHOW VIDEO: https://youtube.com/@TheCloudcastNET
NEW TO CLOUD? CHECK OUT OUR OTHER PODCAST - "CLOUDCAST BASICS"
SHOW NOTES:
Tonic.ai website
Tonic Validate Product Page
Tonic Validate GitHub
Topic 1 - Adam, welcome to the show. Give everyone a brief introduction.
Topic 2 - Our topic today is RAG systems, specifically RAG in production. Let's start with customization sources and types. When it comes to customizing off-the-shelf LLMs, RAG is one option, as is an MCP connection to a SQL database, and there is pre- and post-training, as well as fine-tuning. How does an organization decide what path is best for customization?
Topic 3 - RAG came on the scene as the savior for organizations that want custom AI without the need for fine-tuning and additional training. It has either gone through or is currently still in the trough of disillusionment. What are your thoughts on RAG's evolution and the challenges it faces?
Topic 4 - Let's walk through the basics of validation. Once you set up RAG, how would an organization know it works? How is accuracy measured and validated? Are you looking for hallucinations? Context quality?
Topic 5 - What is Tonic Validate, and where does it fit into this stack? Is it in band? Out of band? Built into the CI workflow?
Topic 6 - Accuracy is one aspect, but we hear more and more about ROI for Enterprises. How should ROI, risk, and compliance be measured?
Topic 7 - Where and how does security fit into all of this? Also, your thoughts on synthetic data for training vs. real data?
Topic 8 - If anyone is interested, what's the best way to get started?
FEEDBACK?
Email: show at the cloudcast dot net
Bluesky: @cloudcastpod.bsky.social
Twitter/X: @cloudcastpod
Instagram: @cloudcastpod
TikTok: @cloudcastpod
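As a rough anchor for Topic 4, validation usually starts with scoring the pipeline against a small golden set of question/answer pairs. The sketch below is a generic illustration of that idea, not Tonic Validate's API; `ask_rag` and the golden set are placeholders for whatever pipeline you are testing:

```python
# Generic RAG-accuracy sketch (illustrative; not Tonic Validate's API).

def ask_rag(question: str) -> str:
    """Placeholder for the RAG pipeline under test."""
    raise NotImplementedError("call your RAG pipeline here")

GOLDEN_SET = [  # small reference set with known-good answers
    {"question": "What year was the company founded?", "answer": "2019"},
    {"question": "Who is the current CEO?", "answer": "Jane Doe"},
]

def accuracy(golden_set) -> float:
    hits = 0
    for case in golden_set:
        got = ask_rag(case["question"])
        # Naive containment check; real evaluations typically use an LLM
        # judge or embedding similarity, and also score context quality
        # and hallucination rate, as discussed in the episode.
        hits += int(case["answer"].lower() in got.lower())
    return hits / len(golden_set)
```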
In this episode, my guest is industry veteran and WAV Group president Kevin Hawkins, also known as the Real AI Guy. We discuss how agents and brokerages can succeed in the new age of AI. Kevin covers GEO, building AI visibility, paid vs. free tools, practical agent workflows, and why brokers must prepare for MCP-based systems and safer AI adoption across their organizations.
Book Recommendation: The Real AI Guide for Real Estate Agents by Kevin Hawkins - https://a.co/d/jdQnNwz
Guest: Kevin Hawkins
Facebook - https://www.facebook.com/wavgroup
LinkedIn - https://www.linkedin.com/in/kevhawkins/
Website - https://www.wavgroup.com
AI Newsletter - https://realai.blog
Host: Rajeev Sajja
Website: http://www.realestateaiflash.com
Facebook: https://www.facebook.com/rsajja
Instagram: http://www.instagram.com/rajeev_sajja
LinkedIn: http://www.linkedIn.com/in/rsajja
Resources:
AI Playbook - http://www.realestateaiflash.com
$10 off for Plaud AI Notetakers for Podcast listeners - https://realestateaiflash.com/partners/
Join our Instagram Real Estate AI Insiders Channel - https://ig.me/j/AbZCJG37DqBPPtxi/
Subscribe to our weekly AI Newsletter: https://realestateai-flash.beehiiv.com/subscribe
Join the Real Estate AI Academy Wait list - https://realestateaiflash.com/academy
Google just made two massive moves in 48 hours, and together, they could reshape how AI interacts with commerce forever.
First: the Universal Commerce Protocol (UCP), an open standard that lets AI agents shop on your behalf. Discovery, checkout, payments, post-purchase, the whole journey, with one common language. Backed by Shopify, Walmart, Target, Visa, Mastercard, and 20+ others.
Second: a multi-year deal with Apple. Gemini will power the next generation of Apple Intelligence, including Siri. That's Google's AI running on 2 billion Apple devices.
In this episode, I break down what UCP actually is, how it works, why the Apple deal matters, and what this means for merchants, developers, and anyone building for the agentic web.
CHAPTERS
00:00 – The Anthony Joshua smile meme (and what it has to do with Google)
02:59 – The landscape: AI agents, fragmentation, and the assistant wars
06:36 – What is UCP? Universal Commerce Protocol explained
11:04 – Who's backing UCP and what it enables today
14:21 – The Apple-Gemini deal: what it means
17:55 – Why Apple chose Google (and what happens to OpenAI)
21:00 – Connecting the dots: Google's full strategy
24:00 – What this means for merchants and developers
26:30 – The bigger picture: who controls the agentic web?
29:07 – Closing thoughts
LINKS
UCP Documentation: https://ucp.dev
UCP GitHub: https://github.com/Universal-Commerce-Protocol/ucp
Google's UCP Announcement: https://blog.google/products/ads-commerce/agentic-commerce-ai-tools-protocol-retailers-platforms/
Google-Apple Joint Statement: https://blog.google/company-news/inside-google/company-announcements/joint-statement-google-apple/
Shopify's UCP Deep-Dive: https://shopify.engineering/UCP
KEYWORDS/TAGS
Google, UCP, Universal Commerce Protocol, AI agents, agentic commerce, e-commerce, Apple Intelligence, Gemini, Siri, AI shopping, MCP, Shopify, OpenAI, retail technology
---
If you want to go deeper on this, I write about it weekly at No Hacks Substack. Subscribe!
On the latest Security Visionaries podcast, host Emily Wearmouth invites Steve Riley back to demystify another key acronym in the AI world: MCP, or model context protocol. They break down what MCP actually is, how it functions, and why it is critical to understanding AI interactions. From there, they pivot into a discussion about the critical security implications of MCP, covering the risks of unauthenticated servers and the necessity of building both an authorization layer and an external policy layer to ensure granular access, data security, and compliance as AI agents proliferate. As always, this conversation with Steve offers illuminating context and perspective on one of the key security terms going into 2026. You won't want to miss it.
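The authorization and policy layers Steve describes can be pictured as a gate that every tool call passes through before it reaches an MCP server. Below is a deliberately simplified sketch; the agent identities, tool names, and policy table are all invented for illustration, and real deployments enforce this in an authenticated gateway with audit logging:

```python
# Toy external policy layer for MCP tool calls (illustrative only).
POLICY = {  # which agent identity may invoke which tools
    "support-agent": {"search_tickets"},
    "finance-agent": {"search_tickets", "read_invoices"},
}

def authorize(agent_id: str, tool_name: str) -> None:
    """Deny-by-default check, run before forwarding a tools/call request."""
    allowed = POLICY.get(agent_id, set())
    if tool_name not in allowed:
        raise PermissionError(f"{agent_id} may not call {tool_name}")

# Usage: authorize("support-agent", "read_invoices") raises PermissionError,
# so the request never reaches the upstream MCP server.
```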
Smersh Pod LIVE from The Cheerful Earful Podcast Festival, way back in October of 2025. We'll be travelling back to the exciting days of 1982 and uploading ourselves into a 'computer' to indulge in a spot of cycling, a bit of frisbee, and a whole lot of RAM BAM THANK YOU MAM, while also hoping no one minds if we nip out for a quick MCP. Yes, it's TRON. And joining me to whip out his eight-inch floppy is Paul Litchfield. Hosted on Acast. See acast.com/privacy for more information.
Context is Everything
On this episode, host Adam Turinas dives into the Model Context Protocol (MCP). While it sounds technical, understanding MCP is actually a critical business strategy. It acts as the rules of engagement that allow your AI to access the right data, tools, and brand guidelines safely and effectively. He explains why context architecture is going to matter much more than which specific AI model you use, and how getting this right is essential for risk management and accuracy in the healthcare industry. If you want to move beyond generic AI outputs to creating a true competitive advantage, this episode breaks down exactly how to do it. Find all of our network podcasts on your favorite podcast platforms and be sure to subscribe and like us. Learn more at www.healthcarenowradio.com/listen
This week on the PHP Podcast, Eric and John talk about Welcome to 2026, Denmark stops postal services, New Laravel employee, PHPTek Early Bird ending soon, the pains of making a living off open source, PHP is Back according to Nuno, and more…
Links from the show:
Danish postal service to stop delivering letters after 400 years | Denmark | The Guardian
MergePHP: Mastering Agentic PHP Development with MCP, Thu, Jan 8, 2026, 5:00 PM | Meetup
https://x.com/wendell_adriel/status/2008168133362618776
PHP TEK 2026
Adam's Morning Walk | We had six months left
AI Code Reviews | CodeRabbit | Try for Free
feat: add llms.txt endpoint for LLM-optimized documentation by quantizor · Pull Request #2388 · tailwindlabs/tailwindcss.com · GitHub
Why PHP in 2026?
https://laravelfortherestofus.com/
The PHP Podcast streams the recording of this podcast live, typically every Thursday at 3 PM PT. Come join us and subscribe to our YouTube channel.
X: https://x.com/phparch
Mastodon: https://phparch.social/@phparch
Bluesky: https://bsky.app/profile/phparch.com
Discord: https://discord.phparch.com
Subscribe to our magazine: https://www.phparch.com/subscribe/
Hosts:
Eric Van Johnson | X: @shocm | Mastodon: @eric@phparch.social | Bluesky: @ericvanjohnson.bsky.social
John Congdon | X: @johncongdon | Mastodon: @john@phparch.social | Bluesky: @johncongdon.bsky.social
Streams: Youtube Channel, Twitch
Partners
This podcast is made a little better thanks to our partners:
Displace - Infrastructure Management, Simplified. Automate Kubernetes deployments across any cloud provider or bare metal with a single command. Deploy, manage, and scale your infrastructure with ease. https://displace.tech/
PHPScore - Put Your Technical Debt on Autopay with PHPScore
CodeRabbit - Cut code review time & bugs in half instantly with CodeRabbit.
Honeybadger.io - Honeybadger helps you deploy with confidence and be your team's DevOps hero by combining error, uptime, and performance monitoring in one simple platform. Check it out at honeybadger.io
Music Provided by Epidemic Sound - https://www.epidemicsound.com/
The post The PHP Podcast 2026.01.08 appeared first on PHP Architect.
Move Move Throw is a Marvel Crisis Protocol podcast hosted by Charles (Omnus) and Danny. We are now posting all episodes on YouTube! We talk about what a Danny and Charles cop show would be like to warm things up, and then the main topic is the idea that it takes 10,000 hours to become an expert: how it's more complicated than that, and how it all applies to a game like MCP.
AI isn't quietly changing software development… it's rewriting the rules while most security programs are still playing defense. When agents write code at machine speed, the real risk isn't velocity, it's invisible security debt compounding faster than teams can see it. In this episode, Ron Eddings sits down with Varun Badhwar, Co-Founder & CEO of Endor Labs, and Henrik Plate, Principal Security Researcher at Endor Labs, to break down how AI-assisted development is reshaping the software supply chain in real time. From MCP servers exploding across GitHub to agents trained on insecure code patterns, they analyze why traditional AppSec controls fail in an agent-driven world and what must replace them. This conversation pulls directly from Endor Labs' 2025 State of Dependency Management Report, revealing why most AI-generated code is functionally correct yet fundamentally unsafe, how malicious packages are already exploiting agent workflows, and why security has to exist inside the IDE, not after the pull request.
Impactful Moments
00:00 – Introduction
02:00 – Star Wars meets cybersecurity culture
03:00 – Why this report matters now
04:00 – MCP adoption explodes overnight
10:00 – Can you trust MCP servers
12:00 – Malicious packages weaponize agents
14:00 – Code works, security fails
22:00 – Hooks expose agent behavior
28:30 – 2026 means longer lunches
33:00 – How Endor Labs fixes this
Links
Connect with Varun on LinkedIn: https://www.linkedin.com/in/vbadhwar/
Connect with Henrik on LinkedIn: https://www.linkedin.com/in/henrikplate/
Check out Endor Labs' State of Dependency Management 2025: https://www.endorlabs.com/lp/state-of-dependency-management-2025
Check out our upcoming events: https://www.hackervalley.com/livestreams
Join our creative mastermind and stand out as a cybersecurity professional: https://www.patreon.com/hackervalleystudio
Love Hacker Valley Studio? Pick up some swag: https://store.hackervalley.com
Continue the conversation by joining our Discord: https://hackervalley.com/discord
Become a sponsor of the show to amplify your brand: https://hackervalley.com/work-with-us/
Happy New Year! You may have noticed that in 2025 we had moved toward YouTube as our primary podcasting platform. As we'll explain in the next State of Latent Space post, we'll be doubling down on Substack again and improving the experience for the over 100,000 of you who look out for our emails and website updates!We first mentioned Artificial Analysis in 2024, when it was still a side project in a Sydney basement. They then were one of the few Nat Friedman and Daniel Gross' AIGrant companies to raise a full seed round from them and have now become the independent gold standard for AI benchmarking—trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities.We have chatted with both Clementine Fourrier of HuggingFace's OpenLLM Leaderboard and (the freshly valued at $1.7B) Anastasios Angelopoulos of LMArena on their approaches to LLM evals and trendspotting, but Artificial Analysis have staked out an enduring and important place in the toolkit of the modern AI Engineer by doing the best job of independently running the most comprehensive set of evals across the widest range of open and closed models, and charting their progress for broad industry analyst use.George Cameron and Micah-Hill Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is “open” really?We discuss:* The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after Swyx's retweet* Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers* The mystery shopper policy: they register accounts not on their own domain and run intelligence + performance benchmarks incognito to prevent labs from serving different models on private endpoints* How they make money: enterprise benchmarking insights subscription (standardized reports on model deployment, serverless vs. managed vs. 
leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)* The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs* Omniscience Index (hallucination rate): scores models from -100 to +100 (penalizing incorrect answers, rewarding “I don't know”), and Claude models lead with the lowest hallucination rates despite not always being the smartest* GDP Val AA: their version of OpenAI's GDPval (44 white-collar tasks with spreadsheets, PDFs, PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system), graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias)* The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2's OLMo leads, followed by Nous Hermes and NVIDIA Nemotron)* The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents)* Why sparsity might go way lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omniscience accuracy correlates with total parameters (not active), suggesting massive sparse models are the future* Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at using more tokens only when needed (GPT-5.1 Codex has tighter token distributions)* V4 of the Intelligence Index coming soon: adding GDP Val AA, Critical Point, hallucination rate, and dropping some saturated benchmarks (HumanEval-style coding is now trivial for small models)Links to Artificial Analysis* Website: https://artificialanalysis.ai* George Cameron on X: https://x.com/georgecameron* Micah Hill-Smith on X: https://x.com/micahhsmithFull Episode on YouTubeTimestamps* 00:00 Introduction: Full Circle Moment and Artificial Analysis Origins* 01:19 Business Model: Independence and Revenue Streams* 04:33 Origin Story: From Legal AI to Benchmarking Need* 16:22 AI Grant and Moving to San Francisco* 19:21 Intelligence Index Evolution: From V1 to V3* 11:47 Benchmarking Challenges: Variance, Contamination, and Methodology* 13:52 Mystery Shopper Policy and Maintaining Independence* 28:01 New Benchmarks: Omniscience Index for Hallucination Detection* 33:36 Critical Point: Hard Physics Problems and Research-Level Reasoning* 23:01 GDP Val AA: Agentic Benchmark for Real Work Tasks* 50:19 Stirrup Agent Harness: Open Source Agentic Framework* 52:43 Openness Index: Measuring Model Transparency Beyond Licenses* 58:25 The Smiling Curve: Cost Falling While Spend Rising* 1:02:32 Hardware Efficiency: Blackwell Gains and Sparsity Limits* 1:06:23 Reasoning Models and Token Efficiency: The Spectrum Emerges* 1:11:00 Multimodal Benchmarking: Image, Video, and Speech Arenas* 1:15:05 Looking Ahead: Intelligence Index V4 and Future Directions* 1:16:50 Closing: The Insatiable Demand for IntelligenceTranscriptMicah [00:00:06]: This is kind of a full circle moment for us in a way, because the first time Artificial Analysis got mentioned on a podcast was you and Alessio on Latent Space. Amazing.swyx [00:00:17]: Which was January 2024. I don't even remember doing that, but yeah, it was very influential to me. 
Yeah, I'm looking at AI News for Jan 17, or Jan 16, 2024. I said, this gem of a models and host comparison site was just launched. And then I put in a few screenshots, and I said, it's an independent third party. It clearly outlines the quality versus throughput trade-off, and it breaks out by model and hosting provider. I did give you s**t for missing Fireworks — how do you have a model benchmarking thing without Fireworks? But you had Together, you had Perplexity, and I think we just started chatting there. Welcome, George and Micah, to Latent Space. I've been following your progress. Congrats on... It's been an amazing year. You guys have really come together to be the presumptive new Gartner of AI, right? Which is something that...George [00:01:09]: Yeah, but you can't pay us for better results.swyx [00:01:12]: Yes, exactly.George [00:01:13]: Very important.Micah [00:01:14]: Start off with a spicy take.swyx [00:01:18]: Okay, how do I pay you?Micah [00:01:20]: Let's get right into that.swyx [00:01:21]: How do you make money?Micah [00:01:24]: Well, very happy to talk about that. So it's been a big journey the last couple of years. Artificial Analysis is going to be two years old in January 2026, which is pretty soon now. We run the website for free, obviously, and give away a ton of data to help developers and companies navigate AI and make decisions about models, providers, and technologies across the AI stack for building stuff. We're very committed to doing that and intend to keep doing that. We have, along the way, built a business that is working out pretty sustainably. We've got just over 20 people now and two main customer groups. So we want to be who enterprises look to for data and insights on AI, so we want to help them with their decisions about models and technologies for building stuff. And then on the other side, we do private benchmarking for companies throughout the AI stack who build AI stuff. So no one pays to be on the website. We've been very clear about that from the very start, because there's no use doing what we do unless it's independent AI benchmarking. Yeah. But it turns out a bunch of our stuff can be pretty useful to companies building AI stuff.swyx [00:02:38]: And is it like, I am a Fortune 500, I need advisors on objective analysis, and I call you guys and you pull up a custom report for me, you come into my office and give me a workshop? What kind of engagement is that?George [00:02:53]: So we have a benchmarking and insight subscription, which looks like standardized reports that cover key topics or key challenges enterprises face when looking to understand AI and choose between all the technologies. And so, for instance, one of the reports is a model deployment report: how to think about choosing between serverless inference, managed deployment solutions, or leasing chips and running inference yourself — an example of the kind of decision that big enterprises face, and it's hard to reason through, because this AI stuff is really new to everybody. And so with our reports and insight subscription we try to help companies navigate that. We also do custom private benchmarking. And so that's very different from the public benchmarking that we publicize, and there's no commercial model around that. For private benchmarking, we'll at times create benchmarks, run benchmarks to specs that enterprises want. And we'll also do that sometimes for AI companies who have built things, and we help them understand what they've built with private benchmarking. 
Yeah. So that's a piece that we've mainly developed through trying to support everybody publicly with our public benchmarks. Yeah.swyx [00:04:09]: Let's talk about the tech stack behind that. But okay, I'm going to rewind all the way to when you guys started this project. You were all the way in Sydney? Yeah. Well, Sydney, Australia for me.Micah [00:04:19]: George was in SF — but he's Australian, he had just moved here already. Yeah.swyx [00:04:22]: And I remember I had the Zoom call with you. What was the impetus for starting Artificial Analysis in the first place? You know, you started with public benchmarks. And so let's start there. We'll get to the private benchmarks. Yeah.George [00:04:33]: Why don't we even go back a little bit to, like, why we, you know, thought that it was needed? Yeah.Micah [00:04:40]: The story kind of begins in 2022, 2023. Both George and I had been into AI stuff for quite a while. In 2023 specifically, I was trying to build a legal AI research assistant. It actually worked pretty well for its era, I would say. Yeah. Yeah. So I was finding that the more you go into building something using LLMs, the more each bit of what you're doing ends up being a benchmarking problem. So I had, like, this multistage algorithm thing, trying to figure out what the minimum viable model for each bit was, trying to optimize every bit of it as you build that out, right? Like, you're trying to think about accuracy, a bunch of other metrics, and performance and cost. And mostly, just no one was doing anything to independently evaluate all the models — and certainly not to look at the trade-offs for speed and cost. So we basically set out just to build a thing that developers could look at to see the trade-offs between all of those things, measured independently across all the models and providers. Honestly, it was probably meant to be a side project when we first started doing it.swyx [00:05:49]: Like, we didn't get together and say, hey, we're going to stop working on all this other stuff, this is going to be our main thing. When I first called you, I think you hadn't decided on starting a company yet.Micah [00:05:58]: That's actually true. I don't even think we'd paused anything. Like, George still had his day job. I didn't quit working on my legal AI thing. It was genuinely a side project.George [00:06:05]: We built it because we needed it as people building in the space, and thought, oh, other people might find it useful too. So we bought a domain, linked it to the Vercel deployment that we had, and tweeted about it. And very quickly it started getting attention. Thank you, Swyx, for, I think, doing an initial retweet and spotlighting this project that we released. It was useful to others, but very quickly it became more useful as the number of models released accelerated. We had Mixtral 8x7B, and it was a key one. That's a fun one. Yeah. Like, an open source model that really changed the landscape and opened up people's eyes to other serverless inference providers, and to thinking about speed, thinking about cost. And so it became more useful quite quickly. Yeah.swyx [00:07:02]: What I love about talking to people like you who sit across the ecosystem is, well, I have theories about what people want, but you have data, and that's obviously more relevant. But I want to stay on the origin story a little bit more. 
When you started out, I would say the status quo at the time was: every paper would come out and they would report their numbers versus competitor numbers. And that's basically it. And I remember I did the legwork. I think everyone had some version of an Excel sheet or a Google Sheet where you just copy and paste the numbers from every paper and post it up there. And then sometimes they don't line up, because they're independently run — and your reproductions of other people's numbers are going to look worse than theirs, because you don't hold their models correctly, or whatever the excuse is. I think then Stanford HELM, Percy Liang's project, would also have some of these numbers. And I don't know if there's any other source that you can cite. If I were to start Artificial Analysis at the same time you guys started, I would have used EleutherAI's eval harness. Yup.Micah [00:08:06]: Yup. That was some cool stuff. At the end of the day, running these evals — if it's a simple Q&A eval, all you're doing is asking a list of questions and checking if the answers are right, which shouldn't be that crazy. But it turns out there are an enormous number of things that you've got to control for. And I mean, back when we started the website. Yeah. Yeah. Like, one of the reasons why we realized that we had to run the evals ourselves and couldn't just take results from the labs was just that they would all prompt the models differently. And when you're competing over a few points, then you can pretty easily get- You can put the answer into the model. Yeah. That, in the extreme. And, like, you get crazy cases — like back when Google made Gemini 1.0 Ultra and needed a number that would say it was better than GPT-4, and constructed, I think never published, chain-of-thought examples — 32 of them — in every topic in MMLU to run it, to get the score. Like, there are so many things that you- They never shipped Ultra, right? That's the one that never made it out. Not widely. Yeah. Yeah. Yeah. I mean, I'm sure it existed, but yeah. So we were pretty sure that we needed to run them ourselves and just run them in the same way across all the models. Yeah. And we were also certain from the start that you couldn't look at those in isolation. You needed to look at them alongside the cost and performance stuff. Yeah.swyx [00:09:24]: Okay. A couple of technical questions. I mean, so obviously I also thought about this, and I didn't do it because of cost. Yep. Did you not worry about costs? Were you funded already? Clearly not, but you know. No. Well, we definitely weren't at the start.Micah [00:09:36]: So, I mean, we were paying for it personally at the start. That's a lot of money. Well, the numbers weren't nearly as bad a couple of years ago. So we certainly incurred some costs, but we were probably in the order of hundreds of dollars of spend across all the benchmarking that we were doing. Yeah. So nothing. Yeah. It was kind of fine. Yeah. Yeah. These days that's gone up an enormous amount, for a bunch of reasons that we can talk about. But yeah, it wasn't that bad, because you've also got to remember that the number of models we were dealing with was hardly any, and the complexity of the stuff that we wanted to do to evaluate them was a lot less. Like, we were just asking some Q&A type questions. And one specific thing was that for a lot of evals initially, we were just sampling an answer. 
You know, like, what's the answer for this? Going straight to the answer without letting the models think — we weren't even doing chain-of-thought stuff initially. And that was the most useful way to get some results at the time. Yeah.swyx [00:10:33]: And so for people who haven't done this work, literally parsing the responses is a whole thing, right? Because the models can answer any way they see fit, and sometimes they actually do have the right answer, but they just returned it in the wrong format, and they will get a zero for that unless you work it into your parser. And that involves more work. And so, I mean, there's an open question whether you should give it points for not following your instructions on the format.Micah [00:11:00]: It depends what you're looking at, right? Because if you're trying to see whether or not it can solve a particular type of reasoning problem, and you don't want to test it on its ability to do answer formatting at the same time, then you might want to use an LLM-as-answer-extractor approach to make sure that you get the answer out no matter how it's answered. But these days, it's mostly less of a problem. Like, if you instruct a model and give it examples of what the answers should look like, it can get the answers in your format, and then you can do, like, a simple regex.swyx [00:11:28]: Yeah, yeah. And then there's other questions around, I guess — sometimes if you have a multiple choice question, there's a bias towards the first answer, so you have to randomize the responses. All these nuances — once you dig into benchmarks, you're like, I don't know how anyone believes the numbers on all these things. It's such dark magic.Micah [00:11:47]: You've also got the different degrees of variance in different benchmarks, right? Yeah. So, if you run a four-option multiple-choice eval on a modern reasoning model at the temperatures suggested by the labs for their own models, the variance that you can see is pretty enormous if you only do a single run of it — especially if it has a small number of questions. So, like, one of the things that we do is run an enormous number of all of our evals when we're developing new ones and doing upgrades to our intelligence index to bring in new things. Yeah. So that we can dial in the right number of repeats, so that we can get to the 95% confidence intervals that we're comfortable with — so that when we pull it all together, we can be confident in the intelligence index to at least as tight as, like, plus or minus one at 95% confidence. Yeah.swyx [00:12:32]: And, again, that just adds a straight multiple to the cost. Oh, yeah. Yeah, yeah.George [00:12:37]: So, that's one of many reasons that cost has gone up a lot more than linearly over the last couple of years. We report a cost to run the Artificial Analysis Intelligence Index on our website, and currently that's assuming one repeat in terms of how we report it, because we want it to reflect a bit about the weighting of the index. But our cost is actually a lot higher than what we report there, because of the repeats.swyx [00:13:03]: Yeah, yeah, yeah. And probably this is true, but just checking: you don't have any special deals with the labs. They don't discount it. You just pay out of pocket, or out of your sort of customer funds. Oh, there is a mix. 
So, the issue is that sometimes they may give you a special endpoint, which is… Ah, 100%.Micah [00:13:21]: Yeah, yeah, yeah. Exactly. So, we laser focus, on everything we do, on having the best independent metrics and making sure that no one can manipulate them in any way. There are quite a lot of processes we've developed over the last couple of years to make that true. Like the one you bring up right here: if we're working with a lab, and they're giving us a private endpoint to evaluate a model, it is totally possible that what's sitting behind that black box is not the same as what they serve on a public endpoint. We're very aware of that. We have what we call a mystery shopper policy. And we're totally transparent with all the labs we work with about this: we will register accounts not on our own domain and run both intelligence evals and performance benchmarks… Yeah, that's the job. …without them being able to identify it. And no one's ever had a problem with that. Because a thing that turns out to actually be quite a good factor in the industry is that they all want to believe that none of their competitors could manipulate what we're doing either.swyx [00:14:23]: That's true. I never thought about that. I was in the database industry prior, and there's a lot of shenanigans around benchmarking, right? So I'm just kind of going through the mental laundry list. Did I miss anything else in this category of shenanigans? Oh, potential shenanigans.Micah [00:14:36]: I mean, okay, the biggest one that I'll bring up is more of a conceptual one, actually, than direct shenanigans. It's that the things that get measured become the things that get targeted by the labs in what they're trying to build, right? Exactly. So that doesn't mean anything that we should really call shenanigans — I'm not talking about training on the test set. But if you know that you're going to be graded on a particular thing, if you're a researcher, there are a whole bunch of things that you can do to try to get better at that thing — things that preferably are going to be helpful for the wide range of ways actual users want to use the thing that you're building, but will not necessarily do that. So, for instance, the models are exceptional now at answering competition maths problems. There is some relevance of that type of reasoning, that type of work, to, like, how we might use modern coding agents and stuff. But it's clearly not one for one. So the thing that we have to be aware of is that once an eval becomes the thing that everyone's looking at, scores can get better on it without that reflecting the overall generalized intelligence of these models getting better. That has been true for the last couple of years. It'll be true for the next couple of years. There's no silver bullet to defeat that, other than building new stuff to stay relevant and measure the capabilities that matter most to real users. Yeah.swyx [00:15:58]: And we'll cover some of the new stuff that you guys are building as well, which is cool. Like, you used to just run other people's evals, but now you're coming up with your own. And I think, obviously, that is a necessary path once you're at the frontier — you've exhausted all the existing evals. I think the next point in history that I have for you is AI Grant, which you guys decided to join, moving here. What was it like? I think you were in, like, batch two? Batch four. Batch four. 
Okay.Micah [00:16:26]: I mean, it was great. Nat and Daniel are obviously great. And it's a really cool group of companies that we were in AI Grant alongside. It was really great to get Nat and Daniel on board. Obviously, they've done a whole lot of great work in the space with a lot of leading companies, and they were extremely aligned with the mission of what we were trying to do. Like, we're not quite typical of a lot of the other AI startups that they've invested in.swyx [00:16:53]: And they were very much here for the mission of what we want to do. Did they give any advice that really affected you in some way, or was one of the events very impactful? That's an interesting question.Micah [00:17:03]: I mean, I remember fondly a bunch of the speakers who came and did fireside chats at AI Grant.swyx [00:17:09]: Which is also, like, a crazy list. Yeah.George [00:17:11]: Oh, totally. Yeah, yeah, yeah. There was something about, you know, speaking to Nat and Daniel about the challenges of working through a startup and working through the questions that don't have clear answers — how to work through those methodically, and just work through the hard decisions. And they've been great mentors to us as we've built Artificial Analysis. Another benefit for us was that other companies in the batch, and other companies in AI Grant, are pushing the capabilities of what AI can do at this time. And so being in contact with them, making sure that Artificial Analysis is useful to them, has been fantastic for supporting us in working out how we should build out Artificial Analysis to continue being useful to those, you know, building on AI.swyx [00:17:59]: I think to some extent, I'm of mixed opinion on that one, because to some extent, your target audience is not people in AI Grant, who are obviously at the frontier. Yeah. Do you disagree?Micah [00:18:09]: To some extent. To some extent. But then, a lot of what the AI Grant companies are doing is taking capabilities coming out of the labs and trying to push the limits of what they can do across the entire stack for building great applications, which actually makes some of them pretty archetypical power users of Artificial Analysis — some of the people with the strongest opinions about what we're doing well, what we're not doing well, and what they want to see next from us. Yeah. Yeah. Because when you're building any kind of AI application now, chances are you're using a whole bunch of different models. You're maybe switching reasonably frequently between different models and different parts of your application to optimize what you're able to do with them at an accuracy level, and to get better speed and cost characteristics. So for many of them, no, they're not commercial customers of ours — we don't charge for all our data on the website. Yeah. They are absolutely some of our power users.swyx [00:19:07]: So let's talk about just the evals as well. So you started out from the general MMLU and GPQA stuff. What's next? How do you sort of build up to the overall index? What was in V1, and how did you evolve it? Okay.Micah [00:19:22]: So first, just as background: we're talking about the Artificial Analysis Intelligence Index, which is our synthesis metric that we pull together, currently from 10 different eval datasets, to give what we're pretty confident is the best single number to look at for how smart the models are. 
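To make the mechanics above concrete — repeated runs, 95% confidence intervals, synthesized into one number — here is a minimal sketch. The eval names, weights, and scores are hypothetical, and this is not Artificial Analysis's actual methodology, just the shape of the calculation:

```python
import statistics

def mean_and_ci(run_scores: list[float], z: float = 1.96) -> tuple[float, float]:
    # Mean accuracy across repeated runs, with a normal-approximation 95%
    # confidence half-width; more repeats shrink it roughly as 1/sqrt(n).
    mean = statistics.mean(run_scores)
    sem = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, z * sem

def intelligence_index(per_eval_runs: dict[str, list[float]],
                       weights: dict[str, float]) -> float:
    # Weighted average of per-eval mean scores, reported on a 0-100 scale.
    total = sum(weights.values())
    return 100 * sum(
        weights[name] * mean_and_ci(runs)[0]
        for name, runs in per_eval_runs.items()
    ) / total

# Hypothetical: three repeats each of two evals for one model.
runs = {"mmlu_pro": [0.84, 0.86, 0.85], "gpqa_diamond": [0.61, 0.58, 0.60]}
print(round(intelligence_index(runs, {"mmlu_pro": 1.0, "gpqa_diamond": 1.0}), 1))
```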
Obviously, it doesn't tell the whole story. That's why we publish the whole website of charts, to dive into every part of it and look at the trade-offs. But it's the best single number. So right now, it's got a bunch of Q&A type datasets that have been very important to the industry, like a couple that you just mentioned. It's also got a couple of agentic datasets. It's got our own long context reasoning dataset and some other use-case-focused stuff. As time goes on, the things that we're most interested in — the capabilities that are becoming more important for AI and that developers care about — are going to be first around agentic capabilities. So surprise, surprise: we're all loving our coding agents, and how the models perform in them, and doing similar things for different types of work, are really important to us. Linking to use cases — to economically valuable use cases — is extremely important to us. And then we've got the things that the models still struggle with, like working really well over long contexts, that are not going to go away as specific capabilities and use cases that we need to keep evaluating.swyx [00:20:46]: But I guess one thing I was driving at was the V1 versus the V2, and how it saturated over time.Micah [00:20:53]: Like, how we've changed the index to where we are.swyx [00:20:55]: And I think that reflects the change in the industry. Right. So that's a nice way to tell that story.Micah [00:21:00]: Well, V1 would be completely saturated right now by almost every model coming out, because doing things like writing the Python functions in HumanEval is now pretty trivial. It's easy to forget, actually, how much progress has been made in the last two years. Like, we obviously play the game constantly of today's version versus last week's version and the week before, and all of the small changes in the horse race between the current frontier, and who has the best smaller-than-10B model right now this week. Right. And that's very important to a lot of developers and people, especially in this particular city of San Francisco. But when you zoom out a couple of years, literally most of what we were doing to evaluate the models then would all be 100% solved by even pretty small models today. And that's been one of the key things, by the way, that's driven down the cost of intelligence at every tier of intelligence — we can talk about that more in a bit. So V1, V2, V3: we made things harder. We covered a wider range of use cases. And we tried to get closer to things developers care about, as opposed to just the Q&A type stuff that MMLU and GPQA represented. Yeah.swyx [00:22:12]: I don't know if you have anything to add there. Or we could just go right into showing people the benchmark and looking around and asking questions about it. Yeah.Micah [00:22:21]: Let's do it. Okay. This would be a pretty good way to chat about a few of the new things we've launched recently. Yeah.George [00:22:26]: And I think a little bit about the direction that we want to take it, and where we want to push benchmarks. Currently, the Intelligence Index and evals focus a lot on kind of raw intelligence. But we want to diversify how we think about intelligence. And we can talk about it. The new evals that we've built and partnered on focus on topics like hallucination. And we've got a lot of topics that I think are not covered by the current eval set that should be. 
And so we want to bring that forth. But before we get into that.swyx [00:23:01]: And so for listeners, just as a timestamp, right now, number one is Gemini 3 Pro High, then followed by Claude Opus at 70, GPT-5.1 High — you don't have 5.2 yet — and Kimi K2 Thinking. Wow. Still hanging in there. So those are the top four. That will date this podcast quickly. Yeah. Yeah. I mean, I love it. I love it. No, no. 100%. Look back this time next year and go, how cute. Yep.George [00:23:25]: Totally. A quick view of that is, okay, there's a lot. I love it. I love this chart. Yeah.Micah [00:23:30]: This is such a favorite, right? Yeah. In almost every talk that George or I give at conferences, we always put this one up first, just to situate where we are in this moment in history. This, I think, is the visual version of what I was saying before about zooming out and remembering how much progress there's been. If we go back to just over a year ago, before o1, before Claude Sonnet 3.5, we didn't have reasoning models or coding agents as a thing. And the game was very, very different. If we go back even a little bit before then, we're in the era where, when you look at this chart, OpenAI was untouchable for well over a year. And, I mean, you would remember that time period well — there being very open questions about whether or not AI was going to be competitive, like, full stop; whether or not OpenAI would just run away with it; whether we would have a few frontier labs and no one else would really be able to do anything other than consume their APIs. I am quite happy overall that the world that we have ended up in is one where... Multi-model. Absolutely. And strictly more competitive every quarter over the last few years. Yeah. This year has been insane. Yeah.George [00:24:42]: You can see it. This chart with everything added is hard to read currently. There are so many dots on it, but I think it reflects a little bit what we felt — how crazy it's been.swyx [00:24:54]: Why 14 as the default? Is that a manual choice? Because you've got ServiceNow in there, which is a less traditional name. Yeah.George [00:25:01]: It's models that we're highlighting by default in our charts, in our Intelligence Index. Okay.swyx [00:25:07]: You just have a manually curated list of stuff.George [00:25:10]: Yeah, that's right. But something that I actually don't think every Artificial Analysis user knows is that you can customize our charts and choose which models are highlighted. Yeah. And so if we take off a few names, it gets a little easier to read.swyx [00:25:25]: Yeah, yeah. A little easier to read. Totally. Yeah. But I love that you can see the o1 jump. Look at that. September 2024. And the DeepSeek jump. Yeah.George [00:25:34]: Which got close to OpenAI's leadership. They were so close. I think, yeah, we remember that moment. Around this time last year, actually.Micah [00:25:44]: Yeah, yeah, yeah. I agree. Yeah, well, give or take a couple of weeks. It was Boxing Day in New Zealand when DeepSeek V3 came out. And we'd been tracking DeepSeek and a bunch of the other global players that were less known over the second half of 2024, and had run evals on the earlier ones and stuff. I very distinctly remember Boxing Day in New Zealand — because I was with family for Christmas and stuff — running the evals and getting back result by result on DeepSeek V3. So this was the first of their V3 architecture, the 671B MoE.Micah [00:26:19]: And we were very, very impressed. 
That was the moment where we were sure that DeepSeek was no longer just one of many players, but had jumped up to be a thing. The world really noticed when they followed that up with the RL working on top of V3, and R1 succeeding a few weeks later. But the groundwork for that absolutely was laid with just an extremely strong base model, completely open weights, which we had as the best open weights model. So, yeah, that's the thing that you really see in the chart — they got us on Boxing Day last year.George [00:26:48]: Boxing Day is the day after Christmas, for those not familiar.swyx [00:26:54]: I'm from Singapore. A lot of us remember Boxing Day for a different reason, for the tsunami that happened. Oh, of course. Yeah, but that was a long time ago. So yeah. So this is the rough pitch of AAQI. Is it A-A-Q-I or A-A-I-I? I-I. Okay. Good memory, though.Micah [00:27:11]: I don't know. I'm not used to it. Once upon a time, we did call it Quality Index, and we would talk about quality, performance, and price, but we changed it to intelligence.George [00:27:20]: There have been a few naming changes. We added hardware benchmarking to the site — so benchmarks at a kind of system level — and so then we changed our throughput metric: we now call it output speed, because throughput makes sense at a system level, so we took that name.swyx [00:27:32]: Take me through more charts. What should people know? Obviously, the way you look at the site is probably different than how a beginner might look at it.Micah [00:27:42]: Yeah, that's fair. There's a lot of fun stuff to dive into. Maybe, so we can get past all the — like, we have lots and lots of evals and stuff — the interesting ones to talk about today, that probably not many people will be familiar with yet, are a few of our recent things. So the first one of those is our Omniscience Index. This one is a little bit different to most of the intelligence evals that we run. We built it specifically to look at the embedded knowledge in the models, and to test hallucination by looking at, when the model doesn't know the answer — so it's not able to get it correct — what's its probability of saying 'I don't know' versus giving an incorrect answer. So the metric that we use for Omniscience goes from negative 100 to positive 100, because we're simply taking off a point if you give an incorrect answer to a question. We're pretty convinced that this is an example of where it makes most sense to do that, because it's strictly more helpful to say 'I don't know' instead of giving a wrong answer to a factual knowledge question. And one of our goals is to shift the incentive that evals create for models, and for the labs creating them, to get higher scores. Almost every eval across all of AI up until this point has been graded by simple percentage correct as the main metric, the main thing that gets hyped. And so you should take a shot at everything — there's no incentive to say 'I don't know.' So we did that for this one here.swyx [00:29:22]: I think there's a general field of calibration as well — like, the confidence in your answer versus the rightness of the answer. Yeah, we completely agree. Yeah. Yeah.George [00:29:31]: On that — one reason that we didn't do that, or put that into this index, is that we think the way to do that is not to ask the models how confident they are.swyx [00:29:43]: I don't know. Maybe it might be, though. 
You could put it in, like, a JSON field — say, 'confidence' — and maybe it spits out something. Yeah. You know, we have done a few evals podcasts over the years. And when we did one with Clementine of Hugging Face, who maintained the OpenLLM Leaderboard, this was one of her top requests: some kind of hallucination slash confidence calibration thing. And so, hey, this is one of them.Micah [00:30:05]: And, I mean, like anything that we do, it's not a perfect metric, or the whole story of everything that you think about as hallucination. But yeah, it's pretty useful and has some interesting results. Like, one of the things that we saw in the hallucination rate is that Anthropic's Claude models are at the very left-hand side here, with the lowest hallucination rates out of the models that we've evaluated Omniscience on. That is an interesting fact. I think it probably correlates with a lot of the previously not-really-measured vibes stuff that people like about some of the Claude models. Is the dataset public, or is there a held-out set? There's a held-out set for this one. So we have published a public test set, but we've only published 10% of it. The reason is that for this one specifically, it would be very, very easy to have data contamination, because it is just factual knowledge questions. We'll update it over time to also prevent that, but yeah, we've kept most of it held out so that we can keep it reliable for a long time. It leads us to a bunch of really cool things, including breaking down quite granularly by topic. And so we've got some of that disclosed on the website publicly right now, and there's lots more coming in terms of our ability to break out very specific topics. Yeah.swyx [00:31:23]: I would be interested. Let's dwell a little bit on this hallucination one. I noticed that Haiku hallucinates less than Sonnet, which hallucinates less than Opus. And yeah. Would that be the other way around in a normal capability environment? I don't know. What do you make of that?George [00:31:37]: One interesting aspect is that we've found that there's not really a strong correlation between intelligence and hallucination, right? That's to say, how smart the models are in a general sense isn't correlated with their ability to, when they don't know something, say that they don't know. It's interesting that Gemini 3 Pro Preview was a big leap over here — over Gemini 2.5 Flash and 2.5 Pro — and if I add Pro quickly here.swyx [00:32:07]: I bet Pro's really good. Uh, actually no, I meant the GPT Pros.George [00:32:12]: Oh yeah.swyx [00:32:13]: Cause the GPT Pros are rumored — we don't know for a fact — to be, like, eight runs and then an LLM judge on top. Yeah.George [00:32:20]: So we saw a big jump in — this is accuracy, so this is just the percent that they get correct — and Gemini 3 Pro knew a lot more than the other models. So, a big jump in accuracy. But relatively no change between the Google Gemini models, between releases. And the hallucination rate. Exactly. And so it's likely just a different post-training recipe between the Claude models and the rest. Yeah.Micah [00:32:45]: That's what's driven this. Yeah. You can partially blame us, and how we define intelligence — having, until now, not defined hallucination as a negative in the way that we think about intelligence.swyx [00:32:56]: And so that's what we're changing. 
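The scoring rule George and Micah describe is easy to state precisely. Here is a minimal sketch of an Omniscience-style score and hallucination rate, with hypothetical tallies — not Artificial Analysis's exact implementation:

```python
from dataclasses import dataclass

@dataclass
class Tally:
    correct: int
    incorrect: int
    declined: int  # answered "I don't know"

def omniscience_style_score(t: Tally) -> float:
    # +1 per correct answer, -1 per incorrect answer, 0 for declining,
    # scaled to [-100, 100]: guessing wrong is strictly worse than abstaining.
    n = t.correct + t.incorrect + t.declined
    return 100 * (t.correct - t.incorrect) / n

def hallucination_rate(t: Tally) -> float:
    # Of the questions the model could not answer correctly, how often did
    # it assert a wrong answer instead of saying "I don't know"?
    missed = t.incorrect + t.declined
    return t.incorrect / missed if missed else 0.0

t = Tally(correct=550, incorrect=150, declined=300)  # hypothetical tallies
print(omniscience_style_score(t))       # 40.0
print(round(hallucination_rate(t), 3))  # 0.333
```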
Uh, I know many smart people who are confidently incorrect.George [00:33:02]: Look at that — that is very human. Very true. And there's a time and a place for that. I think our view is that hallucination rate makes sense in this context, where it's around knowledge, but in many cases, people want the models to hallucinate, to have a go. Often that's the case in coding, or when you're trying to generate newer ideas. One eval that we added to Artificial Analysis is Critical Point, and it's really hard physics problems. Okay.swyx [00:33:32]: And is it sort of like a HumanEval type or something different, or like a FrontierMath type?George [00:33:37]: It's not dissimilar to FrontierMath. So these are kind of research questions that academics in the physics world would be able to answer, but models really struggle to answer. So the top score here is now 9%.swyx [00:33:51]: And the people that created this — like Minway, and actually Ofir, who was kind of behind SWE-bench — what organization is this? Oh, is this — it's Princeton.George [00:34:01]: A range of academics from different academic institutions, really smart people. They talked about how they turn the models up in terms of temperature — as high a temperature as they can — when they're trying to explore new ideas in physics with the models as a thought partner, just because they want the models to hallucinate. Um, yeah, sometimes it's something new. Yeah, exactly.swyx [00:34:21]: So not right in every situation, but I think it makes sense, you know, to test hallucination in scenarios where it makes sense. Also, the obvious question is: this is one of many — every lab has a system card that shows some kind of hallucination number, and you've chosen not to endorse that, and you've made your own. And I think that's a choice. Totally. In some sense, the rest of Artificial Analysis is public benchmarks that other people can independently rerun; you provide it as a service here. You have to fight the 'well, who are we to do this?' And your answer is that we have a lot of customers, and, you know — but, like, I guess, how do you converge the individual?Micah [00:35:08]: I mean, I think for hallucinations specifically, there are a bunch of different things that you might reasonably care about, and that you'd measure quite differently. Like, we've called this an Omniscience hallucination rate — not trying to declare that it's Humanity's Last Hallucination. You could have some interesting naming conventions and all this stuff. The bigger-picture answer to that is something that I actually wanted to mention just as George was explaining Critical Point as well: as we go forward, we are building evals internally, and we're partnering with academia and with AI companies to build great evals. We have pretty strong views, in various ways for different parts of the AI stack, on where there are things that are not being measured well, or things that developers care about that should be measured more and better. And we intend to be doing that. We're not obsessed with the idea that everything we do has to be done entirely within our own team. Critical Point is a cool example of where we were a launch partner for it, working with academia. We've got some partnerships coming up with a couple of leading companies. 
Those ones, obviously, we have to be careful with on some of the independence stuff, but with the right disclosure, we're completely comfortable with that. A lot of the labs have released great datasets in the past that we've used to great success independently. And so between all of those techniques, we're going to be releasing more stuff in the future. Cool.swyx [00:36:26]: Let's cover the last couple. And then I want to talk about your trends analysis stuff, you know? Totally.Micah [00:36:31]: On that, actually, I have one little factoid on Omniscience. If you go back up to accuracy on Omniscience: an interesting thing about this accuracy metric is that it tracks the total parameter count of models more closely than anything else that we measure. Makes a lot of sense intuitively, right? Because this is a knowledge eval — this is the pure knowledge metric. We're not looking at the index and the hallucination rate stuff, which we think is much more about how the models are trained. This is just: what facts did they recall? And yeah, it tracks parameter count extremely closely. Okay.swyx [00:37:05]: What's the rumored size of Gemini 3 Pro? And to be clear, not confirmed by any official source, just rumors. But rumors do fly around. Rumors. I hear all sorts of numbers. I don't know what to trust.Micah [00:37:17]: So if you draw the line on Omniscience accuracy versus total parameters — we've got all the open weights models — you can squint and see that likely the leading frontier models right now are quite a lot bigger than the one trillion parameters that the open weights models we're looking at here cap out at. There's an interesting extra data point that Elon Musk revealed recently about xAI: three trillion parameters for Grok 3 and 4, six trillion for Grok 5, but that's not out yet. Take those together, have a look. You might reasonably form a view that there's a pretty good chance that Gemini 3 Pro is bigger than that — that it could be in the 5 to 10 trillion parameter range. To be clear, I have absolutely no idea, but just based on this chart, that's where you would land if you have a look at it. Yeah.swyx [00:38:07]: And to some extent, I actually kind of discourage people from guessing too much, because what does it really matter? As long as they can serve it at a sustainable cost, that's about it. Like, yeah, totally.George [00:38:17]: They've also got different incentives in play compared to open weights models, which are built with supporting others in self-deployment in mind. For the labs who are doing inference at scale, it's, I think, less about total parameters in many cases when thinking about inference costs, and more about the number of active parameters. And so there's a bit of an incentive towards larger, sparser models. Agreed.Micah [00:38:38]: Understood. Yeah. Great. I mean, obviously, if you're a developer or company using these things — exactly as you say, it doesn't matter. You should be looking at all the different ways that we measure intelligence. You should be looking at the cost to run the index, and the different ways of thinking about token efficiency and cost efficiency based on the list prices, because that's all that matters.swyx [00:38:56]: It's not as good for the content creator rumor mill, where I can say, oh, GPT-4 is this small circle; look, GPT-5 is this big circle. That used to be a thing for a while. 
Yeah.Micah [00:39:07]: But that is, like, on its own actually a very interesting one, right? Chances are the last couple of years haven't seen a dramatic scaling up in the total size of these models. And so there's a lot of room to go up properly in total size of the models, especially with the upcoming hardware generations. Yes.swyx [00:39:29]: So, you know. Taking off my shitposting hat for a minute. Yes. Yes. At the same time, I do feel like — especially coming back from Europe — people do feel like Ilya is probably right that the paradigm doesn't have many more orders of magnitude to scale out. And therefore we need to start exploring at least a different path. GDPVal, I think, is only like a month or so old. I was also very positive on it when it first came out. I actually talked to Tejal, who was the lead researcher on that. Oh, cool. And you have your own version.George [00:39:59]: It's a fantastic dataset. Yeah.swyx [00:40:01]: And maybe we'll recap it for people who are not familiar with it. It's like 44 tasks, based on some kind of GDP cutoff, that are meant to represent broad white-collar work that is not just coding. Yeah.Micah [00:40:12]: Each of the tasks has a whole bunch of detailed instructions, and input files for a lot of them. Within the 44, it's divided into about 220 subtasks — two to five, maybe, per task — which are the level that we run through the agentic harness. And yeah, they're really interesting. I will say that it doesn't necessarily capture all the stuff that people do at work. No eval is perfect; there are always going to be more things to look at — largely because, in order to make the tasks well enough defined that you can run them, they need to only have a handful of input files and very specific instructions for the task. And so I think the easiest way to think about them is that they're like quite hard take-home exam tasks that you might do in an interview process.swyx [00:40:56]: Yeah, for listeners, it is no longer like a long prompt. It is like, here's a zip file with a spreadsheet or a PowerPoint deck or a PDF — go nuts and answer this question.George [00:41:06]: OpenAI released a great dataset, and they released a good paper which looks at performance across the different web chatbots on the dataset. It's a great paper; I encourage people to read it. What we've done is taken that dataset and turned it into an eval that can be run on any model. So we created a reference agentic harness that can run the models on the dataset, and then we developed an evaluator approach to compare outputs. That's kind of AI-enabled, so it uses Gemini 3 Pro Preview to compare results, which we tested pretty comprehensively to ensure that it's aligned with human preferences. One data point there is that even with Gemini 3 Pro as the evaluator, Gemini 3 Pro interestingly doesn't actually do that well on the eval itself. So that's kind of a good example of what we've done in GDP Val AA.swyx [00:42:01]: Yeah, the thing that you have to watch out for with an LLM judge is self-preference — models usually prefer their own output — and in this case, it was not. 
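For readers who haven't built one: the grading setup being described — criteria-based pairwise comparison by a judge model — can be sketched in a few lines. The prompt and the `call_llm` function here are hypothetical placeholders, not the actual GDP Val AA grader:

```python
import json

JUDGE_PROMPT = """You are grading two attempts at the same task.

Task criteria:
{criteria}

Attempt A:
{a}

Attempt B:
{b}

Judge strictly against the criteria, not writing style.
Reply as JSON: {{"winner": "A" or "B", "reason": "<one sentence>"}}"""

def pairwise_judge(call_llm, criteria: str, out_a: str, out_b: str) -> str:
    # `call_llm` is any prompt -> text function wrapping the judge model.
    # Running each pair twice with A/B swapped and reconciling the verdicts
    # is a common guard against position bias.
    raw = call_llm(JUDGE_PROMPT.format(criteria=criteria, a=out_a, b=out_b))
    return json.loads(raw)["winner"]
```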
Totally.Micah [00:42:08]: I think the way that we're thinking about the places where it makes sense to use an LLM-as-judge approach now is quite different to some of the early LLM-as-judge stuff a couple of years ago, because some of that — and MT-Bench was a great project that was a good example of this a while ago — was about judging conversations and a lot of style-type stuff. Here, the task that the grading model is doing is quite different to the task of taking the test. When you're taking the test, you've got all of the agentic tools you're working with — the code interpreter and web search, the file system — to go through many, many turns to try to create the documents. Then, on the other side, when we're grading it, we're running it through a pipeline to extract visual and text versions of the files so we can provide those to Gemini, and we're providing the criteria for the task and getting it to pick which of the two outputs more effectively meets the criteria. It turns out that it's just very, very good at getting that right — it matched human preference a lot of the time — because I think it's got the raw intelligence, combined with the correct representation of the outputs, the fact that the outputs were created with an agentic task that is quite different to the way the grading model works, and that we're comparing against criteria, not just kind of zero-shot asking the model to pick which one is better.swyx [00:43:26]: Got it. Why is this an ELO, and not a percentage, like GDPVal?George [00:43:31]: So the outputs look like documents, and there are video outputs or audio outputs from some of the tasks. It has to make a video? Yeah, for some of the tasks. Some of the tasks.swyx [00:43:43]: What task is that?George [00:43:45]: I mean, it's in the dataset. Like, be a YouTuber? It's a marketing video.Micah [00:43:49]: Oh, wow. What? Like, the model has to go find clips on the internet and try to put it together. The models are not that good at doing that one, for now, to be clear. It's pretty hard to do that with a code editor. I mean, the computer-use stuff doesn't work quite well enough, and so on and so on. But yeah.George [00:44:02]: And so there's no kind of ground truth, necessarily, to compare against to work out percentage correct. It's hard to come up with correct or incorrect there. And so it's on a relative basis, and we use an ELO approach to compare outputs from each of the models across the tasks.swyx [00:44:23]: You know what you should do? You should pay a contractor, a human, to do the same task, and then give it an ELO — so you have a human in there. I think what's helpful about GDPVal, the OpenAI one, is that 50% is meant to be a normal human — and maybe a domain expert is higher than that — but 50% was the bar: if you've crossed 50, you are superhuman. Yeah.Micah [00:44:47]: So we haven't grounded this score in that exactly. I agree that it can be helpful, but we wanted to generalize this to a very large number of models. That's one of the reasons that presenting it as an ELO is quite helpful: it allows us to add models, and it'll stay relevant for quite a long time. I also think it can be tricky looking at these exact tasks compared to human performance, because the way that you would go about them as a human is quite different to how the models would go about them. 
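The ELO mechanics here are the standard chess-style rating update applied to model-vs-model comparisons. A minimal sketch — the K-factor and ratings are hypothetical, and a production system would presumably fit ratings over all pairwise results at once rather than updating online:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    # Standard ELO: the winner gains points in proportion to how
    # unexpected the win was under the logistic expectation.
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

# Hypothetical: a 1200-rated model beats a 1300-rated one on a task.
print(elo_update(1200, 1300))  # (~1220.5, ~1279.5)
```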
Yeah.swyx [00:45:15]: I also liked that you included Llama 4 Maverick in there. Is that, like, just one last...Micah [00:45:20]: Well, no, no, no — it is the best model released by Meta. And... so it makes it into the homepage default set, still, for now.George [00:45:31]: Another inclusion that's quite interesting: we also ran it across the latest versions of the web chatbots. And so we have...swyx [00:45:39]: Oh, that's right.George [00:45:40]: Oh, sorry.swyx [00:45:41]: I, yeah, I completely missed that. Okay.George [00:45:43]: No, not at all. So that's the one with the checkered pattern. So that is their harness, not yours, is what you're saying. Exactly. And what's really interesting is that if you compare, for instance, Claude Opus 4.5 using the Claude web chatbot, it performs worse than the model in our agentic harness. And in every case, the model performs better in our agentic harness than its web chatbot counterpart — the harness that they created.swyx [00:46:13]: Oh, my backwards explanation for that would be that, well, the chatbot is meant for consumer use cases, and here you're pushing it for something else.Micah [00:46:19]: The constraints are different, and the amount of freedom that you can give the model is different. Also, they have a cost goal. We let the models work as long as they want, basically. Yeah. Do you copy-paste manually into the chatbot? Yeah. Yeah. That was how we got the chatbot reference results. We're not going to be keeping those updated at quite the same scale as the hundreds of models.swyx [00:46:38]: Well, I don't know — talk to Browserbase. They'll automate it for you. You know, I have thought about, well, we should turn these chatbot versions into an API, because they are legitimately different agents in themselves. Yes. Right. Yeah.Micah [00:46:53]: And that's grown a huge amount over the last year, right? Like, the tools that are available have actually diverged, in my opinion, a fair bit across the major chatbot apps, and the number of data sources that you can connect them to has gone up a lot, meaning that your experience and the way you're using the model is more different than ever.swyx [00:47:10]: What tools and what data connections come to mind? What's interesting, what's notable work that people have done?Micah [00:47:15]: Oh, okay. So my favorite example on this is that until very recently, I would argue that it was basically impossible to get an LLM to draft an email for me in any useful way. Because most times that you're sending an email, you're not just writing something for the sake of writing it. Chances are the context required is a whole bunch of historical emails. Maybe it's notes that you've made, maybe it's meeting notes, maybe it's pulling something from wherever you store stuff at work — so for me, Google Drive, OneDrive, or our Supabase databases if we need to do some analysis on some data or something. Preferably, the model can be plugged into all of those things and can go do some useful work based on them. The things that I find most impressive currently — that I am somewhat surprised work really well in late 2025 — are that I can have models use the Supabase MCP to query (read-only, of course) and run a whole bunch of SQL queries to do pretty significant data analysis, and make charts and stuff, and read my Gmail and my Notion. And okay. You actually use that. That's good. That's good. Is that a Claude thing? 
To varying degrees — both ChatGPT and Claude right now. I would say that this stuff, like, barely works, in fairness, right now.George [00:48:33]: Because people are actually going to try this after they hear it. If you get an email from Micah, odds are it wasn't written by a chatbot.Micah [00:48:38]: So, yeah, I think it is true that I have never actually sent anyone an email drafted by a chatbot. Yet.swyx [00:48:46]: And so you can feel it coming, right? And yeah, this time next year, we'll come back and see where it's going. Totally. Supabase shout-out — another famous Kiwi. I don't know if you've had any conversations with him about anything in particular on AI building and AI infra.George [00:49:03]: We have had Twitter DMs with him, because we're quite big Supabase users and power users. And we probably do some things more manually than we should in Supabase — the support line has been super friendly. One extra point regarding GDP Val AA is that, on the basis of the overperformance of the models compared to the chatbots, we realized that, oh, the reference harness that we built actually works quite well on generalist agentic tasks — this proves it, in a sense. And the agent harness is very minimalist. I think it follows some of the ideas that are in Claude Code, and all that we give it is context management capabilities, a web search and web browsing tool, and a code execution environment. Anything else?Micah [00:50:02]: I mean, we can equip it with more tools, but by default, yeah, that's it. For GDP Val, we give it a tool to view an image specifically, because the models, you know, can just use a terminal to pull stuff in text form into context, but to pull visual stuff into context, we had to give them a custom tool. But yeah, exactly.George [00:50:21]: So it turned out that we had created a good generalist agentic harness, and so we released it on GitHub yesterday. It's called Stirrup. So if people want to check it out — it's a great base for building a generalist agent for more specific tasks.Micah [00:50:39]: I'd say the best way to use it is git clone, and then have your favorite coding agent make changes to it to do whatever you want, because it's not that many lines of code, and the coding agents can work with it super well.swyx [00:50:51]: Well, that's nice for the community to explore and share and hack on. I think in other similar environments, the Terminal-Bench guys have done sort of the same with Harbor. And so it's a bundle of: well, we need our minimal harness, which for them is Terminus, and we also need the RL environments or Docker deployment thing to run independently. So I don't know if you've looked at Harbor at all — is that like a standard that people want to adopt?George [00:51:19]: Yeah, we've looked at it from an evals perspective, and we love Terminal-Bench and host Terminal-Bench benchmarks on Artificial Analysis. We've looked at it from a coding agent perspective, but could see it being a great basis for any kind of agents. I think where we're getting to is that these models have gotten smart enough. 
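The "minimalist harness" idea reduces to a surprisingly small loop: the model picks the next action, and the harness only dispatches tools and appends results. A rough sketch in that spirit — the function names and message shapes here are hypothetical, not Stirrup's actual API:

```python
def run_agent(call_llm, tools: dict, task: str, max_turns: int = 100) -> str:
    # `call_llm` wraps the model; `tools` maps names like "web_search" or
    # "run_code" to plain Python functions. The model drives the workflow;
    # the harness does no orchestration of its own.
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        step = call_llm(history, tools=list(tools))  # model chooses the next action
        if step["type"] == "final_answer":
            return step["content"]
        result = tools[step["tool"]](**step["args"])  # dispatch the tool call
        history.append({"role": "tool", "name": step["tool"], "content": result})
    return "(ran out of turns)"
```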
swyx [00:50:51]: Well, that's nice for the community to explore and share and hack on. I think in other similar environments, the Terminal-Bench guys have done Harbor, which is a bundle of: we need a minimal harness, which for them is Terminus, and we also need the RL environments, the Docker deployment thing, to run independently. I don't know if you've looked at Harbor at all. Is that a standard that people want to adopt?
George [00:51:19]: Yeah, we've looked at it from an evals perspective, and we love Terminal-Bench and host Terminal-Bench benchmarks on Artificial Analysis. We've looked at it from a coding agent perspective, but could see it being a great basis for any kind of agent. I think where we're getting to is that these models have gotten smart enough, and they've gotten better tools, so they perform better when just given a minimalist set of tools and let run. Let the model control the agentic workflow, rather than using a more built-out framework that tries to dictate the flow.
swyx [00:51:56]: Awesome. Let's cover the openness index, and then let's go into the report stuff. That's the last of the proprietary numbers, I guess; I don't know how you classify all these.
Micah [00:52:07]: Call it the last of the three new things we're talking about from the last few weeks. Because we do a mix of stuff: things where we use open source, things that we open source ourselves, and proprietary stuff that we don't always open source. The long context reasoning dataset last year, we did open source. And then of all the work on performance benchmarks across the site, some we're looking to open source, but some we're constantly iterating on. So there's a huge mix of stuff that is and isn't open source across the site.
swyx: That's LCR, for people. But let's talk about the openness index.
Micah [00:52:42]: Let's talk about the openness index. This is, call it, a new way to think about how open models are. For a long time we have tracked whether models are open weights and what the licenses on them are. That's pretty useful: it tells you what you're allowed to do with the weights of a model. But there's this whole other dimension to how open models are that is pretty important and that we haven't tracked until now, which is how much is disclosed about how the model was made. So transparency about data, both pre-training and post-training data, whether you're allowed to use that data, and transparency about methodology and training code. Those are the components. We bring them together to score an openness index for models, so that in one place you can get the full picture of how open a model is.
swyx [00:53:32]: I feel like I've seen a couple of other people try to do this, but they're not maintained. I do think this matters. I don't know what the numbers mean, though. Is there a max number? Is this out of 20?
George [00:53:44]: It's out of 18 currently, and we've got an openness index page. Essentially you get points for being more open across these different categories, and the maximum you can achieve is 18. So AI2, with their extremely open OLMo 3 32B Think model, is the leader, in a sense.
swyx [00:54:04]: And Hugging Face?
George [00:54:05]: Oh, with their smaller model? It's coming soon. I think we need to run the intelligence benchmarks to get it on the site.
swyx [00:54:12]: You can't have an openness index and not include Hugging Face. We love Hugging Face.
George: We'll have that up very soon.
swyx: I mean, RefinedWeb and all that stuff, it's amazing. Or is it called FineWeb? FineWeb. FineWeb.
Micah [00:54:23]: Yeah, totally. One of the reasons this is cool is that if you're trying to understand the holistic picture of a model and what you can do with all the stuff the company is contributing, this gives you that picture. And we are going to keep it up to date alongside all the models that we run the intelligence index on, on the site. It's just an extra view to understand them.
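Since the rubric tops out at 18 points, a toy scorer makes the structure easy to picture. The categories below follow the ones named in the conversation; the three-point split per category is invented for illustration and is not Artificial Analysis's actual rubric (the 18-point maximum is the only number taken from the episode):

```python
# Toy openness-index scorer. Category names follow the episode's
# description; the per-category point caps are assumptions.
RUBRIC: dict[str, int] = {
    "open_weights": 3,
    "permissive_license": 3,
    "pretraining_data_disclosed": 3,
    "posttraining_data_disclosed": 3,
    "data_usable": 3,
    "methodology_and_training_code": 3,
}
assert sum(RUBRIC.values()) == 18  # matches the stated maximum

def openness_index(scores: dict[str, int]) -> int:
    """Sum per-category points, clamping each to its category cap."""
    return sum(min(scores.get(cat, 0), cap) for cat, cap in RUBRIC.items())

# A release that is fully open on every axis maxes out the index:
print(openness_index({cat: 3 for cat in RUBRIC}))  # -> 18
```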
swyx [00:54:43]: Can you scroll down to this, the trade-offs chart? Yeah, that one. This really matters, right? Obviously, because you can b
Derick Schaefer, author of CLI: A Practical Guide to Creating Modern Command-Line Interfaces, talks with host Robert Blumen about command-line interfaces old and new. Starting with a short review of the origin of commands in early Unix systems, they trace the evolution of commands into modern CLIs. Following the historic rise, fall, and re-emergence of CLIs, they consider innovative examples such as git, GitHub, WordPress, and Warp. Schaefer clarifies whether commands are the same as CLIs and then discusses a range of topics, including implementation languages, packages in the Go ecosystem for CLI development, CLIs and APIs, CLIs and AIs, AI tooling versus MCP, the object-command pattern, command flags, API authentication, whether CLIs should be stateless, and output formats such as JSON and rich text. Brought to you by IEEE Computer Society and IEEE Software magazine.
If there's one thing we absolutely knew would come along with the increased interest in and use of AI, it would be… more acronyms! And, along with the acronyms, we could pretty much predict that we'd see a lot of online flexing through casual dropping of said acronyms as though they're deeply understood by everyone who's anyone. We tackled one such acronym on this episode: MCP! That's "Model Context Protocol" for those who like their acronyms written out, and Sam Redfern joined us to help us wrap our heads around the topic. You see, MCP is kinda like some other more familiar acronyms like API and XML. But, it's also like… fingers? Sam's enthusiasm and explanation certainly had us ready to dive in! This episode's Measurement Bite from show sponsor Recast is an explanation of model robustness from Michael Kaminsky! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.
This week we have Jeppe Reinhold, a core contributor to Storybook working at Chromatic. Jeppe shares how Storybook has evolved from a slow, complex tool to a fast, modern development environment through major architectural changes like Vite integration, ESM migration, and dependency reduction. We talk about the Component Story Format evolution, framework agnosticism challenges, local testing improvements with Vitest integration, and how Storybook is integrating with AI and LLMs through MCP servers to help coding agents understand and use component libraries.
https://reinhold.is/
https://bsky.app/profile/reinhold.is
https://storybook.js.org/
In this episode I talk with Kendall Miller about MCP (Model Context Protocol) and why AI agents need third-party guardrails. His company Maybe Don't sits between AI agents and MCP servers to prevent disasters, because AI sometimes solves problems in creative and terrifying ways.
Links: Maybe Don't, AI | Kendall Miller on LinkedIn | Nonsense Monthly
Wes and Scott talk about their bold predictions for web development in 2026, from WebGPU-powered design and modern CSS breakthroughs to JavaScript standards, AI-driven tooling, security risks, the future of frameworks, workflows, and more! Show Notes 00:00 Welcome to Syntax! 00:49 WebGPU and 3D experiences will finally take off Lando Norris 01:30 Web design will make a comeback Raycast shaders.com 04:03 Light mode returns (yes, really) 07:06 Modern CSS standards are about to have a huge year CSS Wrapped Graffiti 13:15 Will the Temporal API finally ship everywhere in 2026? 14:18 The rise of the standard stack 16:18 Are we headed toward standardized RPC? 19:41 What's next (and what's not) for React 21:07 Why we'll see more security failures in web dev 22:35 SvelteKit 3 lands in 2026 22:53 Where developer tooling is headed next Oxc Biome 26:44 More big acquisitions Anthropic Bun 28:02 2026: the year of durable compute 30:57 Frameworks will matter less as AI gets better 33:34 End-to-end AI workflows become the norm 36:04 Brought to you by Sentry.io 37:21 Personalized software for everyday people 39:11 MCP and MCP UI will pop 42:24 Developer skills will fall off 46:20 Crappy software will continue Hit us up on Socials! Syntax: X Instagram Tiktok LinkedIn Threads Wes: X Instagram Tiktok LinkedIn Threads Scott: X Instagram Tiktok LinkedIn Threads Randy: X Instagram YouTube Threads
Tech That Worked in 2025 (Mac Power Users 829), David Sparks and Stephen Hackett. With the year winding down, Stephen and David reflect on what went well in their tech stacks, touching on Apple silicon, the company's default apps, AI tools, home automation, and networking. This episode of Mac Power Users is sponsored by: 1Password: Never forget a password again. DEVONthink: Get Organized — Unleash Your Creativity. Get 10% off. Ecamm: Powerful live streaming platform for Mac. Links and Show Notes: Sign up for the MPU email newsletter and join the MPU forums. You can watch the podcast over on YouTube. More Power Users: Ad-free episodes with regular bonus segments Submit Feedback Give the Gift of Relay Final call for 20% off an annual membership to MPU! Four Winners and Losers of Apple's 2025 - 512 Pixels Apple announces Mac transition to Apple silicon - Apple Add a user or group on Mac - Apple Support Safely open apps on your Mac - Apple Support Apple in China - by Patrick McGee Apple and Intel Rumored to Partner on Mac Chips Again in a New Way - MacRumors Four Mac Studios as an AI Cluster - YouTube The M3 Ultra Mac Studio for Local LLMs - MacStories OmniFocus - The Omni Group The Apple Productivity Suite Field Guide - MacSparky iWork - Apple MacWhisper Voice Memos transcription on iPhone - Apple Support OmniFocus-MCP - GitHub Automating Project Creation with MCP and OmniFocus - MacSparky Labs on YouTube Elsewhen OpenAI Opens Up ChatGPT App Submissions to Developers - MacStories Apple Music app now available on ChatGPT, here's how to use it - 9to5Mac Set up your HomePod, HomePod mini, or Apple TV as a home hub - Apple Support Aqara Presence Sensor FP2 - Amazon iRobot files for bankruptcy | The Verge iRobot's bankruptcy isn't the end — it's a reboot, says CEO Gary Cohen | The Verge Best Robot Vacuum and Mops & Robot Mops - eufy US
One year ago, Anthropic launched the Model Context Protocol (MCP)—a simple, open standard to connect AI applications to the data and tools they need. Today, MCP has exploded from a local-only experiment into the de facto protocol for agentic systems, adopted by OpenAI, Microsoft, Google, Block, and hundreds of enterprises building internal agents at scale. And now, MCP is joining the newly formed Agentic AI Foundation (AAIF) under the Linux Foundation, alongside Block's Goose coding agent, with founding members spanning the biggest names in AI and cloud infrastructure. We sat down with David Soria Parra (MCP lead, Anthropic), Nick Cooper (OpenAI), Brad Howes (Block / Goose), and Jim Zemlin (Linux Foundation CEO) to dig into the one-year journey of MCP—from Thanksgiving hacking sessions and the first remote authentication spec to long-running tasks, MCP Apps, and the rise of agent-to-agent communication—and the behind-the-scenes story of how three competitive AI labs came together to donate their protocols and agents to a neutral foundation, why enterprises are deploying MCP servers faster than anyone expected (most of it invisible, internal, and at massive scale), what it takes to design a protocol that works for both simple tool calls and complex multi-agent orchestration, how the foundation will balance taste-making (curating meaningful projects) with openness (avoiding vendor lock-in), and the 2025 vision: MCP as the communication layer for asynchronous, long-running agents that work while you sleep, discover and install their own tools, and unlock the next order of magnitude in AI productivity. We discuss: The one-year MCP journey: from local stdio servers to remote HTTP streaming, OAuth 2.1 authentication (and the enterprise lessons learned), long-running tasks, and MCP Apps (iframes for richer UI) Why MCP adoption is exploding internally at enterprises: invisible, internal servers connecting agents to Slack, Linear, proprietary data, and compliance-heavy workflows (financial services, healthcare) The authentication evolution: separating resource servers from identity providers, dynamic client registration, and why the March spec wasn't enterprise-ready (and how June fixed it) How Anthropic dogfoods MCP: internal gateway, custom servers for Slack summaries and employee surveys, and why MCP was born from "how do I scale dev tooling faster than the company grows?" Tasks: the new primitive for long-running, asynchronous agent operations—why tools aren't enough, how tasks enable deep research and agent-to-agent handoffs, and the design choice to make tasks a "container" (not just async tools) MCP Apps: why iframes, how to handle styles and branding, seat selection and shopping UIs as the killer use case, and the collaboration with OpenAI to build a common standard The registry problem: official registry vs. 
curated sub-registries (Smithery, GitHub), trust levels, model-driven discovery, and why MCP needs "npm for agents" (but with signatures and HIPAA/financial compliance) The founding story of AAIF: how Anthropic, OpenAI, and Block came together (spoiler: they didn't know each other were talking to Linux Foundation), why neutrality matters, and how Jim Zemlin has never seen this much day-one inbound interest in 22 years — David Soria Parra (Anthropic / MCP) MCP: https://modelcontextprotocol.io https://uk.linkedin.com/in/david-soria-parra-4a78b3a https://x.com/dsp_ Nick Cooper (OpenAI) X: https://x.com/nicoaicopr Brad Howes (Block / Goose) Goose: https://github.com/block/goose Jim Zemlin (Linux Foundation) LinkedIn: https://www.linkedin.com/in/zemlin/ Agentic AI Foundation https://agenticai.foundation Chapters 00:00:00 Introduction: MCP's First Year and Foundation Launch 00:01:17 MCP's Journey: From Launch to Industry Standard 00:02:06 Protocol Evolution: Remote Servers and Authentication 00:08:52 Enterprise Authentication and Financial Services 00:11:42 Transport Layer Challenges: HTTP Streaming and Scalability 00:15:37 Standards Development: Collaboration with Tech Giants 00:34:27 Long-Running Tasks: The Future of Async Agents 00:30:41 Discovery and Registries: Building the MCP Ecosystem 00:30:54 MCP Apps and UI: Beyond Text Interfaces 00:26:55 Internal Adoption: How Anthropic Uses MCP 00:23:15 Skills vs MCP: Complementary Not Competing 00:36:16 Community Events and Enterprise Learnings 01:03:31 Foundation Formation: Why Now and Why Together 01:07:38 Linux Foundation Partnership: Structure and Governance 01:11:13 Goose as Reference Implementation 01:17:28 Principles Over Roadmaps: Composability and Quality 01:21:02 Foundation Value Proposition: Why Contribute 01:27:49 Practical Investments: Events, Tools, and Community 01:34:58 Looking Ahead: Async Agents and Real Impact
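For readers who have only heard MCP described in the abstract: the "simple tool calls" end of the spectrum the panel mentions is genuinely small in code. A minimal server built with the official Python SDK's FastMCP helper looks roughly like this (the word_count tool is a made-up example; clients such as Claude Desktop or Goose would discover and call it over stdio):

```python
# Minimal MCP server using the official Python SDK's FastMCP helper.
# One tool, served over stdio, is enough for an MCP client to discover
# and call; richer transports and primitives layer on top of this.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("hello-mcp")

@mcp.tool()
def word_count(text: str) -> int:
    """Count whitespace-separated words in a string."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```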
Note: Steve and Gene's talk on Vibe Coding and the post-IDE world was one of the top talks of AIE CODE: https://www.youtube.com/watch?v=7Dtu2bilcFs&t=1019s&pp=0gcJCU0KAYcqIYzv From building legendary platforms at Google and Amazon to authoring one of the most influential essays on AI-powered development (Revenge of the Junior Developer, quoted by Dario Amodei himself), Steve Yegge has spent decades at the frontier of software engineering, and now he's leading the charge into what he calls the "factory farming" era of code. After stints at Sourcegraph and building Beads (a purely vibe-coded issue tracker with tens of thousands of users), Steve co-authored The Vibe Coding Book and is now building VC (VibeCoder), an agent orchestration dashboard designed to move developers from writing code to managing fleets of AI agents that coordinate, parallelize, and ship features while you sleep. We sat down with Steve at AI Engineer Summit to dig into why Claude Code, Cursor, and the entire 2024 stack are already obsolete, what it actually takes to trust an agent after 2,000 hours of practice (hint: they will delete your production database if you anthropomorphize them), why the real skill is no longer writing code but orchestrating agents like a NASCAR pit crew, how merging has become the new wall that every 10x-productive team is hitting (and why one company's solution is literally "one engineer per repo"), the rise of multi-agent workflows where agents reserve files, message each other via MCP, and coordinate like a little village, why Steve believes if you're still using an IDE to write code by January 1st, you're a bad engineer, how the 12–15 year experience bracket is the most resistant demographic (and why their identity is tied to obsolete workflows), the hidden chaos inside OpenAI, Anthropic, and Google as they scale at breakneck speed, why rewriting from scratch is now faster than refactoring for a growing class of codebases, and his 2025 prediction: we're moving from subsistence agriculture to John Deere-scale factory farming of code, and the Luddite backlash is only just beginning. We discuss: Why Claude Code, Cursor, and agentic coding tools are already last year's tech, and what comes next: agent orchestration dashboards where you manage fleets, not write lines The 2,000-hour rule: why it takes a full year of daily use before you can predict what an LLM will do, and why trust = predictability, not capability Steve's hot take: if you're still using an IDE to develop code by January 1st, 2025, you're a bad engineer, because the abstraction layer has moved from models to full-stack agents The demographic most resistant to vibe coding: 12–15 years of experience, senior engineers whose identity is tied to the way they work today, and why they're about to become the interns Why anthropomorphizing LLMs is the biggest mistake: the "hot hand" fallacy, agent amnesia, and how Steve's agent once locked him out of prod by changing his password to "fix" a problem Should kids learn to code?
Steve's take: learn to vibe code, meaning understand functions, classes, architecture, and capabilities in a language-neutral way, but skip the syntax The 2025 vision: "factory farming of code" where orchestrators run Claude Code, scrub output, plan-implement-review-test in loops, and unlock programming for non-programmers at scale — Steve Yegge X: https://x.com/steve_yegge Substack (Stevie's Tech Talks): https://steve-yegge.medium.com/ GitHub (VC / VibeCoder): https://github.com/yegge-labs Where to find Latent Space X: https://x.com/latentspacepod Substack: https://www.latent.space/ Chapters 00:00:00 Introduction: Steve Yegge on Vibe Coding and AI Engineering 00:00:59 The Backlash: Who Resists Vibe Coding and Why 00:04:26 The 2000 Hour Rule: Building Trust with AI Coding Tools 00:03:31 The January 1st Deadline: IDEs Are Becoming Obsolete 00:02:55 10X Productivity at OpenAI: The Performance Review Problem 00:07:49 The Hot Hand Fallacy: When AI Agents Betray Your Trust 00:11:12 Claude Code Isn't It: The Need for Agent Orchestration 00:15:20 The Orchestrator Revolution: From Claude Code to Agent Villages 00:18:46 The Merge Wall: The Biggest Unsolved Problem in AI Coding 00:26:33 Never Rewrite Your Code - Until Now: Joel Spolsky Was Wrong 00:22:43 Factory Farming Code: The John Deere Era of Software 00:29:27 Google's Gemini Turnaround and the AI Lab Chaos 00:33:20 Should Your Kids Learn to Code? The New Answer 00:34:59 Code MCP and the Gossip Rate: Latest Vibe Coding Discoveries
In this repeat episode, Kent C. Dodds came back on to the podcast with bold ideas and a game-changing vision for the future of AI and web development. In this episode, we dive into the Model Context Protocol (MCP), the power behind Epic AI Pro, and how developers can start building Jarvis-like assistants today. From replacing websites with MCP servers to reimagining voice interfaces and AI security, Kent lays out the roadmap for what's next, and why it matters right now. Don't miss this fast-paced conversation about the tools and tech reshaping everything. Links Website: https://kentcdodds.com X: https://x.com/kentcdodds Github: https://github.com/kentcdodds YouTube: https://www.youtube.com/c/kentcdodds-vids Twitch: https://www.twitch.tv/kentcdodds LinkedIn: https://www.linkedin.com/in/kentcdodds Resources Please make Jarvis (so I don't have to): https://www.epicai.pro/please-make-jarvis AI Engineering Posts by Kent C. Dodds: https://www.epicai.pro/posts We want to hear from you! How did you find us? Did you see us on Twitter? In a newsletter? Or maybe we were recommended by a friend? Let us know by sending an email to our producer, Em, at emily.kochanek@logrocket.com (mailto:emily.kochanek@logrocket.com), or tweet at us at PodRocketPod (https://twitter.com/PodRocketpod). Follow us. Get free stickers. Follow us on Apple Podcasts, fill out this form (https://podrocket.logrocket.com/get-podrocket-stickers), and we'll send you free PodRocket stickers! What does LogRocket do? LogRocket provides AI-first session replay and analytics that surfaces the UX and technical issues impacting user experiences. Start understanding where your users are struggling by trying it for free at LogRocket.com. Try LogRocket for free today. (https://logrocket.com/signup/?pdr)
Today I break down a big news item I think is flying under the radar: OpenAI quietly launched Skills for Codex, and I explain what that means (and how it differs from sub-agents and MCPs). I then share a fast-moving trend I'm watching and why it's a strong wedge for a simple app. After that, I recommend the to-do app I've used for 14 years and give away a startup idea. I close with a practical 6-step framework for going from idea → viral validation → mobile app launch in 2026. Timestamps 00:00 – Intro: the new format (news, trend, app, startup idea, framework) 00:40 – AI News Item: OpenAI launches Skills for Codex 05:45 – Trend: Face Yoga 07:56 – App Recommendation: Things 09:33 – Startup Idea: Call-an-expert service for non-developers stuck at 80% done 14:44 – Framework: Viral Mobile App Framework Key Points OpenAI "Skills" make Codex/ChatGPT more reusable and consistent by packaging repeatable workflows. A "skill" is the recipe, a "sub-agent" is extra worker instances, and an "MCP" is the tool-access plug. Face yoga is an emerging sub-niche with clear app potential (simple routines, monetization via paid or ads). Last 20 is a practical marketplace idea: pay for 15 minutes of expert unblock help to finish the last 20%. Viral validation favors apps that are visually obvious, explainable in three words, and tied to insecurity-driven outcomes. Numbered Section Summaries OpenAI Skills: The Quiet Upgrade I walk through OpenAI's launch of Skills for Codex: reusable bundles of instructions, scripts, and resources that can be called directly or chosen automatically. I'm excited because this makes agent workflows more consistent and scalable across tasks. The Foundation: Skill vs Sub-Agent vs MCP I clarify the taxonomy: a skill is the written playbook, sub-agents are extra "worker" copies of the model that split a big job, and MCPs are what let the model access external systems like tickets or repos. This is the mental model I want everyone using going into 2026 (sketched in code just after this item). The Trend: Face Yoga As An App Wedge I share a niche trend I'm seeing, face yoga, and why it's a product opportunity similar to how yoga apps became huge. I call out the obvious app angles: guided routines, jawline/face-slimming programs, and content-driven growth via short videos. The Tool: Things (My Simple Focus System) I recommend the Things to-do app because it's simple: "Today," "Upcoming," and "Someday," without a monthly fee. I also note what's missing (I'd like more AI features), but it still wins for focus if you don't want a "kitchen sink" system. The Startup Idea: Last 20 (Phone-A-Friend For Vibe Coders) I give away the idea: builders get stuck at 80% after using Cursor/Replit/V0, so Last 20 matches them with someone who's solved that exact wall before. The product is a fast screen-share session, problem solved, priced per session or bundled for teams/agencies, with the marketplace taking a cut. The Distribution Framework: Viral Validation → Launch I share a 6-step process: warm up the account, design a visually obvious app, build a tiny MVP fast, post daily until something hits, build the community before the product, then launch with a hard paywall and keep content rolling. It's a simple playbook for getting to organic traction in 2026. The #1 tool to find startup ideas/trends - https://www.ideabrowser.com LCA helps Fortune 500s and fast-growing startups build their future - from Warner Music to Fortnite to Dropbox.
We turn 'what if' into reality with AI, apps, and next-gen products https://latecheckout.agency/ The Vibe Marketer - Resources for people into vibe marketing/marketing with AI: https://www.thevibemarketer.com/ FIND ME ON SOCIAL X/Twitter: https://twitter.com/gregisenberg Instagram: https://instagram.com/gregisenberg/ LinkedIn: https://www.linkedin.com/in/gisenberg/
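One way to keep the skill / sub-agent / MCP taxonomy from the episode above straight is to see where each concept would live in code. Everything here is invented for illustration; it mirrors the recipe / extra-workers / tool-plug analogy, not any vendor's real API:

```python
# Illustrative only: the skill / sub-agent / MCP mental model as code
# shapes. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Skill:
    """The recipe: a reusable, written-down workflow for the model."""
    name: str
    instructions: str                      # e.g. a SKILL.md-style playbook
    resources: list[str] = field(default_factory=list)

@dataclass
class MCPServer:
    """The tool-access plug: an external system the agent can reach."""
    name: str
    tools: list[str]

def run_with_subagents(skill: Skill, mcp: MCPServer, shards: list[str]) -> list[str]:
    """Sub-agents: extra worker instances that split one big job."""
    def worker(shard: str) -> str:
        # Each worker follows the same recipe and shares the same tool plug.
        return f"[{skill.name}] used {mcp.tools[0]} on {shard}"
    return [worker(s) for s in shards]

print(run_with_subagents(
    Skill("triage", "Label each ticket by severity, then summarize."),
    MCPServer("ticket-system", ["search_tickets"]),
    shards=["tickets 1-50", "tickets 51-100"],
))
```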
Who's going to win the Super Bowl? What about the latest season of Survivor? Or the race to be the next chair of the Federal Reserve? Who will be Portugal's next president? How many times will Elon Musk tweet in the next week? On Polymarket, and other prediction markets, you can bet on all these things and more. Are we entering a world in which everything is gambling and gambling is everything? Bloomberg's Joe Weisenthal joins the show to explain the rise of prediction markets, what's betting and what's investing, and more. Then, The Verge's Hayden Field teaches us about Model Context Protocol, a wonky bit of AI infrastructure that might be key to making AI agents work. MCP is barely a year old, and practically all of tech is ready to embrace it. Finally, Hayden helps David answer a question on the Vergecast Hotline (call 866-VERGE11 or email vergecast@theverge.com!) about why every AI company seems to want you to go shopping. Further reading: Are prediction markets gambling? Robinhood CEO Vlad Tenev is betting not Election night at Kalshi HQ Joe Weisenthal at Bloomberg From Bloomberg: My Biggest Question About Prediction Markets Anthropic launches tool to connect AI systems directly to datasets AI companies want a new internet — and they think they've found the key Subscribe to The Verge for unlimited access to theverge.com, subscriber-exclusive newsletters, and our ad-free podcast feed. We love hearing from you! Email your questions and thoughts to vergecast@theverge.com or call us at 866-VERGE11. Learn more about your ad choices. Visit podcastchoices.com/adchoices
In this episode we are joined by Jaime from Across the Bifrost to recount his journey playing 250 games of mutant affiliations in Marvel Crisis Protocol. Through the course of our conversation, we talk about setting goals and maximizing fun, and how you can reshape your personal MCP goals effectively. Lastly, we deep dive all of the mutant affiliations and the strengths and weaknesses Jaime found in each affiliation along the way on his 250-game journey. Vote on Jaime's Sub Affiliation MCP Longshanks Request: https://www.longshanks.org/request/409/ Fury's Finest is a podcast and resource devoted to the discussion of the tabletop game Marvel Crisis Protocol. Fury's Finest is supported by our wonderful patrons on Patreon. If you would like to help the show, go to patreon.com/furysfinest and pledge your support. Fury's Finest Patrons directly support the show and its growth by helping pay our monthly and annual fees, while contributing to future projects and endeavors. Fury's Finest is sponsored by MR Laser: https://mr-laser.square.site/ use our code furysfinest at checkout. Check out our Fury's Finest apparel and merchandise on TeePublic. Twitch | twitch.tv/furysfinest Twitter | @FurysFinestCast Instagram | @FurysFinest Facebook | Fury's Finest YouTube | Fury's Finest Apple Podcasts | Spotify | Google Podcasts Thanks to Approaching Nirvana for our music. Help spread the word of our show. Subscribe, rate, and review! Email us at: FurysFinest@gmail.com
Amanda Silberling of TechCrunch joins Mikah Sargent on Tech News Weekly this week! The former CEO of Hinge left his position this week to launch an AI-powered dating app. Pebble is coming out with its take on a smart ring. What is the AI Model Context Protocol? And could grocery delivery services be using AI to charge different prices for groceries to consumers? Amanda talks about a new AI-powered dating app called Overtone that the former CEO of Hinge, Justin McLeod, has founded. Pebble is coming out with its own smart ring with a built-in microphone, and Mikah has some quarrels with the device. Mikah talks about the Model Context Protocol, or MCP: an approach companies like Google and OpenAI have adopted that would allow AI agents to access information online in a standardized manner easily, and now Anthropic has donated the protocol to the Linux Foundation. And Derek Kravitz of Consumer Reports joins the show to talk about its investigation into Instacart utilizing artificial intelligence that would offer different prices of the same product to consumers. Hosts: Mikah Sargent and Amanda Silberling Guest: Derek Kravitz Download or subscribe to Tech News Weekly at https://twit.tv/shows/tech-news-weekly. Join Club TWiT for Ad-Free Podcasts! Support what you love and get ad-free audio and video feeds, a members-only Discord, and exclusive content. Join today: https://twit.tv/clubtwit Sponsors: ventionteams.com/twit threatlocker.com/twit pantheon.io cachefly.com/twit
December 2025's Patch Tuesday brought major shifts, but the real action is in Microsoft's pricing, privacy battles, and the arms race to control AI-enabled browsers. Plus, Paul recommends Tiny11 Builder for a clean install, or Win11Debloat for an existing install. Then, Rufus to create installation media without the forced Microsoft account (MSA) sign-in or hardware requirement checks. Use MSEdgeRedirect to use the default web browser for stories from Widgets, web-based search results, etc. And ExplorerPatcher can fix the performance and reliability issues in File Explorer. It's the final Patch Tuesday of 2025. Major dark mode updates (with a fix for the "flashbang" problem). AI Agent in Settings, Click to Do, Windows Studio Effects, and Search improvements for Copilot+ PCs. Many other improvements: FSE, Share, Settings, Widgets, more. More Windows 11: New 25H2 preview build on Beta/Dev adds MCP public preview, Quick Machine Recovery auto-enabled, Unified Update Orchestration Platform, Windows MIDI services. Microsoft 365: Microsoft 365 is getting a lot more expensive in mid-2026. You didn't think all those free AI updates were free, did you? AI: Paul has been talking about "programmatic" apps and services because he wasn't sure of a term for this type of interaction. But there is a term for this: semantic. As in the semantic web. And there you go. Microsoft is one of 1,000 companies partnering on the Agentic AI Foundation, because you're getting agents whether they work or not. Gartner says NO to AI web browsers. The New York Times is suing Perplexity for all the obvious reasons, after a big win in the legal battle with OpenAI. Opera for Android gets a big AI update. Google Workspace Studio brings code-free agent creation to business users; automation is a solid AI use case. Xbox: Xbox Series X|S notably absent during Black Friday sales. Call of Duty won't repeat the mistakes of the past anymore, since it didn't work out twice now. MS Flight Simulator 2024 is now available on PS5. Red Dead Redemption comes to mobile for the first time, free with a Netflix account. Tips & Picks: Tip and app(s) of the week: De-enshittify Windows 11. RunAs Radio this week: Incident Management and the CrowdStrike Event with Liam Westley. Brown liquor pick of the week: Old Farm Pennsylvania Straight Rye Whiskey. Hosts: Leo Laporte, Paul Thurrott, and Richard Campbell. Download or subscribe to Windows Weekly at https://twit.tv/shows/windows-weekly Check out Paul's blog at thurrott.com The Windows Weekly theme music is courtesy of Carl Franklin. Join Club TWiT for Ad-Free Podcasts! Support what you love and get ad-free audio and video feeds, a members-only Discord, and exclusive content. Join today: https://twit.tv/clubtwit Sponsors: 1password.com/windowsweekly auraframes.com/ink helixsleep.com/windows ventionteams.com/twit