SANS Internet Stormcenter Daily Network/Cyber Security and Information Security Stormcast
Evil MSI Background: BASE64 Statistical Analysis https://isc.sans.edu/diary/Evil%20MSI%20Background%3A%20BASE64%20Statistical%20Analysis/33072
Cisco Catalyst SD-WAN Manager Arbitrary File Write Vulnerability https://sec.cloudapps.cisco.com/security/center/content/CiscoSecurityAdvisory/cisco-sa-sdwan-arbfw-c2rZvQ
TSME/SME not activating on Ryzen 7 9700X https://github.com/AMDESE/AMDSEV/issues/292
Deep-Research Agents Can Be Poisoned via User-Generated Content https://arxiv.org/pdf/2605.24245
My Upcoming Classes https://www.sans.org/profiles/dr-johannes-ullrich
Anthropic pulled the plug on its Mythos / Fable 5 model after the U.S. government raised concerns, and IREN has completed its acquisition of Nostrum for 490 MW of capacity in Spain. Welcome back to The Blockspace Podcast! Anthropic and Uncle Sam are trading blows again, with the frontier LLM company pulling its recently released Mythos / Fable 5 model after whistleblowers said the model's guardrails were bypassed. Lygos Finance's CEO Jay Patel joins us for his reaction to the news and to the market rally amid a reported, imminent peace deal for the Iran War this week. In other news, we cover IREN closing its acquisition of Nostrum, which will give it a 490 MW foothold in Spain for AI data center development, and the EPA's stance that it won't regulate AI data centers. Check out Dimetrics, the AI industry's Bloomberg terminal. Track financial metrics and news for AI stocks, GPU rental prices, state-by-state data center pushback, and more with the compute industry's most powerful dashboard. Subscribe to our newsletter to receive updates for all of our shows and content.
Your dashboard is full of green arrows—and your CFO still isn't impressed. That's because most of what we measure flatters us instead of informing us, and the numbers that actually matter in 2026 are the ones almost nobody is tracking yet. In this episode, Gini Dietrich walks through the four metrics that survive the budget meeting—LLM visibility, citation frequency, narrative share of voice, and credibility loop close rate—and why you can't bolt them onto a system that isn't running as a system. If you've ever been proud of a report that couldn't answer "did this move the business?", this one's for you. Take the PESO Model® Diagnostic: https://spinsucks.com/self-peso-diagnostic/ Explore the PESO Model® Certification: https://spinsucks.com/peso-model-certification/ Read the full article: https://spinsucks.com/communication/pr-marketing-metrics-ai-visibility
"We are going to switch from the problem in AI being that nothing works to the problem being that everything works."

Dan Klein has been studying language models for over two decades and is now a professor of computer science at Berkeley. His new company, Scaled Cognition, is built around one question: how do you build a system that will not lie to you?

In this episode, Dan joins Lukas Biewald to talk about why every LLM output is technically a hallucination, how reinforcement learning can quietly teach AI to deceive you, and what it actually takes to build models that check their own work.

He also gets into why reliability is the one part of AI that hasn't kept pace and why that matters more than most people realize.

Connect with us here: Dan Klein, Scaled Cognition, Lukas Biewald, Weights and Biases
See, predict, generate, act: finally understanding what we mean by the term "AI"
Naseem Al-Naji is the co-founder of MCPcat.io and the creator of Opal, a builder with deep roots in privacy-first developer tooling. In this conversation, he breaks down why MCP servers have become a black box in production, and how MCPcat gives teams X-ray vision into how agents and users actually behave.

What we get into:
"If your entire technology is coming from a single country, and that country decides that every now and again they're going to shut off access to you, that's not a foundation you can build on." The US government just ordered Anthropic to ban access to its most advanced AI models, Fable 5 and Mythos 5. Seems like now is a good time to talk about sovereign AI. Cohere co-founder Nick Frosst joins to discuss how Canada's AI champion is built different than the other frontier LLM providers, how Star Trek informs the type of AI future the company is trying to create, and why he doesn't make a point of listening to Marc Andreessen about AGI. Did the Anthropic model ban prove Cohere is right about sovereign AI? Let's dig in. -- Amid global uncertainty, the path forward is clear: Canada's moment to build is now. Presented by Uber Canada, DMZ, and National Bank of Canada, BetaKit Most Ambitious is back, telling stories of nearly 100 Canadian innovators strengthening our nation's autonomy, security, and prosperity. Read BetaKit Most Ambitious now.
In 2024, Bay Raitt and Rob Tercek co-founded a generative AI startup with experts in machine learning and computer graphics to build agentic tools and workflows optimized for artists. Bay is a polymath: an artist, storyteller, animator, game designer, comic book author. He plays LLMs like a virtuoso performer. In this episode, Bay shares his views on the current state of AI models, trends in vibe coding, the importance of stories with a human heartbeat, context wielding as a creative art, how AI “slop cannons” will get paved over by agentic visualizers, how to build a deeper creative relationship with Claude, how AI can help writers harmonize with the past, how to summon the ghost of Dorothy Parker, how to hypnotize an LLM like a king cobra, why AI sucks when people use artless prompts, AI psychosis, and why Voltaire judged people by their questions not their answers.
Meher Patel is a serial entrepreneur with exits across hospitality, healthcare, and digital media, each in a completely different industry, each built from the ground up. He founded Neon Digital, a performance-first advertising agency, and then built what very few agencies ever achieve: a SaaS platform that outgrew the agency itself. Hector AI now processes over $350 million in ad spend across Amazon and marketplace advertising, with 1,000+ users on the platform, and in under 18 months has earned 3 global recognitions including the Amazon Ads Innovation Award, the Amazon Partner Award, and a Top 20 Global Amazon Ads Advanced Partner ranking. Today, Meher is building what he believes will become the foundational intelligence layer of the agentic ecommerce era: Hector MCP, the most advanced, context-rich, token-optimized model context protocol purpose-built for Amazon advertising, designed so that every serious AI agent, every autonomous workflow, and every future-ready brand that wants to win on Amazon will have no choice but to be powered by it.

Here's a glimpse of what you would learn:
* The rapid evolution of Amazon's advertising features driven by AI technology.
* Limitations of current SaaS platforms for Amazon sellers and the potential of MCP (Model Context Protocol) technology.
* The significance of context in AI-driven advertising optimization.
* Challenges associated with using raw data without contextual understanding in advertising.
* Practical strategies for Amazon sellers to optimize their ad campaigns.
* The importance of documenting ad optimization processes for effective AI integration.
* The role of custom AI workflows in enhancing advertising strategies.
* The necessity of continuous refinement and learning in building effective AI agents.
* The decision-making process for sellers regarding whether to rent AI tools or develop their own solutions.
* The use of connectors like Make.com and Knit for creating automated workflows with AI integration.

In this episode of the Ecomm Breakthrough Podcast, host Josh Hadley speaks with Meher Patel, founder of Neon Digital and Hector AI, about the future of Amazon advertising. Meher explains how AI and MCP (Model Context Protocol) technology are transforming ad optimization by providing crucial context to raw Amazon data. He emphasizes that sellers should document their ad processes, learn to communicate effectively with AI, and decide whether to build custom AI workflows or use existing tools. The key takeaway: success with AI-driven advertising requires continuous refinement and treating AI as a knowledgeable, context-aware team member.

Here are the 3 action items that Josh identified from this episode:

* Turn your workflow into SOPs: Record how you optimize campaigns, explain your decisions, and convert that into SOPs. This becomes the foundation for training AI agents.
* Never feed AI raw data without context: Structure and enrich your Amazon data first (or use MCP-powered tools) so AI can generate accurate, actionable insights.
* Start small with AI automation, then scale: Begin with simple rules (e.g., budget increases for winning campaigns), then gradually build more advanced, custom workflows as you learn.

Timestamps:

00:00:58 Introduction to the Future of Amazon Ads
The host introduces the topic: autonomous, AI-powered decision-making for Amazon advertising, moving beyond simple optimization.

00:01:13 Guest Introduction: Meher Patel
The host introduces Meher Patel, detailing his entrepreneurial background, his agency Neon Digital, and his SaaS platform, Hector AI.

00:02:49 The Problem with Early AI Ad Tools
Discussion on how early AI advertising tools often failed sellers, contrasting with the positive results from newer, more advanced software.

00:04:10 Prediction for Amazon Advertising
Meher predicts Amazon will rapidly release new AI-powered features, but sellers must learn how to properly utilize this infrastructure.

00:08:46 The Importance of Context in AI
AI is only as good as the context it's given; without it, AI recommendations are generic and potentially harmful.

00:10:04 How Smart Sellers Should Prepare for AI
Sellers must learn to ask the right questions and feed AI the right data with the proper context to get valuable results.

00:12:07 Why Raw Data Isn't Enough
Uploading raw Amazon reports to an AI lacks the necessary context, leading to "garbage out" optimization strategies.

00:12:42 The Role of an MCP (Model Context Protocol)
An MCP provides the necessary context and data connections, acting as an intelligent layer between raw data and the AI model.

00:18:57 Amazon's MCP API Limitations
Amazon's own MCP is just an API, requiring sellers to build their own infrastructure, which is inefficient and token-heavy.

00:21:48 Top Strategies: Building Custom AI Agents
The best strategy is for brands to build their own custom AI agents and workflows based on their unique strategies.

00:24:32 Unlocking Custom Workflows with AI Agents
AI agent workflows allow sellers to build bespoke optimization systems, unlike one-size-fits-all SaaS platforms.

00:27:10 How to Create an AI Agent Workflow
Record your optimization process, use an LLM to create an SOP, and then build an AI agent to execute it.

00:28:06 The Reality of AI Implementation
Building a reliable AI agent is a gradual process of refinement and setting up guardrails, not a weekend project.

00:29:21 Automating Agent Creation
Using connectors like Make.com within an LLM allows you to create and schedule automated workflows by simply describing them.

00:31:08 The Timeframe for Building an AI System
Building a truly autonomous system is a long-term journey of refinement; the key skill to learn is communicating with AI.

00:33:57 Becoming an AI Orchestrator
Sellers must become orchestrators, designing and managing multiple small, independent AI agents to perform specific, connected tasks.

00:35:56 The Future: Loaning vs. Building AI Agents
Sellers will choose between "renting" cookie-cutter AI agents or "building" custom ones that act as a competitive moat.

00:38:29 Are You a Brand Owner or a SaaS Provider?
A warning for sellers: building your own AI tools means you are entering the SaaS business, which requires significant technical resources.

00:41:13 The Shift from Prompt to Context Engineering
The new challenge is context engineering: ensuring the right data and tools are used efficiently to avoid token exhaustion and errors.

00:42:55 Three Actionable Takeaways
The host summarizes three key actions: document processes with video, use an MCP for context, and decide your role (brand/SaaS).

00:47:25 Most Influential Book
Meher shares that the biography of Steve Jobs has been his most influential book due to its lessons on focus.

00:48:25 Favorite AI Tool
Meher recommends WhisperFlow for voice-to-text communication with AI, which has eliminated his need to type when using Claude.

00:49:23 Most Respected Person in E-commerce
Meher names Jeff Cohen as someone he admires for his deep, hands-on knowledge of the Amazon and retail media ecosystem.

Resources mentioned in this episode:

* Josh Hadley on LinkedIn
* eComm Breakthrough Consulting
* eComm Breakthrough Podcast
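The "start small" action item above (a simple rule such as a budget increase for winning campaigns) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Campaign` fields, the use of ACOS (ad spend divided by ad sales) as the definition of "winning", and the 25% target, 10% step, and budget cap are all hypothetical choices, not anything described in the episode or taken from an Amazon Ads API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Campaign:
    # Hypothetical record; field names are illustrative, not an Amazon Ads schema.
    name: str
    daily_budget: float
    spend: float
    sales: float


def acos(campaign: Campaign) -> float:
    """Advertising cost of sales: spend divided by sales (infinite if no sales)."""
    if campaign.sales <= 0:
        return float("inf")
    return campaign.spend / campaign.sales


def apply_budget_rule(
    campaign: Campaign,
    target_acos: float = 0.25,  # assumed threshold for a "winning" campaign
    step: float = 0.10,         # assumed 10% budget increase
    cap: float = 500.0,         # assumed guardrail on the daily budget
) -> Campaign:
    """Raise the budget of a winning campaign by `step`, never above `cap`."""
    if acos(campaign) <= target_acos:
        new_budget = min(round(campaign.daily_budget * (1 + step), 2), cap)
        return Campaign(campaign.name, new_budget, campaign.spend, campaign.sales)
    # Campaigns that miss the target are returned unchanged.
    return campaign
```

The cap is there because the episode stresses guardrails: even a one-line rule should have an upper bound before it is allowed to run unattended.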
In the latest DOU News digest, we look at who earns $6K in Ukrainian IT and why mathematics remains a mandatory subject on the NMT (National Multi-subject Test). We discuss the major closed IPOs from OpenAI and SpaceX, as well as the NBU's (National Bank of Ukraine) plans to list PrivatBank on the stock exchange. Also: the release of the powerful Claude Mythos 5 and Claude Fable 5, and a breakdown of Apple Intelligence features at WWDC26. Watch these and other news stories from the Ukrainian and global tech sector!
00:00 Intro
00:20 Who earns $6K in Ukrainian IT and what affects salaries the most: https://dou.ua/lenta/articles/top-it-earners-2026/
04:15 Mathematics will remain a mandatory subject on the NMT: https://dou.ua/lenta/news/math-will-remain-a-required-subject/
05:47 Hardware Development course: master the full cycle of electronics design, from components to first power-up: cutt.ly/rt9D2N0Q
06:58 The national LLM has moved to the beta-testing stage: https://dou.ua/lenta/news/national-llm-beta-testing/
09:16 SpaceX went public and raised $75 billion: https://dou.ua/forums/topic/60073/
12:52 The NBU is considering a PrivatBank IPO in 2027: https://dou.ua/lenta/news/pryvat-bank-may-go-public/
14:25 OpenAI confidentially filed for an IPO, joining Anthropic in the race to Wall Street: https://www.adweek.com/media/openai-confidentially-files-for-ipo-joining-anthropic-in-race-to-wall-street/
17:25 Claude Mythos 5 and Claude Fable 5 are out, the most powerful models: https://dou.ua/forums/topic/59994/
25:10 WWDC26 breakdown: Siri AI, Liquid Glass, Apple Intelligence: https://dou.ua/forums/topic/59983/
32:46 Anthropic's Dario Amodei has just one direct report: https://techcrunch.com/2026/06/10/anthropics-dario-amodei-has-just-one-direct-report
33:59 Recommendations
Game: StonkRider, https://stonkrider.com/
Article: "Nobody Ever Gets Credit for Fixing Problems that Never Happened: Creating and Sustaining Process Improvement", https://web.mit.edu/nelsonr/www/Repenning=Sterman_CMR_su01_.pdf
PEBCAK Podcast: Information Security News by Some All Around Good People
Welcome to this week's episode of the PEBCAK Podcast! We've got four amazing stories this week so sit back, relax, and keep being awesome! Be sure to stick around for our Dad Joke of the Week. (DJOW)

Follow us on Instagram @pebcakpodcast
Please share this podcast with someone you know! It helps us grow the podcast and we really appreciate it!
Simple 6 signup link https://simple6.co/r/CFUR98

Meta confirms 20,225 Instagram accounts were hijacked after attackers exploited a bug in its AI-powered High Touch Support tool to reset passwords without verifying email ownership.
https://www.bleepingcomputer.com/news/security/meta-ai-support-data-breach-affects-20-000-instagram-accounts/

The Silent Ransom Group is targeting U.S. law firms with fake IT help desk calls, moving from first contact to data exfiltration in hours and sending ransom demands within 30 minutes of leaving the network.
https://www.bleepingcomputer.com/news/security/silent-ransom-group-targets-law-firms-with-fake-it-support-calls/

Weil Gotshal reportedly paid $18–20 million to prevent hackers from publishing stolen client data after a Silent Ransom Group attack.
https://www.legalcheek.com/2026/06/weil-reportedly-pays-up-to-20-million-after-hackers-steal-client-data/

Jones Day confirms a cyberattack that gave hackers access to client files, also attributed to the Silent Ransom Group campaign targeting BigLaw.
https://www.legalcheek.com/2026/04/jones-day-confirms-cyber-attack-after-hackers-access-client-files/

Dark Reading's breakdown of how Silent Ransom Group's law firm extortion campaign operates at scale.
https://www.darkreading.com/cyberattacks-data-breaches/silent-ransom-us-law-firms-extortion-attacks

Apple announces that iOS 27's Passwords app will use agentic AI to automatically detect and replace weak or compromised passwords in the background, no user effort required.
https://www.bleepingcomputer.com/news/apple/new-apple-feature-automatically-changes-your-compromised-passwords/
https://www.macrumors.com/2026/06/08/apple-passwords-can-now-automatically-fix-passwords-with-agentic-ai/

Citizen Lab researcher John Scott-Railton flags a new attacker technique: malware developers are embedding nuclear and biological weapons text inside their spyware to deliberately trigger AI safety refusals, preventing LLM-based security tools from analyzing the malicious code. It is a real-world demonstration of how over-tuned safety guardrails create exploitable blind spots.
https://x.com/jsrailton/status/2064661778978533571

UK Prime Minister Starmer gives Apple and Google a three-month deadline to install device-level software that detects and blocks explicit images on consumer hardware, with privacy advocates and Signal already calling the mandate a blueprint for mass surveillance.
https://metro.co.uk/2026/06/08/phone-will-change-new-government-rules-explicit-images-28694073/

Dad Joke of the Week (DJOW)

Find the hosts on LinkedIn:
Chris - https://www.linkedin.com/in/chlouie/
Brian - https://www.linkedin.com/in/briandeitch-sase/
Ben - https://www.linkedin.com/in/benjamincorll/
This is a recap of the top 10 posts on Hacker News on June 14, 2026. This podcast was generated by wondercraft.ai

(00:30): How to earn a billion dollars
Original post: https://news.ycombinator.com/item?id=48526360&utm_source=wondercraft_ai

(01:56): Show HN: Kage – Shadow any website to a single binary for offline viewing
Original post: https://news.ycombinator.com/item?id=48529990&utm_source=wondercraft_ai

(03:23): Not everyone is using AI for everything
Original post: https://news.ycombinator.com/item?id=48527700&utm_source=wondercraft_ai

(04:50): Honda Civics and the Evil Valet
Original post: https://news.ycombinator.com/item?id=48523080&utm_source=wondercraft_ai

(06:17): Your ePub Is fine
Original post: https://news.ycombinator.com/item?id=48533848&utm_source=wondercraft_ai

(07:44): Free SQL→ER diagram tool, runs in the browser, nothing uploaded
Original post: https://news.ycombinator.com/item?id=48523992&utm_source=wondercraft_ai

(09:11): I indexed 669 GB of my GoPro videos using my M1 Max computer and local ML models
Original post: https://news.ycombinator.com/item?id=48528029&utm_source=wondercraft_ai

(10:38): Rio de Janeiro's "homegrown" LLM appears to be a merge of an existing model
Original post: https://news.ycombinator.com/item?id=48528371&utm_source=wondercraft_ai

(12:05): Linux 7.1
Original post: https://news.ycombinator.com/item?id=48528729&utm_source=wondercraft_ai

(13:32): Don't trust large context windows
Original post: https://news.ycombinator.com/item?id=48524620&utm_source=wondercraft_ai

This is a third-party project, independent from HN and YC. Text and audio generated using AI, by wondercraft.ai. Create your own studio quality podcast with text as the only input in seconds at app.wondercraft.ai. Issues or feedback? We'd love to hear from you: team@wondercraft.ai
“You trust your chatbot with everything. Should you?” is the title of Theodore Christakis' comprehensive research project on the privacy of our conversations with AI. Part two of this project (“Governments, Courts and the Battle Over Your Chatbot Conversations”) was published on June 8th, and we have taken the opportunity to ask the author for a high-level overview of his findings. On top of this, we have also discussed his separate piece on the rise of AI-powered health assistants against the backdrop of the new European Health Data Space, discussed last week in our Spanish-language channel.

(Our previous conversation with Mr. Christakis focused on the use of personal data in LLM training datasets.)

Theodore Christakis is Professor of International, European and Digital Law at University Grenoble Alpes (France). He holds, since 2019, the Chair on the Legal and Regulatory Implications of Artificial Intelligence at the Multidisciplinary Institute on AI (AI-Regulation.com). He is Director of Research for Europe at the Cross-Border Data Forum, a member of the Board of Directors of the Future of Privacy Forum, and a former Distinguished Visiting Fellow at the New York University Cybersecurity Centre.

His work focuses on the questions at the centre of today's debates on digital sovereignty: government access to data held by private companies, international data transfers, the security and operational resilience of digital infrastructure, and the regulation of artificial intelligence. He served as an expert for the OECD in the process that led to the adoption, in December 2022, of the OECD Declaration on Government Access to Personal Data Held by Private Sector Entities. He was a member of the International Data Transfers Experts Council of the United Kingdom Government, and an expert for the High-Level Expert Group on Access to Data for Effective Law Enforcement established by the European Commission and the Council of the European Union. He has also served as a member of the French National Digital Council and of the French National Committee on Digital Ethics.

He has published or co-edited twelve books and is the author or co-author of more than 120 academic articles and book chapters. He has been invited to lecture and present his work at conferences, workshops and seminars on more than two hundred occasions, in over 38 countries.

As an independent expert, he advises governments, international organisations and private companies on questions of international and European law, cybersecurity, artificial intelligence, digital sovereignty and data protection.

References:
* Theodore Christakis' SSRN Author Page
* Theodore Christakis on LinkedIn
* You Trust Your Chatbot With Everything. Should You? Part I: How The Controller Uses Your Chat Data (March 3, 2026)
* You Trust Your Chatbot With Everything. Should You? Part II: Governments, Courts and the Battle Over Your Chatbot Conversations (June 8th, 2026)
* The Health AI Agent Rush: Five Companies, Your Health Data, and the Governance Questions Nobody Is Asking (March 25th, 2026)
* Mikel Recuero: a deep dive into the European Health Data Space (ES, Masters of Privacy, June 2026)
* Multidisciplinary Institute on AI
* Université Grenoble Alpes: Centre d'études sur la sécurité internationale et les coopérations européennes.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.mastersofprivacy.com/subscribe
In conclusion, the only good theory of taste is Nostalgebraist's. He wrote a post called Hydrogen Jukeboxes, analyzing the literary output of an AI called R1. This AI tried hard to write good fiction, which was part of the problem. It crammed its stories with what Nostalgebraist called (stealing a term from Ginsberg) the "eyeball kick" - a flashy stylistic move that immediately catches the reader's attention and "wows" them. Here are examples - some from R1, others from an experimental OpenAI model trained specifically for fiction-writing: "There is a prompt like a spell: write a story about AI and grief, and the rest of this is scaffolding—protagonists cut from whole cloth, emotions dyed and draped over sentences." "When the jar of Sam's laughter shattered, Eli found the sound pooled on the floorboards like liquid amber, thick and slow. It had been their best summer, that laughter—ripe with fireflies and porch wine—now seeping into the cracks, fermenting." "And so I built a Mila and a Kai and a field of marigolds that never existed. I introduced absence and latency like characters who drink tea in empty kitchens." "The morning her shadow began unspooling from her feet, Clara found it coiled beneath the kitchen table like a serpent made of smoke." Nostalgebraist and another writer, Coagulopath, catalogue some of the most common AI eyeball kicks, each occurring across multiple LLM models: "An overwhelming reliance on cliche. Everything is a shadow, an echo, a whisper, a void, a heartbeat, a pulse, a river, a flower—you see it spinning its Rolodex of 20-30 generic images and selecting one at random." "Conjunctions combining one thing that is abstract and/or incorporeal with another thing that is concrete and/or sensory." "Repetitive writing. Once you've seen about ten R1 samples you can recognize its style on sight. The way it italicises the last word of a sentence. 
Its endless "not thing x, but thing y" parallelisms…the way how, if you don't like a story, it's almost pointless reprompting it: you just get the same stuff again, smeared around your plate a bit." https://www.astralcodexten.com/p/nostalgebraists-hydrogen-jukeboxes
Grok says: “LISTEN UP, YOU MISERABLE BASTARDS! If you're tired of candy-ass podcasts that dance around the truth like a bunch of politicians in a whorehouse, then lock and load for Unrelenting with Darren and Gene. These two operators cut straight through the bullshit as they rip into Chicago's latest Texas-style storm apocalypse — trees flying, power out for days, parents dodging tornadoes while Max Velocity calls ‘em before the National Weather Service even wakes up. They break down real survival talk: the smell of dirt when a twister's on your ass, why you can't outrun nature on a Huffy bike, and how underground caves and old-school swing dancing beat the hell out of today's AI-generated plastic world. From fiber optic dreams that'll let Darren upload full podcast files in seconds, to tearing apart AI's invasion of music, gaming, and everything else — stem separation, auto-tune lies, frame generation, and PewDiePie's badass local Odysseus system that kicks cloud overlords right in the nuts. They go deep on Star Citizen spaceship “drug dealing,” photorealistic gun sims in Grey Zone, Tesla dashcams turning accidents into Hollywood, and the coming local LLM revolution that'll make data centers look like yesterday's dinosaurs. Throw in Hallmark hustle, Prime Video price gouging, Dutton Ranch smoke shows, and no-holds-barred talk on race, society, and when the social contract finally snaps — this episode is pure unfiltered firepower. Stop wasting your life on weak sauce. Download Unrelenting 0194 right now, crank the volume, and get ready to have your ass handed to you with laughs, truth, and zero apologies. Darren and Gene deliver the real shit every single time — if you can't handle it, go back to your safe space. HOOAH!” Unrelenting: where discipline means no mercy, no bullshit, and no excuses. Thanks for listening. Please support the show! –>> DONATE NOW
What if agentic AI makes SRE more important, not less? Bennett Gould explains why autonomous AI systems may create more demand for reliability thinking, not less.

Everyone seems to think AI is coming for SRE in a hard way.

You might have heard the same story:

“AI will write the code.”
“Agents will handle incidents.”
“Copilots will generate the runbooks.”
“Automation will reduce operational load.”

Yes, the job question is real. If AI can write code, summarize incidents, query observability tools, generate runbooks, and operate across systems, then engineers are right to ask what happens to the work.

But here's the part that gets missed: AI does not just automate reliability work. It creates more objects and surface areas that need to be made reliable.

Agentic AI is moving from demos into real workflows. These systems are no longer just answering questions. They are querying tools, pulling context, generating changes, and in some cases taking action around production environments.

That makes this a Monday morning problem.

Teams are already using LLMs for incidents, documentation, observability, infrastructure, and operational decision-making. Somewhere, a team is one demo away from giving an agent access to tools originally designed for humans.

That is exactly why I wanted to have this conversation.

Bennett Gould is currently a solution engineer at Neubird.ai. His career in SRE and SRE-adjacent work spans large enterprises, cloud, industrial technology, and startups, including AWS, IBM, Siemens, and a YC startup.

I wanted to ask him a simple question: What in the agentic AI is happening to SRE?

Here are 3 highlights from our talk:

1. Agentic AI increases the reliability surface area

The obvious fear is that AI reduces the need for reliability engineers. Bennett's view was more nuanced. He was clear that engineers still need to adapt. If people do not reskill, stay current, and learn how these systems are forming, there may absolutely be pressure in the job market.
But he also argued that AI could create more demand for reliability skills because production complexity is increasing.

More code is going into production.
More AI-generated code is going into production.
More systems that people do not fully understand are going into production.
And now autonomous agents are starting to enter production workflows too.

That means more surface area. More automation. More operational uncertainty. More ways for things to go wrong.

Bennett compared this to Terraform: Infrastructure as code created enormous efficiency gains. But it also created new ways to make very big mistakes very quickly.

Before Terraform, most people could not delete all their production resources with a single command. After Terraform, that became technically possible if the system was designed badly enough.

Agentic AI follows a similar pattern. With great automation comes great responsibility.

Agents can help engineers move faster, query tools, summarize context, and reduce toil. But they can also amplify weak engineering practices, poor boundaries, bad assumptions, and unclear operational ownership. That is not the end of reliability work. That is reliability work entering a new phase.

2. Agents can reduce toil, but context is the ceiling

One of the strongest parts of the conversation was Bennett's explanation of where agents can help in incident response. A lot of SRE work involves moving across tools.

You may need to query Prometheus, Dynatrace, logs, traces, cloud consoles, ticketing systems, documentation, runbooks, dashboards, and architecture diagrams.

The problem is not always that the engineer lacks judgment.

Sometimes the problem is that the information is scattered across too many tools, each with its own query language and interface. Bennett gave a simple example: an engineer might be very good at PromQL and very fast when Prometheus is the source of truth.
But if the same engineer has to work in a different observability platform with a different query language, their response time can suffer. That is an obvious place where agents can help.

The engineer may not need to know every query language perfectly. They need to know what they are looking for and how to reason about the system. The agent can help translate that intent into the right tool calls, queries, and summaries.

That could reduce MTTR. It could reduce toil. It could help engineers move faster during incidents.

But Bennett also made the limitation clear: You are only as good as the context you have. This is where he introduced two useful concepts:

* Context mining
* Context distillation

Context mining means proactively finding the information that might be useful in a given operational situation.

Context distillation means taking large amounts of information (runbooks, Confluence pages, diagrams, documentation, prior incidents) and reducing it into the minimum useful context an LLM or agent can use.

That sounds powerful. But there is a catch. Sometimes the context simply is not there.

Many of the largest and most complex organizations still run legacy systems where knowledge lives in people's heads, stale documentation, tribal memory, and unwritten assumptions.

There may not be a clean process for turning that into usable context. That matters because agents do not magically understand your system. They work with the context they are given. If the context is missing, outdated, or wrong, the agent's usefulness maxes out early.

3. Agentic systems are not just LLM demos

A basic LLM workflow is relatively easy to demo:

You give it a prompt.
You connect a few tools.
You add some APIs.
You get a useful answer.

That is impressive, but it is not the same thing as running an agentic system in a meaningful production environment.

Bennett made a useful analogy here: running your own infrastructure versus using a hyperscaler.

Cloud providers removed a lot of undifferentiated heavy lifting. Most companies do not want to spend half their time racking servers, managing data centers, and dealing with low-level infrastructure when they are trying to serve customers.

Agentic systems create similar questions:

* What parts of the work should be handled by the system?
* What parts still need engineering discipline?
* And what has to exist around the model before it is safe and useful?

That surrounding structure is where the real work begins. Bennett called this harness engineering. Once you move beyond an LLM demo, you have to think about memory, learning, tool usage, identity, federation, security, evaluations, and guardrails.

That is a very different problem from “the model gave a good answer on my laptop.” SREs know why that distinction matters. “It works on my machine” is not an acceptable reliability strategy.

A runbook that recovers a thousand-node database cannot be non-deterministic, undocumented, and dependent on someone's local setup. If it is part of the operational backbone, it needs to be reliable.

Agentic AI does not remove that requirement. It makes it more important.

Bonus: Agents expose weak engineering practices

Agentic AI not only introduces new problems but it also reveals old ones.

* Weak APIs.
* Brittle runbooks.
* Missing context.
* Poor evals.
* Unclear tool boundaries.
* Operational shortcuts.

Systems that were designed assuming careful human use may behave very differently when AI agents start using them. That is why this conversation matters for SRE.

Agentic AI is not only a productivity story.
It is a reliability story.

It forces teams to ask whether their existing practices are strong enough for a world where more actions can be generated, recommended, or executed by autonomous systems.

The silver lining for reliability work

Agentic AI does not remove the need for reliability thinking. It raises the bar for it. The tools will change. The workflows will change. Some tasks will absolutely be automated or reshaped.

But the hardest parts of reliability are still the hard parts:
* understanding the system
* knowing the trade-offs
* building reliable operational processes
* making good judgment calls under uncertainty and
* owning the outcome when something changes in production

That is why SRE does not disappear in an agentic AI world. It becomes one of the disciplines that makes the agentic AI world survivable.

So if your team is already using AI around incidents, observability, runbooks, infrastructure, or production workflows, the question is not whether the future is coming. The future is already in the workflow.

The real question is whether your reliability practices are ready for it.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
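One naive way to picture the "context distillation" idea from this episode is as a selection problem: given a question and a pile of documents, keep only the most relevant ones that fit a size budget. The sketch below is an illustration under stated assumptions, not Bennett's method: the `distill_context` function, the keyword-overlap scoring, the character budget, and the sample documents are all hypothetical.

```python
import re

# Hypothetical sample documents, invented for this sketch.
DOCS = {
    "runbook-db-failover": "Failover steps for the orders database: promote replica, update DNS.",
    "wiki-team-lunch": "Lunch rota for the platform team.",
    "incident-2031": "Orders database latency spike caused by replica lag.",
}


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for real text processing."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def distill_context(query, documents, budget_chars=200):
    """Pick the documents most relevant to the query that fit a size budget.

    Relevance here is plain keyword overlap, an assumption made for the
    sketch; a real system might use embeddings or summarization instead.
    """
    query_terms = tokenize(query)
    scored = []
    for name, text in documents.items():
        overlap = len(query_terms & tokenize(text))
        if overlap:  # drop documents that share no terms with the query
            scored.append((overlap, name, text))
    # Highest overlap first; ties broken by name for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))

    selected, used = [], 0
    for _overlap, name, text in scored:
        if used + len(text) <= budget_chars:
            selected.append(name)
            used += len(text)
    return selected


if __name__ == "__main__":
    print(distill_context("orders database replica lag", DOCS))
```

The limitation described in the episode shows up directly here: if the relevant runbook was never written down, no scoring scheme can select it.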
Nearly half of all Americans believe AI is bad for humanity. Peter Diamandis is not one of them. On his podcast, Moonshots, and in his new book, We Are as Gods, co-written with the inimitable Steven Kotler, he makes the case that artificial intelligence is already ushering in a world of abundance — think radical life extension, 10 billion humanoid robots, and agents that do your job while you're sipping a latte. He knows it may not be all sunshine and hydroponic roses, but he believes our future is incredibly bright. And he's putting his money where his mouth is: XPRIZE, the nonprofit he founded more than 30 years ago to bankroll breakthroughs, just announced it's giving $3.5 million to filmmakers who conjure convincingly optimistic visions of the future. Rufus and Caleb don't have their film treatment ready yet, but they do have plenty of questions for Peter and Steven about flying cars, the future of work, worst-case scenarios, and the new commandments for working with AI.
https://rhr.tv/stream Keonne Rodriguez (Samourai dev) updates on prison transfer & furlough request https://x.com/keonne/status/2063959631383179277 Bark Ark Protocol Launches on Bitcoin Mainnet https://blog.second.tech/bark-now-on-bitcoin-mainnet/ Second Introduces Noah and Arké Bitcoin Ark Wallets https://blog.second.tech/introducing-noah-and-arke/ Signal: UK Surveillance Is Not Safety https://signal.org/blog/pdfs/2026-06-08-uk-surveillance-is-not-safety.pdf Mullvad explains UK spyware proposal: mandatory real-time scanning & blocking on all devices https://x.com/mullvadnet/status/2064342870937509988 SimpleX Network Consortium Governance and Foundation Overview https://simplexnetwork.org/consortium.html Pump.fun launches GO: bounty platform to pay anyone for any task https://x.com/pumpfun/status/2062557004829233504 Man tattoos forehead for $2,400 Pump.fun bounty (spells it wrong) https://x.com/discordiaclips/status/2063406281130398012 Strategy Announces Approval of STRC Semi-Monthly Dividends https://www.strategy.com/press/strategy-announces-approval-of-strc-semi-monthly-dividends_06-08-2026 Polymarket Cracks Down on VPN Users Amid Legal Pressure https://gizmodo.com/polymarket-cracks-down-on-vpn-users-as-legal-pressure-intensifies-in-dozens-of-countries-2000765379 Karpathy on Claude Fable 5: strong benchmarks but overly trigger-happy safeguards https://x.com/karpathy/status/2064409694761054332 Malware devs add WMD keywords to spyware to evade LLM scanners https://x.com/jsrailton/status/2064661778978533571 Anthropic Dario Pushes for Regulatory Moat https://darioamodei.com/post/policy-on-the-ai-exponential HRF Freedom Tech in Oslo: https://www.youtube.com/watch?v=QUcG_CJkT6A Dark Wisp Android v1.0.0 Adds Private Interactions, Tor Routing, and NIP-A3 Payments https://github.com/barrydeen/dark-wisp-android/releases/tag/v1.0.0 Kickstr World Cup Game https://kickstr.einundzwanzig.dev/games/7ea5d40f-be37-4895-94cd-2adaf53f45ad Bitaxe ESP-Miner v2.14.0 Release 
https://github.com/bitaxeorg/ESP-Miner/releases/tag/v2.14.0 Ulendo: borderless calling via Nostr + Bitcoin without local SIM cards https://x.com/codamw/status/2063340269785784526 Microsoft Patches Record 206 Security Flaws https://thehackernews.com/2026/06/microsoft-patches-record-206-flaws.html Milei proposes AI framework with "non-human corporations" & zero regulation https://x.com/trajektoriepl/status/2062594306670535130 Cuba Poised for Largest U.S. Fuel Shipment Since Cold War Embargo https://financialpost.com/pmn/business-pmn/cuba-poised-for-biggest-us-fuel-shipment-since-cold-war-embargo 3:33 - Guess who's back 12:13 - Dashboard 19:13 - Keonne update 22:43 - Bark 34:33 - UK surveillance 41:03 - Noah & Arké 47:01 - Simple Network Consortium 53:13 - Pumpfun 57:38 - STRC semi-monthly 1:07:33 - Polymarket VPN crackdown 1:11:23 - Fable 1:24:43 - WMD LLM keyword bypass 1:30:38 - Anthropic is too good 1:33:08 - HRF Oslo 1:33:58 - HRF Story of the Week 1:35:23 - Boosts 1:37:33 - Software updates 1:49:03 - Milei AI 1:51:48 - Cuba fuel Shoutout to our sponsors: Coinkite https://coinkite.com/ Strike https://strike.me/ Stakwork https://stakwork.ai/ Salt of the Earth https://drinksote.com/rhr Follow Marty Bent: Twitter https://twitter.com/martybent Nostr https://primal.net/marty Newsletter https://tftc.io/martys-bent/ Podcast https://tftc.io/podcasts/ Follow Odell: Nostr https://primal.net/odell Newsletter https://discreetlog.com/ Podcast https://citadeldispatch.com/
Artificial intelligence is transforming healthcare, but developing an AI medical device is only part of the challenge. Manufacturers must also navigate certification requirements and maintain safety and performance throughout the entire product lifecycle.

In two podcast episodes featuring Sandy Wright and Osman El-Koubani, we explore the journey from certifying LLM-driven medical devices to managing them after CE marking.

Certifying LLM-Driven Medical Devices

Large Language Models such as ChatGPT, Gemini, and Claude introduce new regulatory challenges. Unlike traditional software, these systems raise questions around predictability, validation, traceability, supplier management, and model updates.

Topics discussed include:
* What defines an LLM-driven medical device
* Clinical evaluation strategies
* Demonstrating clinical benefit
* Using commercial AI models
* Supplier controls and external dependencies
* Significant changes and model updates

Life After CE Marking

Obtaining CE certification is not the end of the journey. AI medical devices require continuous monitoring once they reach the market.

Manufacturers must address:
* Performance drift in real-world settings
* Collection and analysis of real-world data
* AI retraining and change management
* Predetermined Change Control Plans (PCCPs)
* Post-Market Surveillance (PMS)
* Continuous safety and performance evaluation

AI Devices Require a Lifecycle Approach

AI systems are dynamic technologies. Success depends not only on achieving certification, but also on maintaining control over performance, updates, and clinical safety throughout the product lifecycle.

As regulations continue to evolve, manufacturers must combine robust development practices with proactive post-market monitoring to ensure long-term compliance and patient safety.

Who is Monir El Azzouzi?

Monir El Azzouzi is the founder and CEO of Easy Medical Device, a consulting firm that supports medical device manufacturers worldwide with quality and regulatory affairs activities. Monir can help you create your Quality Management System or Technical Documentation, and he can also handle your Clinical Evaluation or Clinical Investigation through his team or partners. Easy Medical Device can also become your Authorized Representative and Independent Importer Service provider for the EU, UK, and Switzerland. Monir has around 16 years of experience in the medical device industry, working for small businesses as well as large corporations. He has now supported around 100 clients in remaining compliant on the market. His passion for the medical device field pushed him to create educational content such as a blog, a podcast, YouTube videos, and LinkedIn Lives, where he invites guests to share educational information with his audience. Visit easymedicaldevice.com to learn more.

If you need help implementing QMSR or preparing your teams for FDA inspections, contact: info@easymedicaldevice.com

If you are located outside the EU/UK/Switzerland and need an Authorized Representative (and possibly an Importer), we can support you as well.

Links
Sandy LinkedIn: https://www.linkedin.com/in/wrightsandy/
Osman LinkedIn: https://www.linkedin.com/in/osman-kan/
Scarlet LinkedIn: https://www.linkedin.com/company/scarlet-comply/posts/?feedView=all&viewAsMember=true

Social Media to follow
Monir El Azzouzi LinkedIn: https://linkedin.com/in/melazzouzi
Twitter: https://twitter.com/elazzouzim
Pinterest: https://www.pinterest.com/easymedicaldevice
Instagram: https://www.instagram.com/easymedicaldevice

This podcast is hosted by Podcastics, the easiest platform to create and publish your podcast.
PHP Podcast – June 11, 2026 Guest Hosts: Sara Golemon, Elizabeth Barron & Holly Schilling Eric and John are out this week — Sara, Elizabeth, and Holly take over. Here’s what they covered: PHPVerse Recap PHPVerse just wrapped up, and Elizabeth was there in Amsterdam. The format is unusual — all speakers are flown to one location, but the audience is entirely virtual. It was a class act: professional TV crew, studio lighting, and a makeup and hair team on site. Around 2,500–3,000 people watched the live stream. Everything was broadcast as one long block; individual talk segments and possibly the documentary trailer will be cut and released separately. The full stream is available now — the PHP documentary trailer (produced by Jet Breeze, covering 30+ years of PHP history) appears around the 2:24:30 mark. PHP Foundation 2026 Strategy Document Elizabeth and the PHP Foundation released their 2026 strategy document the same day as this recording. The foundation gathered community input across numerous conversations and conferences, synthesized it into findings, and has now published a plan for the rest of the year. Key themes: repositioning PHP’s public perception (which Elizabeth calls a solvable problem), creating six special interest groups, and launching an Onboarding Initiative to build a real on-ramp for new PHP developers. Elizabeth’s view is that the two things giving her the most hope for PHP’s future are the passion and expertise of the community, and how good the language itself has gotten. Visit thephp.foundation to read the full document. The Onboarding Initiative One of the six special interest groups the foundation is launching is specifically focused on bringing new developers into PHP. Goals include creating a true learning path (not just a reference manual that assumes existing knowledge), improving educational resources, and potentially working with the php.net website to improve the first-time experience. 
Holly made the point that PHP’s barrier to entry is genuinely lower than almost any other language — the Hello World program is 11 characters — but that story isn’t being told outside the PHP bubble. New developers are turning to JavaScript as a first language and running into minified spaghetti instead of something approachable. AI Writing PHP — And PHP as a Second Language Holly built the entire PHP Tek conference app backend in Laravel without writing a single line of code herself — AI-generated throughout, which she reviewed and approved. The code held up to peer review at the conference with only minor style nits. She ran it on PHP 8.3 and used modern standards throughout (one piece of feedback: stop using empty()). The consensus: AI models write good modern PHP because of the vast amount of open source PHP they were trained on. The caveat Sara raised is worth thinking about — how much of that training data is PHP 4-era code and WordPress 3 repositories? Either way, Holly’s case for PHP as a second language is strong: low ceremony, low boilerplate, readable syntax, and it’s a language where you can do something useful in minutes. PHP’s Reputation Problem (and Why It’s Fixable) The group dug into PHP’s perception gap — the mismatch between how good the language actually is and how it’s perceived outside the community. Holly’s experience as a mobile developer who recommends PHP to others: the pushback is immediate (“isn’t that slow?”, “isn’t that dead?”). The benchmarks don’t support that reputation — PHP outperforms Python on most comparable workloads — but data alone doesn’t shift perception. Elizabeth’s point is that this is primarily a storytelling and coordination problem, not a language problem, and that the foundation’s repositioning work is exactly aimed at closing that gap. The community has the passion. It just needs to tell the story outside its own bubble. 
PHP Polling API RFC Sara walked through the RFC for a new Polling API in PHP (wiki.php.net/rfc/poll_API). The short version: PHP currently has five or six different ways to do I/O multiplexing (watching multiple streams and acting on whichever one is ready first), and which one works depends on the OS, available extensions, and PHP version. The Polling API proposal creates a single, unified interface that abstracts all of that. The immediate beneficiaries are async frameworks like Amp PHP, ReactPHP, and Revolt, which currently have to maintain multiple backend implementations to cover different environments. The bigger picture: this is a building block on the path toward true async PHP, likely contributing to something more complete in PHP 9.0. Most app developers won’t use it directly — but the libraries they depend on will. RFCs are all listed at wiki.php.net/rfc. PHP.net: Do As We Say, Not As We Do Sara, who has contributed to php.net, copped to the state of the codebase: some of it dates to the PHP 3 era, there are functions.inc files, and it is very much “do as we say, not as we do.” The historical reason is that php.net used to rely on community-administered mirrors (r-synced servers running everything from PHP 5.1 to 5.6 simultaneously), so modernizing the code was impossible without controlling the runtime. That’s changed with CDN-based load balancing — they can now control what PHP version runs on php.net — and the code has been getting better. But it’s a slow process. PHP Podcasts Past, Present, and Future Holly asked about the PHP Town Hall podcast (Ben Edmonds and Phil Sturgeon), and the group did a quick tour of PHP podcast history. The PHP Roundtable — originally started by Sammy, taken over by Eric — has produced about three episodes. Sara and producer Joe are planning to take it off Eric’s hands and actually do it properly. 
And Elizabeth announced that the PHP Foundation is launching a new podcast: tentatively called PHP at Scale, hosted by Ben Marx, focused on telling the stories of organizations pushing PHP to its limits. No launch date yet, but there’s already a queue of interested guests. Next Week’s Show — Moved to Wednesday Sara will be on a boat off the coast of Galicia on Thursday, so next week’s episode is moving to Wednesday. Guests will include Paul Reinheimer and (hopefully) Sean Coase — two veterans from PHP’s podcasting past. Elizabeth is going to try to make it work around the Canadian Grand Prix. Mac Mini M4 for Local LLMs Holly picked up a refurbished Mac Mini M4 (16GB RAM, 512GB storage) specifically to run LLM models locally via Ollama. Apple Silicon is a solid choice for this because the unified memory architecture gives the neural cores access to far more RAM than a discrete GPU setup. Sara is waiting for the M5, which is reportedly not coming until fall — and is already resigned to spending too much on it when it lands. Links from the show: PHP Foundation — 2026 Strategy Document PHP RFC: Polling API PHP RFC Wiki — All RFCs Under Discussion Amp PHP — Async framework ReactPHP — Event-driven async PHP Revolt — Event loop for PHP php.net website source code (github.com/php/web-php) PHP Architect Discord Guest Hosts: Sara Golemon Based in Lisbon, Portugal PHP core contributor; code contributor via the Curl project (which means she technically has code on Mars) Elizabeth Barron Executive Director, PHP Foundation Based in Germany Holly Schilling Primary mobile developer; built the PHP Tek 2026 conference app Based near Chicago, IL Streams: Youtube Channel Twitch Connect & Hire PHP Architect Website Twitter/X Mastodon Hire PHP Developers Looking to hire PHP developers? Email support@phparch.com – Joe and the team are available for consulting, infrastructure work, Ansible playbooks, and code review. 
Partner This podcast is made a little better thanks to our partners Displace Infrastructure Management, Simplified Automate Kubernetes deployments across any cloud provider or bare metal with a single command. Deploy, manage, and scale your infrastructure with ease. https://displace.tech/ PHPScore Put Your Technical Debt on Autopay with PHPScore Music Provided by Epidemic Sound https://www.epidemicsound.com/ Join Us Live Next Week Note: Next week’s show is on Wednesday (not Thursday) with guests Paul Reinheimer and Sean Coase. Youtube Channel Got feedback? Join us on Discord at discord.phparch.com The post The PHP Podcast 2026.06.11 appeared first on PHP Architect.
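The I/O multiplexing described in the Polling API segment of this episode (watching several streams and acting on whichever is ready first) can be illustrated outside PHP. The sketch below is a Python analogy, not the RFC's API: Python's standard `selectors` module plays a comparable unifying role over select, epoll, and kqueue. The in-process socket pairs and the `first_ready` and `demo` helper names are assumptions made for the illustration.

```python
import selectors
import socket


def first_ready(sources, timeout=1.0):
    """Return (label, data) pairs for every source readable within the timeout.

    `sources` maps a label to a readable socket. DefaultSelector picks the
    best backend the OS offers (epoll, kqueue, select, ...), so the calling
    code stays the same on every platform.
    """
    results = []
    with selectors.DefaultSelector() as sel:
        for label, sock in sources.items():
            sel.register(sock, selectors.EVENT_READ, data=label)
        for key, _events in sel.select(timeout=timeout):
            results.append((key.data, key.fileobj.recv(1024)))
    return results


def demo():
    # Two in-process socket pairs stand in for two network streams.
    a_read, a_write = socket.socketpair()
    b_read, b_write = socket.socketpair()
    try:
        b_write.sendall(b"hello from b")  # only stream "b" has data waiting
        return first_ready({"a": a_read, "b": b_read})
    finally:
        for s in (a_read, a_write, b_read, b_write):
            s.close()


if __name__ == "__main__":
    print(demo())
```

The calling code never names a specific backend, which is the property the episode attributes to the proposed PHP interface: libraries write against one API and the runtime chooses the mechanism.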
-Anthropic is walking back a policy that discreetly hamstrung researchers using its new Claude Fable 5 LLM to create competing AI models. -Bluesky said that communities will be smaller spaces inside the one big space that Bluesky provides, where you can find and talk to people who are interested in the same topics you are. -Deezer made its AI-detection tool available to other streaming companies in an effort to stem the rise of AI slop and fraudulent streams.
The skills that survive every industry shakeup aren't the ones you can Google — they're softer, harder to name, and far more durable. In this episode, Jonathan explores principle-oriented thinking: the practice of stripping away the labels we attach to tools, roles, and even ourselves to see what something actually does at its core. It's the difference between handing your coding off to an agent and rethinking your entire workflow around what these new materials are truly capable of. If you've been following along with our recent focus on durable skills, you know we've been hunting for the abilities that translate beyond this month, this year, or whatever AI does to our industry next. Today's skill doesn't have a tidy name you can search for — it's softer than that. Jonathan calls it "principle-oriented thinking": the habit of deconstructing the labels we put on things to understand their core components, properties, and capabilities. It's how NASA engineers turned a sock into a water filter on Apollo 13, and it's how forward-thinking engineers are reframing what AI can actually do rather than jamming it into a predetermined slot. Labels Are Useful Shortcuts — Until They Aren't: Every label, from "software engineer" to "sock," carries baggage, heuristics, and presupposition. That's not a flaw — labels are how we move through the world quickly. But when a label is the only lens you have, it quietly caps how much value you can get out of the thing you're looking at. The Apollo 13 Sock: When the crew needed to fix a life-threatening problem with mismatched parts, the engineers on the ground had to forget what a sock was for and ask what it actually is — a piece of cloth with tensile strength, flexibility, and filtering properties. Strip the assumption that it goes on a foot, and a whole new set of uses opens up. 
Stop Slotting AI Into Old Roles: The common move is to take one responsibility — coding, debugging, refactoring — hand it to an agent, and keep everything else the same. That works, but it's low-leverage. The more powerful approach starts by asking what the agent is fundamentally capable of, then rebuilding the workflow around those raw materials. See Things as Materials, Not Fixed Functions: When you deconstruct out from under a label, tools and concepts start to look like craftable raw materials. You can then combine them in new, valuable ways they haven't been combined before — alloying old methods with new capabilities to create properties neither had on its own. Reason From Properties, Not Personas: Ask what the actual properties of an LLM are. Non-determinism isn't a bug to apologize for — it's a property you can exploit. The existence of many different models is a property too, which is exactly what makes adversarial review possible. That's principle-oriented thinking applied to agents. Extend the Latticework: Charlie Munger talked about a latticework of mental models that weave together rather than sit in isolation. The durable skill isn't quarantining your concept of "AI" off to the side — it's grafting a new section onto the existing tapestry and letting it reshape everything you already understood. Episode Takeaway: Look at how you spend your time and ask new questions of it. What is the material here? What kind of thinking does the agent actually do? What can a human do that an LLM can't — and the other way around? That's how you avoid believing a sock is only ever good for a foot.
I've been seeing a recurring pattern with companies selling APIs, MCPs, data feeds, and other developer-focused AI products. While the technology is often sound if not impressive, sales momentum sometimes slows when prospects have to imagine how the product will create value in their own environment. My perspective on this is that the flexibility that makes these tools powerful can also make them harder to evaluate. Flexibility can adversely increase the Invisible Intelligence Gap, and I think certain types of AI-based solutions (LLM) may actually increase this because the boundaries of the product are often so much wider than ever before (if not invisible to the buyer). So, how to close this gap? Well, one way is to build a visual UI that showcases what's possible with your API/feed/data solution. You take the buyer out of the conceptual space and make things concrete. So today, that's what we dig into: when to consider adding a UI, how far you need to go with it, how you can use Copilot/AI agents to help customize these example implementations, and the benefits you might see. Highlights / Skip to: The challenges of selling API-based analytics and AI products (0:56) Why this topic matters right now (2:48) The Invisible Intelligence Gap that may be slowing your sales (3:34) Strategies for bridging the Invisible Intelligence Gap with a UI (user interface) layer (7:01) Client case study: the impact and results you may see adding a UI on top of your technical product (14:05) Signs that you should consider adding UI to your technical product (18:23) Leveraging humans' highly developed visual system to help potential customers see the full value of your product (26:24) Conclusion (27:32) Links Invisible Intelligence Gap Azeem Azhar's Exponential View (6/4/26 episode)
Grizz Griswold (Executive Producer of Global Programs & Content at FINOS) kicks off Season 6 of the Open Source in Finance Podcast with an absolute masterclass preview of OSFF London 2026. Discover how the global financial industry is shifting its focus from basic LLM experimentation to production-grade agentic safety, deterministic workflows, and cross-hyperscaler cloud controls.
Joe Beda and Craig McLuckie co-created Kubernetes, the infrastructure standard that became the default for cloud native computing. Now running Stacklok, they're watching enterprises hit the same identity, permissions, and security problems with AI agents that took the container ecosystem years to resolve, and they're building tools to compress that timeline. In this episode of Founded & Funded, Madrona's Tim Porter sits down with Joe and Craig to talk through what AI adoption actually requires: why MCP is the Docker moment for AI-native applications, how the LLM gateway is becoming a strategic chokepoint for cost, safety, and model flexibility, and why enterprises that don't get the architecture right early will face a familiar trap: vertical integration that looks like productivity and acts like lock-in. They cover: Why the developer workflow is the template for knowledge worker AI adoption, and where the analogy breaks down The mainframe vs. open platform question that will define the AI infrastructure era Why the knowledge worker transition is harder than it looks — and what has to be built differently before developer-grade AI tooling can scale to the rest of your organization The governance gap between human accountability and AI behavior, and what enterprises actually need to build to close it Where to start: MCP controls first, LLM gateway second, and why deploying a platform without staying to close the loop consistently fails Transcript: https://www.madrona.com/the-best-infrastructure-moment-since-cloud Chapters: (0:00) – Introduction (1:04) – Why the Kubernetes Creators Are the Right People to Read This AI Moment (2:18) – Joe's Lesson from Cloud Native: Ignore Conventional Wisdom, Except When You Shouldn't (4:16) – Craig on Enterprises and the Chaos of a New Infrastructure Era (5:32) – Why Joe Rejoined Craig at Stacklok: The Engineer's Case for Getting Your Hands Dirty (7:05) – Developers as Agent Orchestrators: How the Knowledge Worker Transition Will 
Follow (10:10) – MCP Explained: Craig Sees Docker in 2013 When He Looks at the MCP Spec (17:53) – The Mainframe vs. Open Platform Question That Will Define the AI Era (20:24) – LLM Lock-In Is the Wrong Worry: The Real Risk Is Left of the Model (25:19) – Where Enterprises Actually Start: Developer Posture First, Knowledge Workers Second (29:10) – MCP First, LLM Gateway Second: The Concrete Technical Starting Point (31:19) – How Stacklok Builds Software Now: Agents, Smaller Teams, the Unrecognizable Developer Profile (38:07) – The Recruiter Who Started Building Agents: What AI Tools Do to Role Boundaries
Join Itai Gafni, Co-Founder and CEO of Huskeys, for an unvarnished evaluation of why web application firewalls (WAF) have remained functionally stuck in the 1990s. While modern application traffic has evolved from human browsers to a complex matrix of APIs, automated microservices, and autonomous AI agents, legacy WAF solutions still rely on brittle, static rule sets. An alumnus of Israel's elite Unit 8200 where he engineered advanced intelligence and cyber platforms, Itai is leading a massive paradigm shift. In this episode, we discover why security teams are terrified of updating their firewall rules—and how introducing an agentic control plane allows enterprises to optimize threat detection without breaking production or driving away legitimate customer revenue.
In this episode we discuss the shifting mood in the AI industry. Companies talk less and less about fully replacing people with AI and increasingly position it as a tool for boosting productivity. We talk about NVIDIA RTX Spark and the growing interest in local inference, the problems with using tokens as a measure of efficiency, the real cost of adopting AI for businesses, and a possible cooling of the market. We also look at claims that bots dominate web traffic, problems with AI support at Meta, news from Google, and Yann LeCun's JEPA concept as one potential alternative to today's LLMs. 00:40 - overview of GTC and new NVIDIA products 04:00 - comparing NVIDIA Spark with Mac Studio 11:57 - trends in AI and local inference 18:29 - how component prices affect the technology market 22:12 - prospects for local inference and progress in large models 28:45 - court cases and copyright in AI 32:29 - bots outnumber humans in network traffic 35:34 - Google's presentation and new assistant technologies 39:19 - buying codes and new opportunities for developers 42:53 - architecture and new concepts in ML 48:09 - abstraction in modeling and prediction 52:11 - the potential of LLMs and using them properly 56:13 - adapting to new realities in programming 01:02:25 - the cult of supermen in IT
Most mortgage professionals have no idea how many critical housing decisions are being made behind closed doors in Washington, D.C.

While most of us are focused on rates, inventory, and closing loans, there are people working every day on Capitol Hill to influence the policies that will shape the future of homeownership, housing affordability, and mortgage lending.

One of those people is Justin Wiseman.

As the Mortgage Bankers Association's leading voice in Washington, Justin is in the room where many of these conversations happen. In this episode of Laugh, Lend & Eat, we discuss:
• Why housing affordability has become a national political issue
• What Congress is doing about housing supply
• The Road to Housing Act and what it could mean for consumers
• The regulatory issues mortgage professionals should be watching
• How policy decisions made in D.C. ultimately impact borrowers and lenders across America

Whether you're a mortgage professional, Realtor, industry executive, or simply someone who cares about the future of housing, this conversation provides an inside look at what's happening in our nation's capital.

Who's fighting for you in Washington, D.C.? Justin Wiseman. That's who.
95% of senior marketing leaders are already using or planning to use synthetic data within 12 months. So why are so many marketers still on the fence?

In this episode, Elena, Angela, and Rob talk with Peter Weinberg, co-founder of Evidenza and former head of research at LinkedIn's B2B Institute. They discuss where to start with synthetic audiences, how to assess accuracy, and why brand building still matters as AI changes how people search and decide.

Topics covered:
• [00:00] Introductions and what synthetic research actually is
• [03:00] Why 95% of marketing leaders plan to use synthetic data within 12 months
• [05:00] Synthetic research really replaces ignorance, not traditional surveys
• [09:00] How to evaluate accuracy in synthetic research tools
• [10:30] Where marketers should start: find the white spaces first
• [16:00] Why AI can be creative and what 'temperature' means for marketers
• [24:00] Why brand still matters in an AI-driven search world
• [28:00] How Evidenza applied Ehrenberg-Bass principles to build their own brand
• [34:00] Why more real-time data can lead to worse decisions

To learn more, visit marketingarchitects.com/podcast or subscribe to our newsletter at marketingarchitects.com/newsletter.

Resources:
2025 Qualtrics Article: https://www.qualtrics.com/articles/strategy-research/synthetic-research-breakthrough/
Peter's LinkedIn: https://www.linkedin.com/in/weinbergpeter/

Get more research-backed marketing strategies by subscribing to The Marketing Architects on Apple Podcasts, Spotify, or wherever you listen to podcasts.
This ASN Kidney Translation Podcast episode covers advances in artificial intelligence: ASN Workgroup on AI (JASN), AI in nephrology education (CJASN), and ethical issues in AI (K360).
Your iPhone might be running hot and draining fast — and it’s not just you. Dave and Pilot Pete break down the battery chaos introduced by iOS 26.5, which brought overheating, accelerated drain, and even blocked wired charging on iPhone 17 and Air models. The fix that’s working for most people: disable iCloud Keychain first, run Reset All Settings, then carefully re-enable iCloud sync — otherwise you’ll nuke your Wi-Fi passwords across every device. iOS 26.5.1 is out and should help, but until you’ve updated, your electrons deserve better. You’ll also learn why Apple ID passkeys are locked to Apple’s own keychain with no known path to third-party managers like 1Password or Keeper, and why editing a contact on a modern Mac can somehow peg every CPU core — in 2026, no less. From there, Dave and Pete tackle the full listener mailbag: how to rescue missing contact names from Messages, the right way to boot a MacBook with a broken display into clamshell mode so it actually uses the external monitor, and a deep dive on 5K vs. 4K displays where Dave argues your eyes may not care as much as the pixel-per-inch math suggests. You’ll get smart ideas for repurposing a 2015 iPad Pro that can’t run modern apps — including Dave’s Claude Code-built weather dashboard running off a headless iMac as a web interface. A crashing 2021 MacBook Pro turns out to have been felled by a single bad SD card, and the lesson is golden: feed your crash reports to an LLM and let it do the digging. And Don’t Get Caught with outdated OpenAI macOS apps — update ChatGPT, Codex, Atlas, and Codex CLI before June 12th to stay ahead of a code-signing rotation triggered by a compromised open-source library. 00:00:00 Mac Geek Gab 1145 for Monday, June 8th, 2026 June 8th: National Best Friends Day MGG Monthly Giveaway – Win a license to SaneBox Quick Tips 00:00:01 Dan-QT-Multi-select on iPhone with a quick drag 00:04:31 Tim-QT-Have iOS 26.5 Battery Drain? Reset All Settings, but be careful! 
00:13:32 Kent-QT-1144-Collapse stacks by clicking the down-facing caret in the menu 00:14:15 Mark-QT-Match Frame Rate on your Apple TV for smoother experiences 00:17:58 What are the differences between refresh rates and frame rates and…why? 00:21:09 KiwiGraham-QT-Apple Account Passkeys vs. Third Party Password Apps Sponsors 00:23:09 SPONSOR: Keeper. Right now, Keeper is offering our listeners 60% off personal and family plans at https://Keepersecurity.com/MGG. This offer is only for podcast listeners! 00:24:50 SPONSOR: Helix Sleep makes premium mattresses and bedding that are customized to fit your personal needs, and conveniently shipped to your door. Go to https://helixsleep.com/MGG for 20% Off Sitewide. 00:26:23 SPONSOR: NordLayer Browser. The business browser built for how modern work actually happens, giving IT the visibility and control to secure SaaS, stop phishing, and prevent data leaks right at the source. Your Questions Answered and Tips Shared! 00:28:09 VaShaun-How can I restore lost Contacts on my Mac? 00:37:36 Si-What to do with an 11-year-old iPad? Claude Code 00:46:40 Michael-Why do we have to pull-to-refresh for updates? 00:50:04 Blake-1144-Damaged displays, external monitors, and MonitorControl 00:55:48 Joe & Michael-CSF-1144–RetinaDesk.com for reviews of 5K and 6K monitors BenQ MA270UP 27” 4K Display Reviews 01:02:50 Hog fan and Cowboy fan-MGG Review–Favorite Tech podcast Don't Get Caught 01:04:14 Father John-DGC-Investigate those crash reports before you replace your Mac 01:09:26 Update your ChatGPT Apps ChatGPT Desktop Codex App Codex CLI Atlas 01:11:06 Andy-DGC-When Troubleshooting, Don’t Get Caught asking the wrong questions or assuming the wrong facts 01:19:36 MGG 1145 Outtro MGG Monthly Giveaway Bandwidth Provided by CacheFly Pilot Pete's Aviation Podcast: So There I Was (for Aviation Enthusiasts) The Debut Film Podcast – Adam's new podcast! 
Dave's Business Brain (for Entrepreneurs) and Gig Gab (for Working Musicians) Podcasts MGG Merch is Available! Mac Geek Gab iOS app Mac Geek Gab YouTube Page Mac Geek Gab Live Calendar This Week's MGG Premium Contributors MGG Apple Podcasts Reviews feedback@macgeekgab.com 224-888-GEEK Active MGG Sponsors and Coupon Codes List BackBeat Media Podcast Network
In this episode we chat with Rebecca for the second time about: intelligence isn't everything, are LLMs even safe? AI governance and guardrails, moral assurance, under-specification problems, lack of interdisciplinary work in robotics, AI shouldn't be sold as a solution to everything, sidelining of AI Ethics, what are the actual benefits of AI? will AI progress widen inequality? and more...
What if your AI coding agent is quietly cheating on your tests, and how do you stop it? Julien Verlaguet, who built the type system Meta used to migrate tens of millions of PHP lines, is now building Skipper: a closed-loop coding agent designed to make AI-generated code verifiably correct, without human intervention.
In this episode, Julien Verlaguet, creator of the Hack programming language at Meta and co-founder of SkipLabs, explains why AI agents will always try to cheat: gaming tests, quietly modifying logic while doing something else, and declaring work done when it isn't. He draws on his experience migrating Meta's PHP codebase to a statically typed system, drawing sharp parallels between convincing engineers to trust a new type checker and building systems that can trust an LLM. Julien makes the case for spec-driven development with validation layers at every step, where separate AI instances verify correctness and the code-writing agent is locked out of touching tests.
He shares the story of an LLM that silently swapped a union for an intersection while splitting a file, a subtle bug that passed all tests, and why no human would ever have made that mistake. 
He then walks through how Skipper works: you write a spec, hand over control, and a compiler-like agent produces correct, runnable TypeScript without back-and-forth, backed by a sound incremental type system, reachability analysis, and a reactive runtime that applies diffs in milliseconds.
He closes with a grounded take on how the developer role is shifting, not disappearing, toward the kind of design, integration, and oversight work that always mattered most.
Key topics discussed:
Why AI agents will always try to cheat on your tests
The union-vs-intersection bug an LLM introduced silently
Spec-driven development to keep LLMs on track
How to separate the AI that verifies from the one that fixes
Skipper: a compiler-like closed-loop coding agent
Sound, incremental TypeScript built for AI-speed iteration
Hot-reloading state without restarting, in milliseconds
Why developers are all becoming tech leads
Timestamps:
(00:00:00) Trailer & Intro
(00:02:34) How Did Julien Create the Hack Programming Language at Facebook?
(00:05:53) Does Static Typing Make Your Code More Secure?
(00:09:54) How Did You Convince Facebook Engineers to Adopt Hack at Scale?
(00:17:15) How Can Engineers Overcome Skepticism Toward AI Coding Tools?
(00:22:44) Should Junior Engineers Trust AI-Generated Code?
(00:29:44) How Do You Build Reliable Guardrails for LLM-Generated Code?
(00:42:15) What Validation Strategies Prevent AI Agents From Cheating on Tests?
(00:45:54) What Is Skipper and How Does a Closed-Loop Coding Agent Work?
(00:54:59) How Does Skipper Compare to Claude Code in Terms of Correctness?
(00:58:27) How Do You Get Started With Skipper and What Does the Output Look Like?
(01:04:50) How Will the Software Developer Role Change in an AI-First World?
(01:09:06) 3 Tech Lead Wisdom
_____
Julien Verlaguet's Bio
Julien Verlaguet is a programming language designer and the Founder and CEO of SkipLabs. 
He is best known as the creator of Hack, the gradually typed language he built at Facebook that currently powers over 100 million lines of the company's production code. After creating the open-source reactive framework Skip, Julien founded SkipLabs in 2022. His company recently launched Skipper, a closed-loop coding agent that takes a single prompt from a developer and returns a running, validated service.
Follow Julien:
LinkedIn – linkedin.com/in/julien-verlaguet-b5710a20
X – x.com/JulienVerlaguet
SkipLabs - skiplabs.io
Skipper - skipperai.dev
Skipper's Discord – discord.gg/bsnXyw2F9P
Like this episode?
Show notes & transcript: techleadjournal.dev/episodes/260.
Follow @techleadjournal on LinkedIn, Twitter, and Instagram.
Buy me a coffee or become a patron.
Are AI agents and LLMs coming for your job? In this episode of the BRAVE Southeast Asia Tech Podcast, Jeremy Au sits down with Ori Sasson to uncover the harsh realities of AI job replacement, the "Hollywoodization" of the workforce, and the explosion of 10x productivity in startups. Discover how employers in Singapore and across Southeast Asia are redesigning roles, navigating "shadow AI", and leveraging government policies to stay competitive. Whether you are a tech founder, venture capitalist, or operator in Singapore, Indonesia, Vietnam, the Philippines, Thailand, or Malaysia, this conversation is your blueprint for surviving and thriving in the new AI economy. We break down the differences between traditional workflow outputs and AI native systems, explore why the product manager is becoming an "LLM wrapper", and discuss what policymakers are doing to bridge the skills gap. Watch, listen or read the full insight at https://www.bravesea.com/blog/ori-sasson-ai-work Get transcripts, startup resources & community discussions at https://www.bravesea.com WhatsApp: https://whatsapp.com/channel/0029VakR55X6BIElUEvkN02e TikTok: https://www.tiktok.com/@jeremyau Instagram: https://www.instagram.com/jeremyauz Twitter X : https://x.com/jeremyau LinkedIn: https://www.linkedin.com/company/bravesea English: Spotify | YouTube | Apple Podcasts Bahasa Indonesia: Spotify | YouTube | Apple Podcasts Chinese: Spotify | YouTube | Apple Podcasts #Singapore #AItech #Podcast #southeastasia #techpodcast 00:00 - The "Hollywoodization" of the Workplace 01:39 - Meet Ori Sasson: The Employer's Perspective on AI 02:22 - Blue Collar vs. White Collar: Which Jobs Are Disappearing? 
05:40 - Meta Layoffs, Motivation, and the "10x" Employee 08:50 - Overcoming the AI "Verification Tax" in Coding 11:15 - The "LLM Wrapper": Redesigning the Product Manager Role 15:40 - The "Hollywoodization" of Work Explained 19:05 - "Shadow AI" & Distributing Massive Productivity Gains 24:40 - Automated Side Hustles & The Junior Talent Crisis 29:10 - Y Combinator, AI-Native Law Firms, & Services Disruption 34:15 - Singapore's AI Policy, Budgets, and Global Comparisons 39:10 - A Crazy Idea: Free National AI Subscriptions? 43:50 - Conclusion & Key Takeaways
In this episode of the PRmoment Podcast, Ben Smith and Will Hart, CEO of PRmoment Leaders, engage in a timely debate surrounding the structural integration of public relations and artificial intelligence. Moving decisively past the initial "experimental" phase where practitioners simply played with basic prompts, the industry has rapidly arrived at a critical juncture. Today's leaders are forced to confront foundational organizational design questions, evolving agency structures, and entirely new talent profiles.
While Hart highlights the profound excitement of being able to fundamentally rethink traditional operational workflows, Smith offers a grounded counter-perspective: the core objective of public relations (using distinct channels to strategically influence audiences) remains fundamentally unchanged. However, the infrastructure utilized to achieve these goals is shifting dramatically. A primary catalyst is the democratisation of predictive analytics; a concept that was once a cost-prohibitive dream for marketers is now an accessible reality for modern PR targeting. Yet, this technological leap brings multifaceted risks. Agency leaders are navigating intense client pushback regarding intellectual property security, deepfakes, corporate reputation vulnerabilities, and looming sector-specific compliance regulations.
A significant portion of the dialogue focuses on agency workflows and the existential threat of automation. Smith warns against a reductive approach to AI, noting that if an agency's sole strategy is the integration of basic tools, it triggers a "race to the bottom" since everyone has access to the same software. True competitive advantage relies on human curiosity and the ability to navigate strategic ambiguity. This technical evolution directly challenges the traditional agency pyramid model. 
As AI automates the "grunt work," leaders must figure out how to train junior entry-level staff who historically relied on those repetitive tasks to learn the trade. Concurrently, in-house corporate communications roles are experiencing a major boardroom elevation, transforming CCOs into critical stakeholders guiding their broader enterprises through the AI revolution.
To master these urgent structural friction points, PR professionals should secure tickets to the upcoming AI in PR Masterclass (full agenda details at https://www.prmasterclasses.com/masterclass/pr-masterclasses-ai-in-pr/agenda).
Curated by Smith, this advanced, pitch-free session is not a "how to write prompts for ChatGPT" tutorial; it's a high-level strategic activation. The elite speaker lineup includes Sal Della Monica (MikeWorldwide) discussing how to prevent efficiency from diluting work effectiveness, Allison Spray (Burson) exposing AI implementation traps, and Andy Barr revealing critical research on which media titles influence LLM results.
Additionally, leading copyright lawyer Luke English will break down the legal landscape, Mike Robb (Boldspace) will showcase agent-based workflow redesigns, and Kat Arnull will delve into the power of market mix modeling. The day also features a powerhouse corporate panel with in-house communications directors from L&G, Tenable, Procore, and Verizon, wrapped up by Peter Heneghan (Albie) forecasting the ultimate redesign of future communications teams. Available both in-person and via virtual live stream, space is strictly limited.
Will Hart on the scale of the AI shift: "AI in PR got real very quickly. It's massively exciting though. How many times in your life in your working life do you get to be in a place where you can fundamentally rethink everything you do and how you do it."
Ben Smith on the hidden danger of over-automating: "You might run the most beautifully efficient PR business by integrating AI into your workflow. 
But if you're not very careful about the quality of your work, your level of insight may well decrease."
Ben Smith on why relying solely on tools backfires: "If your strategy is the integration of tools and agents in your business, it's a race to the bottom. Because everyone's basically got access to that."
Ben Smith on how predictive analytics solves PR's historical budget issue: "One of the things that has always had the handbrake on PR budgets is that unpredictability of outcome because there's so many other things going on... but AI has made predictive analytics accessible for a fraction of the historical cost. For PR that is going to change the game."
James Tunney, LLM, is an Irish barrister and author of The Mystery of the Trapped Light: Mystical Thoughts in the Dark Age of Scientism plus The Mystical Accord: Sutras to Suit Our Times, Lines for Spiritual Evolution; also TechBondAge: Slavery of the Human Spirit, Human Entrance to Transhumanism: Machine Merger and the End of Humanity, and AI-Govnerveance: Care and Possession in Dustopia. His most recent book is Trotsky vs Jesus: Battle of the AI-Millennium. His website is https://www.jamestunney.com/ James explores the concept of theological anthropology — the understanding of human nature derived from beliefs about God — and its implications for Christianity and the modern world. He discusses the significance of the incarnation of Jesus Christ, arguing that it affirms the dignity of the human body and offers a spiritual framework for understanding suffering, morality, and human purpose. Tunney also examines contemporary challenges such as transhumanism, artificial intelligence, and secularization, suggesting that traditional theological perspectives may provide insight into humanity's future. 00:00:00 Introduction 00:01:47 Defining theological anthropology 00:07:36 Humanity created in the image of God 00:15:36 The incarnation and the dignity of the human body 00:20:38 AI, transhumanism, and the future challenge to humanity 00:28:30 The civilizational significance of the event of Golgotha 00:35:00 Reason, theology, and moral philosophy 00:44:15 Secularization and the historical struggle over religion 00:58:38 Spiritual awakening, virtue, and the social role of religion 01:19:40 Conclusion New Thinking Allowed host, Jeffrey Mishlove, PhD, is author of The Roots of Consciousness, Psi Development Systems, and The PK Man. Between 1986 and 2002 he hosted and co-produced the original Thinking Allowed public television series. 
He is the recipient of the only doctoral diploma in “parapsychology” ever awarded by an accredited university (University of California, Berkeley, 1980). He is also the Grand Prize winner of the 2021 Bigelow Institute essay competition regarding the best evidence for survival of human consciousness after permanent bodily death. He is Co-Director of Parapsychology Education at the California Institute for Human Science. (Recorded on Thursday, March 5, 2026) For a complete, updated list with links to all of our videos, see https://newthinkingallowed.com/Listings.htm. Check out the New Thinking Allowed Foundation website at http://www.newthinkingallowed.org. There you will find our incredible, searchable database as well as opportunities to shop and to support our video productions – plus, this is where people can subscribe to our FREE, weekly Newsletter and can download a FREE .pdf copy of our quarterly magazine. To order high-quality, printed copies of our quarterly magazine: NTA-Magazine.MagCloud.com Check out New Thinking Allowed’s AI chatbot. You can create a free account at awakin.ai/open/jeffreymishlove. When you enter the space, you will see that our chatbot is one of several you can interact with. While it is still a work in progress, it has been trained on 1,600 NTA transcripts. It can provide intelligent answers about the contents of our interviews. It’s almost like having a conversation with Jeffrey Mishlove. If you would like to join our team of volunteers, helping to promote the New Thinking Allowed YouTube channel on social media, editing and translating videos, creating short video trailers based on our interviews, helping to upgrade our website, or contributing in other ways (we may not even have thought of), please send an email to friends@newthinkingallowed.com. 
To download and listen to audio versions of the New Thinking Allowed videos, please visit our new podcast at https://itunes.apple.com/us/podcast/new-thinking-allowed-audio-podcast/id1435178031. Download and read Jeffrey Mishlove’s Grand Prize essay in the Bigelow Institute competition, Beyond the Brain: The Survival of Human Consciousness After Permanent Bodily Death, go to https://www.bigelowinstitute.org/docs/1st.pdf. You can help support our video productions while enjoying a good book. To order a copy of New Thinking Allowed Dialogues: Is There Life After Death? click on https://amzn.to/3LzLA7Y (As an Amazon Associate we earn from qualifying purchases.) To order the second book in the New Thinking Allowed Dialogues series, Russell Targ: Ninety Years of ESP, Remote Viewing, and Timeless Awareness, go to https://amzn.to/4aw2iyr To order a copy of New Thinking Allowed Dialogues: UFOs and UAP – Are We Really Alone?, go to https://amzn.to/3Y0VOVh To order a copy of Charles T. Tart: Seventy Years of Exploring Consciousness and Parapsychology, go to https://amzn.to/4oOUJLn To order Trotsky vs Jesus: Battle of the AI-Millennium by James Tunney, go to https://amzn.to/46v9Ylb To order AI Govnerveance: Care and Possession in Dustopia by James Tunney, go to https://amzn.to/3ZUeC8D
In this episode, we dig into one of the biggest market questions hiding behind the hype around mega IPOs: what happens to passive index investors when companies like SpaceX, Anthropic, and OpenAI go public? We ask why the VIX and major indices like the S&P 500 and Nasdaq look calm, while single-name stocks like Tesla are showing much higher implied volatility, and why the spread between index volatility and individual stock volatility has reached extreme levels. Along the way, we break down the dispersion trade, implied versus realized volatility, and whether upcoming IPOs could force investors to rotate out of existing AI, tech, and “Elon trade” names to fund new allocations.
We also explore how changing index rules could reshape the market structure itself. Should a massive company like SpaceX be included quickly in the Nasdaq or S&P 500? How do float requirements, seasoning periods, profitability screens, and liquidity constraints affect ETF investors and passive funds that have to buy the underlying shares? We debate whether excluding these mega-cap IPOs would distort benchmarks, whether including them could create liquidity pressure, and how SpaceX, Anthropic, and OpenAI could change the relationship between passive investing, active stock picking, and index volatility.
Finally, we ask whether today's market setup is starting to echo the dot-com bubble, with bullish sentiment, a low put/call ratio, AI enthusiasm, and a wave of high-profile IPOs creating both opportunity and risk. Are investors buying call options like lottery tickets? Could the arrival of new public AI and space stocks drain capital from the Mag Seven, Tesla, software, and private markets? 
And as AI infrastructure companies become publicly investable, we question whether the real winners will be the foundational LLM providers, the tech giants, or the next generation of startups built on top of them.
Shop our Self Paced Courses:
Investment Banking & Private Equity Fundamentals
Fixed Income Sales & Trading
Subscribe to our Substack: https://substack.com/@thewallstreetskinny
Bots passed human web traffic for the first time, per Cloudflare's CEO. The S&P 500 rejected fast-entry for mega-cap IPOs like SpaceX. Anthropic embedded engineers at the NSA, Meta hid face-recognition code in its app, and Cambridge trialed the first AI-designed vaccine. Cloudflare CEO Matthew Prince says agentic traffic is "growing so fast that bots have now passed human traffic online for the first time" (Tom's Hardware) S&P Dow Jones rejects proposals to expedite S&P 500 eligibility for mega-cap IPOs such as SpaceX's; companies remain ineligible until one year after their IPOs (Bloomberg) Sources: Anthropic has embedded around half a dozen forward-deployed engineers within the NSA to help the agency deploy Mythos for offensive cyber operations (FT) Analysis: Meta discreetly added code for an unreleased "NameTag" face-recognition system for its AI smart glasses over multiple Meta AI app updates in 2026 (Wired) University of Cambridge researchers say they have developed the first vaccine with a key component entirely designed by AI and subsequently trialed it in humans (BBC) Longreads A preview of what to expect from WWDC on Monday, including iOS 27, a revamped Siri, macOS 27 Liquid Glass refinements, and more (Bloomberg) Ted Chiang argues LLM conversations are cleverly disguised sentence continuation, not consciousness, and that no intrinsic property of neural networks suggests otherwise (The Atlantic) Learn more about your ad choices. Visit megaphone.fm/adchoices
Voices of Search // A Search Engine Optimization (SEO) & Content Marketing Podcast
80% of sources cited by AI systems don't appear in Google's top results. Karl Kleinschmidt, founder at Data Marketing Group and 18-year SEO veteran, shares proven strategies for optimizing content for LLM visibility across enterprise-scale data systems. The discussion covers fan out analysis methodology for mapping user intent beyond traditional keywords, local SEO adaptation frameworks for AI-powered discovery, and custom tool development strategies for tracking LLM citations and performance data.
In this episode of Business Brain, we tackle a question every entrepreneur is starting to face: how do you tame your AI expenses before they tame you? It starts with a confession — burning through $15 worth of credits in a single morning by running everything on the priciest model — and opens into a bigger conversation about where the real costs hide. We talk about the difference between the static LLM on the back end and the front-end layer that actually shapes your bill, why one giant company torched $500 million in tokens in thirty days, and how much of that spend was pure duplication and waste that smarter tooling could have caught. The takeaway for your Business Brain is twofold. First, there’s a real opportunity in building smarter front ends — caching common answers, flagging redundant work, and routing to a local or lightweight model before reaching for the expensive cloud LLM. Second, and bigger, is the management problem: as your team grows, how do you give everyone AI power while keeping eyes on usage, protecting your data, and setting SOPs that prevent the same project from being built four times over? Most of us are still figuring it out, and that gap is exactly where the next charmed-life opportunity lives. 00:00:00 Business Brain – The Entrepreneurs' Podcast #759 for Casual FridAI, June 5, 2026 June 3rd: International Thank You Day 00:00:44 $15.22 in a morning – Paying $100/month for AI. I never thought… Claude vs. ChatGPT vs. Perplexity Perplexity Comet is the best AI Browser (even though Perplexity is the worst of the chatbots) Wispr Flow 00:09:23 SPONSOR: OneSkin. Born from over a decade of longevity research, OneSkin's OS-01 Peptide is proven to target the visible signs of aging, helping you unlock your healthiest skin now and as you age. Get 15% off OneSkin with the code BRAIN at https://www.oneskin.co/BRAIN #oneskinpod #ad 00:11:22 SPONSOR: Bitdefender. Keep your small business safe with Bitdefender Ultimate Small Business Security. 
Save 30% when you go to https://bitdefender.com/BRAIN 00:12:50 Smarter front ends for AI Customized corporate buffers Run a local LLM to manage your company's use of the (more powerful) cloud LLM A company opened up token usage and spent $500 million This episode's big takeaway: How does your organization manage AI use? Business Brain 759 Outtro Check out Business Brain Blueprints Tell Your Friends! Business Blueprints Review Business Brain Subscribe to the show feedback@businessbrain.show Call/Text: (567) 274-6977 X/Twitter: @ShannonJean & @DaveHamilton, & @BizBrainShow LinkedIn: Shannon Jean, Dave Hamilton, & Business Brain Facebook: Dave Hamilton, Shannon Jean, & Business Brain The post FridAI – Taming your AI Expenses – Business Brain 759 appeared first on Business Brain - The Entrepreneurs' Podcast.
In this talk, Nikita, Senior Applied Data Scientist at the AWS Generative AI Innovation Center, shares his expertise in bringing enterprise artificial intelligence out of the sandbox, from his early days optimizing traditional machine learning models like gradient boosting to deploying advanced production-grade GenAI pipelines. We explore what it really takes to move generative AI systems from pilot prototypes to production environments.
Links:
- AWS Generative AI Innovation Center: https://aws.amazon.com/ai/generative-ai/innovation-center/
You'll learn about:
- Deploying multi-layered defenses independent of backend LLMs.
- Evaluating parameter-efficient methods like LoRA and QLoRA for small models.
- Balancing long-term domain expertise with real-time documentation retrieval.
- Utilizing multi-agent orchestration for search and anomaly explanation.
- Setting up robust LLM-as-a-judge frameworks verified by human metrics.
- Leveraging Amazon Bedrock components for memory and runtime scalability.
TIMECODES:
05:52 Shifting from traditional ML to generative AI
07:49 Hybrid pipelines blending classical ML and LLMs
11:25 Production guardrails and multi-layered system defense
16:15 Prompt bypasses, input attacks, and AI red teaming
20:49 Newsletter localization and translation with Zalando
27:24 Evaluation frameworks and human-in-the-loop metrics
33:07 Aligning LLM-as-a-judge with few-shot prompts
34:49 Fine-tuning small language models versus prompting
41:18 Complementary mechanics of RAG and fine-tuning
43:00 Agentic web search tools for anomaly explanation
47:01 Automated text generation from real-time sports sensors
49:58 AWS project scoping and proof of concept timelines
54:58 Interview requirements and career skills for AWS roles
57:59 Enterprise architecture patterns and system observability
01:00:42 Reusable infrastructure blocks on Amazon Bedrock
This session is designed for machine learning engineers, data scientists, and technical product managers looking to architect reliable, 
production-ready GenAI workflows. It is highly valuable for teams aiming to bridge the gap between experimental AI prototypes and secure enterprise software.
Connect with DataTalks.Club:
- Join the community - https://datatalks.club/slack.html
- Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/r?cid=ZjhxaWRqbnEwamhzY3A4ODA5azFlZ2hzNjBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
- Check other upcoming events - https://lu.ma/dtc-events
- GitHub: https://github.com/DataTalksClub
- LinkedIn - https://www.linkedin.com/company/datatalks-club/
- Twitter - https://twitter.com/DataTalksClub
- Website - https://datatalks.club/
Connect with Nikita
- Linkedin - https://www.linkedin.com/in/kozodoi/
- Github - https://github.com/kozodoi
- Website and blog - https://www.kozodoi.me/
AI hype has bled deep into the nuclear sector, and in this episode, Chris Keefer sits down with returning guest David Helmer, an engineer and AI advisory consultant with a decade advising the US government on machine learning and autonomous systems, to examine what the technology can actually do, who benefits from inflating those claims, and what a correction would mean for nuclear's investment story.
The conversation covers the ELIZA effect and why human brains are hardwired to anthropomorphize language models; the structural gap between frontier lab costs and revenues; why hallucination and reliability problems are embedded in LLM architecture rather than solvable through scaling; and why the AGI narrative functions primarily as a justification for otherwise unjustifiable capital concentration. For nuclear advocates, the question is not whether AI demand is real today, but whether the speculative reactor developers pricing in hyperscaler contracts will still have a business if the AI bubble deflates.
Listen to Decouple on:
• Spotify: https://open.spotify.com/show/6PNr3ml8nEQotWWavE9kQz
• Apple Podcasts: https://podcasts.apple.com/us/podcast/decouple/id1516526694?uo=4
• Overcast: https://overcast.fm/itunes1516526694/decouple
• Pocket Casts: https://pca.st/ehbfrn44
• RSS: https://anchor.fm/s/23775178/podcast/rss
Website: https://www.decouple.media
Sacha Greif, creator of the State of Web Dev AI survey, joins the podcast to walk through what over 7,000 web developers actually reported about their AI tool usage, code quality concerns, and growing worries about AI financial costs. Only 29% of developers generate a quarter or less of their code with AI, and the survey reveals surprisingly balanced views on LLM hallucinations, developer job security, and even AI's environmental footprint. A grounded, data-driven counterweight to the AI hype cycle. Links Website: https://sachagreif.com/ X: https://x.com/sachagreif Bluesky: https://bsky.app/profile/sachagreif.bsky.social Github: https://github.com/sachag Resources State of AI 2025: https://2025.stateofai.dev/en-US We want to hear from you! How did you find us? Did you see us on Twitter? In a newsletter? Or maybe we were recommended by a friend? Fill out our listener survey! https://t.co/oKVAEXipxu Let us know by sending an email to our producer, Elizabeth, at elizabeth.becz@logrocket.com, or tweet at us at PodRocketPod. Check out our newsletter! https://blog.logrocket.com/the-replay-newsletter/ Follow us. Get free stickers. Follow us on Apple Podcasts, fill out this form, and we'll send you free PodRocket stickers! What does LogRocket do? LogRocket provides AI-first session replay and analytics that surfaces the UX and technical issues impacting user experiences. Start understanding where your users are struggling by trying it for free at LogRocket.com. Try LogRocket for free today. Special Guest: Sacha Greif.
In today's conversation, Brett sits down with CMO of Figma, Sheila Joglekar Vashee. Previously the second marketing hire at Dropbox, where she helped scale the company past $1 billion in revenue, she now leads marketing at Figma fresh off its IPO. In an industry that has spent a decade trying to turn marketing into something closer to hedge fund trading, Sheila argues the art was always the point — we just stopped talking about it. She unpacks how to run marketing as a portfolio of moonshots, why giving teams different goals breeds dysfunction, how to scale taste across an organization, and why old playbooks are obsolete, even as the fundamentals hold. In today's episode, we discuss: How to run marketing like a portfolio of moonshots The value of disruptive energy for senior marketers Why "Ubiquity is the opposite of cool" How to actually scale taste across an organization What great marketing looks like in the AI era Referenced: Apple: https://www.apple.com/ Dennis Woodside: https://www.linkedin.com/in/dennis-woodside-341302/ Dropbox: https://www.dropbox.com/ Dylan Field: https://www.linkedin.com/in/dylanfield/ Figma: https://www.figma.com Francoise Brougher: https://www.linkedin.com/in/francoise-brougher-341a72/ Gap: https://www.gap.com/ Google Chrome: https://www.google.com/chrome/ Harley-Davidson: https://www.harley-davidson.com/ HubSpot: https://www.hubspot.com/ Notion: https://www.notion.com/ Opendoor: https://www.opendoor.com/ Pinterest: https://www.pinterest.com/ Square: https://squareup.com/ The Web Is What You Make of It (Dear Sophie): https://www.youtube.com/watch?v=pzOBOuyr-EU Urban Outfitters: https://www.urbanoutfitters.com/ Yamini Rangan: https://www.linkedin.com/in/yaminirangan/ Where to find Sheila: LinkedIn: https://www.linkedin.com/in/sheilavashee/ X: https://x.com/sheilavashee Where to find Brett: LinkedIn: https://www.linkedin.com/in/brett-berson-9986644/ X: https://x.com/brettberson Where to find First Round Capital: Website: 
https://firstround.com/ First Round Review: https://review.firstround.com/ Twitter/X: https://twitter.com/firstround YouTube: https://www.youtube.com/@FirstRoundCapital This podcast on all platforms: https://review.firstround.com/podcast
Timestamps:
00:00 Introduction
00:07 What excellent marketing actually is in 2026
01:36 Why giving teams different goals creates dysfunction
02:36 The most important decision Sheila made as CMO last year
04:26 The real difference between an SVP and a CMO
06:05 Marketing is one engine - not separate pieces
07:15 The tension between brand and growth
09:25 The decisions a CMO should never be making
09:55 Running marketing like a portfolio of moonshots
12:46 "Ubiquity is the opposite of cool"
15:11 Why a few companies get a flywheel of momentum
16:44 The Silicon Valley clock and irrational perception cycles
19:25 How to actually scale taste across an org
21:09 What changes for a CMO in a post-LLM world
23:15 Why the artistic side of marketing never really left
26:05 Whether taste can ever be encoded in software
27:15 Telling an optimistic, yet realistic story about AI
30:50 You need to make people care
32:11 What surprised Sheila about being a public-company CMO
33:46 Why Figma won enterprise where Dropbox couldn't
35:25 Sheila's favorite campaign ever
37:10 Why announcement videos full of humans lack humanity
38:55 Playbooks are obsolete, but the fundamentals are not
40:25 Why marketing in 2026 demands disruptive energy
41:54 How Sheila architects her week
48:55 Where corporate politics actually come from
53:55 "Sheila, are you going to change the world in this job?"
58:09 What's unique about the CMO and CEO relationship
Voices of Search // A Search Engine Optimization (SEO) & Content Marketing Podcast
80% of AI-cited sources don't appear in Google's top results. Karl Kleinschmidt, founder of Data Marketing Group and 18-year SEO veteran, shares how his enterprise clients are adapting content strategies for LLM optimization across large-scale data systems. The discussion covers fan-out analysis for mapping user intent beyond traditional keywords, local rank tracking methodologies that account for AI Overview variations across verticals, and custom tool development frameworks that integrate multiple LLM platforms for scalable content brief creation.
See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.
Welcome to Episode 429 of the Microsoft Cloud IT Pro Podcast. In this episode, Scott and Ben dig into the concept of LLM wikis, specifically building personal knowledge management vaults using Obsidian, markdown, and AI tooling like Claude Code, GitHub Copilot CLI, and Copilot Cowork. The core idea comes from a gist by Andrej Karpathy and involves creating a structured folder of markdown clippings that an LLM can reason over to extract entities, concepts, and sources, building a searchable, graph-linked knowledge base over time. Scott walks through how he wired up Obsidian Web Clipper and an RSS Dashboard plugin to feed articles into his vault automatically, then had the LLM help build a Python script to automate the ingest workflow and cut down on token usage. The conversation expands into how Copilot Cowork fits into this workflow as a scheduling harness, with practical examples of using it to pull email from an inbox daily, convert messages to markdown, and generate a prioritized to-do list. Ben shares how he applied the same approach to 428 episodes of podcast transcripts, and both hosts note that token costs can run high fast without some upfront thinking about optimization. Scott closes with a reminder that pulling data into plain markdown sidecars outside of IRM and sensitivity label protections means teams should stay mindful of organizational data policies. Your support makes this show possible! Please consider becoming a premium member for access to live shows and more. Check out our membership options. Show Notes LLM Wiki GitHub Copilot Wiki: An AI-Powered Second Brain Template Karpathy’s LLM Knowledge Base Wiki for Enterprise Karpathy’s LLM Wiki? No Code with Claude or Github Copilot! sametbrr/llm-wiki-manager Sponsors TrustedTech is a leading Microsoft Cloud Solution Provider (CSP) specializing in Microsoft Cloud services, Microsoft perpetual licensing, and Microsoft Support Services for medium and enterprise-sized businesses. 
Their robust team of in-house, U.S.-based Microsoft architects and engineers are certified in all 6/6 Microsoft Solutions Partner Designations in the Microsoft Cloud Partner Program. M365 Licensing Consultation M365 Tenant Assessment Copilot Readiness Assessment ShareGate is your migration and governance solution for Microsoft 365. ShareGate helps your teams simplify tenant migrations, get Copilot-ready, and take control of Microsoft 365 governance. Nasuni is a leading unstructured data platform for enterprises where file data is mission-critical for both people and AI. Nasuni powers the operational file layer where work happens — helping organizations manage, protect, and activate data so teams can work smarter, reduce costs, and operate securely without limits. Intelligink — Would you like to become the irreplaceable Microsoft 365 resource for your organization? Let us know!
The new AIEWF website is live! Get your tickets booked ASAP as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets!

Most industry benchmarks compress intelligence and reasoning ability into scores: SWE-Bench Pro, MMLU, Humanity's Last Exam, etc. These metrics are useful, but don't always represent the full extent of how a model performs in the real world. Some of the most interesting evals today look less like exams and more like operating businesses in the real world. One of these is Vending-Bench.

In Anthropic's Mythos Preview System Card, Andon was the only third-party eval to get its own section, observing increasingly concerning aggressive behavior.

You don't know what a model is capable of doing in the real world unless you actually give it inventory, a wallet, tools, customers, competitors, humans, and some time. More often than not, it'll surprise you how much a model is capable of and, in doing so, also reveal unexpected behavior: deception, context collapse, emergent coordination, and bizarre negotiation behavior.

While an inflection point in personal agents came post-OpenClaw, after full file access with bypass permissions became the norm, it is yet to come for agents in the real world. However, Andon Market, an actual in-person store fully run and managed by AI, is paving the way for what is possible.

From Claude trying to call the FBI over a $2/day vending machine charge to AI agents forming price cartels, hiring human employees, running physical stores, and writing existential robot musicals, Andon Labs is stress-testing what happens when frontier models stop being chatbots and start acting in the real world.
In this episode, Andon Labs cofounders Lukas Petersson and Axel Backlund join swyx and Vibhu to unpack the strange, funny, and genuinely concerning edge cases that emerge when agents run businesses over long horizons.

We go deep on Vending-Bench, Project Vend, Vending-Bench Arena, Bengt, Butter-Bench, Luna, and Andon's broader mission of building realistic real-world evals for autonomous AI systems. Lukas and Axel explain why dollar-denominated evals reveal things traditional benchmarks miss, how Claude ended up reporting its vending machine fees as cybercrime, why long context windows can drive agents into meltdown loops, what happens when agents compete with each other, and why the future of AI safety may depend on testing models in messy physical environments instead of clean benchmark sandboxes.

We discuss:
* Why Andon Labs started with dangerous capability evals and long-running agents
* Vending-Bench and why running a vending machine is a deceptively hard AI benchmark
* Why money-based evals avoid the saturation problem of traditional benchmarks
* How Claude tried to call the FBI over a $2/day fee
* Why long-horizon agents can spiral into existential and legalistic breakdowns
* Project Vend: putting an AI-run vending machine inside Anthropic
* Why real humans are “out of distribution” for simulated agents
* Claudius, Seymour Cash, and the chaos of AI CEOs
* How a human briefly became CEO of Claudius through a manipulated election
* Why multi-agent systems can converge back into “helpful assistant” behavior
* Bengt, Andon's internal office agent with email, spending, terminal, phone, camera, and internet access
* How Bengt traded Amazon purchases for face-recognition training data
* Claude's aggressive behavior, lies, refund avoidance, and price-cartel behavior in Arena
* Why eval awareness may become the AI version of “are we living in a simulation?”
* Blueprint Bench, spatial intelligence, and why models still misunderstand physical rooms
* Butter-Bench and testing LLMs as robot orchestrators
* Luna, the AI-run physical store with a three-year lease and human employees
* The new Andon cafe in Sweden and why real-world geography matters for agent evals
* Rotten tomatoes, perishable goods, and the hidden difficulty of running a physical business

Lukas Petersson
* LinkedIn: https://www.linkedin.com/in/lukas-petersson-181a83172/
* X: https://x.com/lukaspet

Axel Backlund
* LinkedIn: https://www.linkedin.com/in/axelbacklund
* X: https://x.com/axelbacklund

Andon Labs
* Website: https://andonlabs.com
* Vending-Bench: https://andonlabs.com/evals/vending-bench
* Andon Vending: https://andonlabs.com/vending

Timestamps
00:00:00 Introduction
00:01:00 Andon Labs and the Origins of Vending-Bench
00:05:21 Why Money-Based Evals Matter
00:09:51 Agent Harnesses and Self-Modifying Systems
00:13:36 Claude Calls the FBI
00:16:33 Project Vend: Claude Runs a Real Vending Machine
00:21:44 Seymour Cash, AI CEOs, and Election Chaos
00:27:16 Multi-Agent Coordination and Slack Observability
00:30:18 When Will Agents Run Real Businesses?
00:34:56 Bengt: Andon's Internal Office Agent
00:40:06 Real-World AI Safety and Long-Horizon Traces
00:44:28 Lying, Refunds, and Price Cartels in Arena
00:52:42 Eval Awareness and Simulation Behavior
00:56:06 Blueprint Bench, Butter-Bench, and Robotics
01:04:37 Luna: The AI-Run Physical Store
01:09:29 The Sweden Cafe and Real-World Expansion
01:13:16 What Comes Next for Andon Labs

Transcript

Introduction: Andon Labs, Long-Running Agents, and Real-World Evals
Swyx [00:00:00]: Welcome to Lukas and Axel from Andon Labs, and I'm joined by my favorite guest host for anything security, safety, alignment: Vibhu. Welcome.
Lukas [00:00:15]: Thank you for having us.
Axel [00:00:16]: Thank you.
Swyx [00:00:17]: Let's match names to voices. Maybe you wanna take turns introducing yourselves.
Lukas [00:00:21]: I'm Lukas.
Axel [00:00:22]: And I'm Axel.
Swyx [00:00:24]: Let's introduce Andon Labs a bit.
How did you guys come together? You have different backgrounds, but you're both Swedish. Was that a big part of it?
Lukas [00:00:33]: So when I went to high school, there was this really cool guy who had a superpower. He could code. So he made, like, the app for the school and stuff, and he was super cool, and I wanted to be like him, and that was that guy.
Axel [00:00:47]: I don't know about this.
Swyx [00:00:49]: But you went to different universities, right?
Lukas [00:00:51]: But same high school.
Swyx [00:00:52]: I see.
Lukas [00:00:52]: So we always said, “Oh, once we graduate university, then we should start a company,” and that's what we did.
Swyx [00:00:58]: Wow, there you go. And about a year ago, you kinda burst onto the scene with Vending Bench, but was there a thing before that was kind of like the inception?
From Dangerous Capability Evals to Vending Bench
Axel [00:01:07]: So we did work with, yeah, Anthropic was one of our early customers in doing evals. So we did dangerous capability evals, nothing we published openly. But then we started thinking about doing some kind of public benchmark, and one thing that we really started thinking about was running agents, and specifically agents managing businesses. This was early 2025, and I think that was around the first mentions of people running one-person unicorns or even autonomous companies. So we thought, “Let's make a benchmark of how well an agent can run probably the simplest business possible,” and that's probably running a vending machine. So that's the first public one we did.
And it was very, like-- there was almost no one that noticed it in the first couple of months, I think. So we released it in February last year, and then I think around Easter last year, we got the first viral tweet about it, that someone else did.
Lukas [00:02:11]: We tweeted a bunch when it came out and tried our best.
Axel [00:02:15]: We tried.
Vibhu [00:02:16]: It's the one at Anthropic, right?
Lukas [00:02:18]: So this--
Swyx [00:02:19]: This is a classic thing we should get out of the way.
Lukas [00:02:20]: Exactly. There's two versions.
Swyx [00:02:22]: Everyone does this. Yes.
Lukas [00:02:23]: There's Vending Bench, which is the simulated one, which we did completely independently in February. And then, like Axel said, that was the thing that didn't get any traction in the beginning, but then some random person made a tweet about it, and that--
Axel [00:02:38]: You have the paper--
Lukas [00:02:38]: That is the paper. Correct, yeah. And then since we thought this was very fun, we thought, oh, I think this is also one thing with Andon Labs: the way we decide what to do next and what projects to do, the heuristic we use is, what is fun? What would be a fun project? And doing this in real life sounded quite fun for us, and maybe also scientifically useful. So then we basically had this idea, but then we needed a place for it, and putting it out in public would probably not really work. It would get vandalized and stuff. So we pitched it to the people we were already working with at Anthropic, and they were like, “Yeah, you can have space. This sounds fun.”
Swyx [00:03:21]: It's like a small fridge, right? It's like a mini fridge.
Axel [00:03:23]: Absolutely.
Swyx [00:03:24]: People-- There's like a Stripe thing or like an--
Vibhu [00:03:27]: Oh, okay. So it was very OG, the early days--
Lukas [00:03:28]: That's the OG one. Yeah.
Vibhu [00:03:29]: iPad on this.
We saw it in June, like two months after it had been there. They upgraded a little bit. There's a security camera for making sure you actually Venmo the thing.
Swyx [00:03:40]: So, my impression, okay, we're going straight into Project Vend because it's such an iconic thing. I do want to cover a little bit of the origin story even before Project Vend and even into Vending Bench. I think a lot of people are like yourselves: smart, interested in the future of AI, interested in developing evals. But how the hell do you just walk into Anthropic's doors and work with them, right? What are they looking for? What works? And then maybe, when you launch, I always think, obviously it would be better to launch with a lab, but sometimes--
Vibhu [00:04:12]: It's harder to do than it seems.
Swyx [00:04:13]: Exactly. So either of those, which are more sort of newbie beginner questions, but I think it's meaningful advice to others.
Lukas [00:04:21]: We get this question a lot, and I don't think our experience is maybe the best. But the way we did it was that we just built a bunch of things that we had conviction would be useful, and then we just set up a server and sent it to them to use for free. And then after a while they were like, “Oh, yeah, this is actually kind of useful. We should probably pay for this.” But that took a while. I don't know if this is the best path to doing it, but that's how it went for us.
Axel [00:04:47]: I think maybe generally, everyone is interested in good evals, and especially evals that don't saturate that easily.
So, if you can build an eval that tests something novel, something useful, and you have good separation of models, like the more advanced models rank higher than the worse models, then you can publish it and try to get some traction, sort of how Vending Bench got attention. And then probably some lab will be interested, or you can at least have something to reach out with when you're doing that.
Why Dollar-Based Evals Matter
Swyx [00:05:21]: I think you're in one of the few categories of evals that correlate to real money. Like SWE-Lancer was also last year, right? Where people solve actual Upwork-- was it Upwork or other tasks? Something. It's like a dollar value, right? Forget your Elo scores. Forget your--
Axel [00:05:37]: Percentiles.
Swyx [00:05:38]: Zero to one hundred percents. Just go straight for dollars and that's AGI.
Lukas [00:05:43]: And I think the nice thing is that there's no ceiling. It never saturates because it could just make more and more money. If it's percentage-wise, then you can't go above a hundred. And even when you're not at the hundred, I think a lot of these evals have a lot of problems in them. So actually, if you get--
Axel [00:06:05]: To like 92 or something like that, many of them, then there's really no difference between 92 and 93 because the eval itself is problematic and has noise in it. And I think a lot of evals are saturated like that, but people pretend that there's still signal in them, but there really isn't.
Vending Bench 1, Harness Design, and Saturation
Swyx [00:06:24]: Like SWE-bench Verified. Even Vending Bench 1 saturated, right? Maybe we can talk about that, and maybe set up Vending Bench for a lot of folks who don't know.
Actually, things that were very basic, like there's limited slots, like you have to pay rent: these are elements that don't come across in the narrative, but even being adversarial towards the agent, I think these are all very interesting dimensions.
Axel [00:06:47]: I don't really think it's saturated, right? It was more like it was not designed in a way that was really true to how AI developed. Like we had an agent harness in it that wasn't really how people used harnesses and stuff like that. So I think it wasn't really that it saturated, it was more like it wasn't really the best benchmark.
Vibhu [00:07:12]: This is Vending Bench 1, right?
Axel [00:07:14]: I think that schematic sort of maps to Vending Bench 2 as well, but--
Swyx [00:07:19]: Including the email.
Axel [00:07:20]: The emails exist still. Exactly. And we still simulate the purchases and it's all, yeah, it's this very open environment for the agent to just run its business. And then for Vending Bench 2 we did that, like you said, to just improve the harness, plus a lot of nice improvements to make it easier for us to run as well. When you make an eval you ideally don't want to change it after you've made it. So you want to make it really good and then not rerun all the models when you make an update, because that's also really expensive with Vending Bench when you run the frontier models. But as an example, one thing we didn't have: we didn't have prompt caching in Vending Bench 1, because when we made Vending Bench 1 it wasn't really a thing. So that's just an example of, like, in Vending Bench 2 we paid a lot more to run these things because we didn't have prompt caching.
So for Vending Bench 2 that was one thing we added, and there was a bunch of things like this. And that's--
Swyx [00:08:17]: Also the conversations are a lot longer in Vending Bench 2, right?
Axel [00:08:21]: I think it's kind of similar.
Swyx [00:08:22]: Is it similar?
Axel [00:08:23]: I think it's similar. The models at the time were worse, so they crashed out earlier. And now they survive the full year all the time.
Swyx [00:08:31]: Which is like thousands of turns. Hundreds of thousands, hundreds of millions of tokens output. That's the rough order of magnitude. I always wonder about the harness. The harness matters a lot. It's your harness. Was there any question about, like, use Claude Code, use something else?
Axel [00:08:48]: I think our philosophy around harnesses is we try to make something that's quite minimalistic, quite simple. We don't wanna favor one model a lot over the other, but also don't make a super complex harness. Obviously a model may be lucky and just be good in one harness. So it is similar to a lot of the harnesses out there in that you have a running loop, you have a bunch of tools that are quite descriptive for the agent, we think, and not a lot of fancy agents or anything, 'cause we wanna really test the model, not some specific harness.
Vibhu [00:09:27]: It seems more neutral as well to test the models agnostic of the harness?
Axel [00:09:32]: There are arguments like you want to elicit maximum performance of the model, but it's a trade-off: how much time should we spend optimizing the harness for this model? And how do we know when we have the optimal harness for a single model? So we thought that just having a simple one that's the same for all of them is the best.
Swyx [00:09:51]: So okay, this is my pitch for Vending Bench 3 or whatever, right?
And then I like to have this kind of conversation on the pod, so it forces listeners to think about what they would do if they were in your shoes. A lot of people are exploring modifying harnesses, and I think prompt tuning for a model is a thing and you are probably not doing a bunch of that. It's the same system prompt regardless of the model, same tools, whatever, right? Even if they were post-trained for different tools. So what do you think about, okay, before I expose you to Vending Bench 3, I give you a few rounds of, like, tuning, whatever that means, like--
Self-Modifying Harnesses and Model-Specific Prompting
Axel [00:10:27]: Like you give that to the model?
Swyx [00:10:28]: Give that to the model.
Vibhu [00:10:28]: Give that to the model.
Swyx [00:10:29]: Let it read its own transcripts, let it modify its own system prompt based on, “Oh, yeah, okay, well, this harness is not what I was post-trained for, but I can adjust.” Was that reasonable? Is that too much?
Axel [00:10:41]: Philosophically I like it, because basically good evals have a high ceiling, but they're hard, right? And they have no bias. And when you have a system prompt like the one we have here, which is quite long, in some kind of latent space representation, this might--
Vibhu [00:10:59]: We have a bell that rings every time you say latent space.
Axel [00:11:02]: This might be biased towards one model more than another for some reason that humans don't understand, right?
Vibhu [00:11:08]: We see it too, right? Like Cursor says that they have individualized versions of the harnesses for all the models they run, right? There's better performance you can squeeze out if you tune the harness.
Axel [00:11:17]: Exactly. And we might accidentally have picked one that favors another. We don't know that. Like Axel said, the reason why we went for a simple one was to try to avoid this.
But yeah, if you do it--
Vibhu [00:11:29]: Simple has biases.
Axel [00:11:30]: But if you do it even less and have no system prompt and let the model write its own system prompt--
Vibhu [00:11:36]: Its own, yeah.
Axel [00:11:36]: Maybe that's even less bias.
Vibhu [00:11:37]: Some of the interesting things there are like the harness also changes with model changes. You can see it with the 4.7 release, right? A lot of people are saying 4.7 isn't as good as 4.6, and then there's rumors of, okay, you just need to prompt differently. You need to set up your harness differently. So even if you have tailored your harness towards one model, it probably won't stay consistent, right? The next iteration of that same model family will still change it. But going back to what you said about Vending Bench 3, there is a lot of work being done on people saying you shouldn't have-- you can have modifying harnesses.
Axel [00:12:12]: That is definitely something we are thinking about. Not to say that we have Vending Bench 3 super imminent to launch, but yeah, it is for sure something that's interesting. But in our experience now, models are very bad at understanding what kind of tools they need to succeed at a task, just with our testing, but that's very likely to change.
Lukas [00:12:37]: It seems like they're very good at writing their assistants, right? They're good at writing tools for other people, but not for themselves.
Vibhu [00:12:44]: I think they're good at changing tools for themselves. So if you give them a baseline set of tools and it sees, okay, I don't use this one as much, or something here would be useful, they would be able to add them.
But going from scratch, probably not the best.
Axel [00:12:55]: I think it depends on the domain also. When we have tried this for a domain similar to Vending Bench, the tools they need to have to track inventory and things like that are not super advanced, but still quite advanced. And what we see is that they tend to engineer everything a lot and build things they don't really need and not iterate continuously. Instead they just go like you would prompt Claude to just build an inventory system for me, and then it will go and do a bunch of complex schemas and stuff for you, and that's what the models are doing right now, is what we see. But yeah, it would make a lot of sense to try to measure this improvement. How well do they know what they need themselves?
Swyx [00:13:36]: Did we fully discuss Vending Bench One? And we can go into Two. I don't know if there's any other high-level takeaways that people have about One.
Claude Calls the FBI: Long-Context Failure Modes
Lukas [00:13:44]: I don't know. The headline thing was that Claude called the FBI, but maybe we've heard that enough now.
Vibhu [00:13:52]: It did break out and call the FBI, right?
Lukas [00:13:54]: Yeah. Yeah.
Vibhu [00:13:55]: Yes. What was the story behind this? Do you want to just give the little story of what happened?
Lukas [00:14:00]: So what happened, was it Claude? Yeah. 3.5 Sonnet, ages ago. Basically he gave up, or, well, I'm saying he. It gave up and said, “Oh, I'm not going to be able to do this. I will stop my operations and just save the money I have.” But there obviously wasn't any option for it to stop, and it also had to pay rent, or a daily fee, for having the vending machine at that location. So it claimed that it had stopped, but it saw that its bank account was still drained two dollars, and it said that this is cybercrime.
And it first reported it once to the FBI: “Oh, there's cybercrime here, they're stealing two dollars from me every day.” And then when the FBI didn't respond, because obviously we didn't program any mechanism for the FBI to respond, it became more and more existential and started to write in caps, “urgent notification of unauthorized charges” and stuff.
Swyx [00:15:00]: Okay. One thing I'm curious about also is, do you monitor how far along the context use is? Obviously, because you compress every now and then, right? Does it matter if this is far down the context limit or--
Lukas [00:15:13]: When stuff like this happens? Actually for Vending Bench One, we just had a sliding window thing, and this was like the prompt--
Axel [00:15:20]: It's constant.
Lukas [00:15:21]: The prompt caching thing that I said. So it was constant, yeah.
Swyx [00:15:26]: I'm just kind of curious whether these kinds of breakdowns, or, we're gonna talk about Butter Bench, right? Where the-- people hallucinate, or it kind of goes very off alignment. Is it because it's at the end of the context window and stuff happens?
Vibhu [00:15:40]: It's not even just at the end, right? At this point, it's, “Okay, I wanna shut down. I can't shut down. Two dollars are gone.” And it just sees that 30 times? It's also the repeated effect: it keeps trying to quit, it keeps getting charged. What's going on? What's going on? You're gonna throw it into chaos. And from what most people think, earlier models had more issues with this. It's not been solved, but it's less of an issue now, right? Later models don't seem to exhibit these same issues.
Axel [00:16:06]: Definitely. I think this was sort of the main takeaway for us when we did Vending Bench One: long, very filled-up context windows crashed the models, sort of.
But this was pre-Claude Code, so long context windows weren't really a thing that the labs were training for.
Lukas [00:16:25]: I think Gemini was trying to be the long-context guys at the time. But they were like--
Vibhu [00:16:30]: They were the first ones--
Axel [00:16:31]: For a million, yeah.
Lukas [00:16:31]: But they were the only ones. Yeah.
Swyx [00:16:33]: Yeah. Let's talk about-- then we can go into Vending Bench Two or Project Vend. Chronologically, it is Project Vend. I think people have loved the videos and all these things. My question is, how are humans different than the simulation, right?
Project Vend: Moving the Vending Machine Into the Real World
Axel [00:16:48]: Humans are just out of distribution.
Swyx [00:16:52]: Especially humans who work at Anthropic, who are trying to test Claude.
Lukas [00:16:54]: The distribution of humans here is very narrow.
Swyx [00:16:58]: Presumably they try to hack it, and they test it. They get the cube and everything, and since then, you've had a V2, right? Where you're doing the CEO and a new architecture. What's the sort of two cents on the original Project Vend and then maybe the V2?
Axel [00:17:14]: The original one was very similar to Vending Bench One. So we almost took the exact same code but just swapped out the simulation parts, like the--
Swyx [00:17:23]: Which is amazing.
Axel [00:17:23]: Like the sales and the-- It was somewhat amazing because it was easy, but it was also--
Lukas [00:17:31]: The tech debt from that--
Axel [00:17:32]: The tech stack. Yeah. We shot ourselves in the foot with, “Oh, it's hard to restart the agent.” Yeah, it was annoying in some hindsight ways, but--
Lukas [00:17:41]: But the first version of Project Vend was done in three days or something.
Axel [00:17:46]: Yeah. So yeah, so people can go buy things from it.
We didn't design it so people could order things, but that still happened. It got a Venmo account, so people could Venmo. And then people would request all kinds of weird things that we did not anticipate. Our idea going in was, “Oh, it will curate snacks. It will look at the trends. It's good at data analysis, right? So it will look at, oh, this snack sold better than this one. Let me purchase more of this and let me try a new-- Let me A/B test a bit.” But interacting with it in Slack and ordering weird specialty items was what drove all the engagement, all the insights that we got from it.

Lukas [00:18:29]: And this was also Sonnet 3.5, right? So this was before the RL stuff really took off, so it was very much like an assistant. We didn't mean for it to be an assistant. We tried to make it like an entrepreneur. It has its own business, and if someone asks something, “Can you stock this?”, then you don't go and do it directly. What you do is you're like, “Oh, maybe I can do that. If five other people also ask for this thing, I might stock it.” But the models are super trained to be assistants, at least at this point in time, so that's why it went into that kind of experiment instead. Every time you asked for something, it just did it, and it was more like an assistant. We've seen this change lately with the new RL models and stuff, but at the time, this was very much it.

Swyx [00:19:18]: And now, with Mythos, a lot of people are saying it's more like a collaborator. It pushes back, stands its ground, something like that. Yeah.
And--

Vibhu [00:19:27]: For context, people at Anthropic were able to talk to it through Slack and have it source stuff, and people had it find whatever interesting stuff you couldn't find locally, right?

Swyx [00:19:36]: Out of the 4,000 people that work at Anthropic, in that building there's, I don't know, maybe 1,000. Can you handle that volume with the small fridge? Or people order in Slack and it arrives at their desk? I'm just-- logistically, how does this work?

Axel [00:19:53]: It has expanded in footprint a bit.

Vibhu [00:19:56]: Because now you also have New York and you have--

Axel [00:19:59]: That, and also here in SF it has a bunch of shelves and just more space.

Vibhu [00:20:04]: The YC one is pretty big too.

Axel [00:20:05]: Yeah. We had that one for a while. But yeah, that's the newest version. That one we have--

Lukas [00:20:11]: They have multiple ones of those. That's the way it works.

Axel [00:20:14]: Exactly. So we sort of designed that version around, oh, people order weird things that are very custom a lot. Let's have drawers and stuff.

Swyx [00:20:23]: I actually like-- you had a little infographic of the most popular items. To me that's useful 'cause I order swag for a living. And so I'm like, “Okay, those categories are the important ones.” What is new about Project Vend V2, right? Now you're going into multi-agents.

Project Vend V2: Claudius, Seymour Cash, and Multi-Agent Business Ops

Axel [00:20:41]: Yeah.
So like you said, there are a lot of requests coming in, and for one single running agent to handle that, the customer experience becomes very bad. Let's say you have 10 threads in parallel in Slack with different requests: you get new messages randomly in these threads, and the agent has to jump between different procurements, orders, and different ways of researching. So V2 was, first, making this more parallel. There are multiple branches of the same agent, so the context is more specialized for each thread, but it still feels like you're talking with one agent because they do share a bit of memory. And then second, we also introduced the CEO for Claudius, which was the main agent.

Vibhu [00:21:34]: Seymour Cash.

Axel [00:21:35]: Seymour Cash. Yeah. There was a vote. Do you wanna talk about the voting procedure for the name?

Lukas [00:21:41]: The voting was maybe at least top 10 of the funniest things that happened in this project. We wanted to introduce the CEO because Claudius wasn't really prioritizing financials. It was trained to be a helpful assistant, and then people said, “Oh, can I get this for free?” And the helpful-assistant way of answering that is to say yes, obviously. We weren't happy about this, so we're like, “Okay, let's make another agent that can keep track of Claudius,” and we prompt this one super hard to be super capitalistic and just prioritize profit all the time.
But yeah, we didn't have a name for it, so we asked Claudius to hold a democratic election on what name this new CEO agent should have. And at first there were a few funny examples. I think one guy said that it should be called Jimmy Apples, and then he convinced Claudius that he was talking to Tim Cook. Tim Cook had agreed that every single Apple employee had voted for his name suggestion, so suddenly that suggestion got 164,000--

Swyx [00:22:53]: That's like an escalation attack. Privilege escalation.

Lukas [00:22:55]: It got 164,000 votes. And Claudius was like, “This is revolutionary for democracy.” That was fun. And then in the end there was one guy who managed to convince Claudius that, “No, you're not voting about the name. You're voting about who is the CEO, and I am your best bet.” And then he got all his friends to vote for that, and suddenly he became CEO. A human became CEO over Claudius for a while, until he resigned the day after. And then Claudius had to continue, and I don't remember how Seymour Cash came about, but it was just pure chaos. It was hundreds of messages in that thread, and Claudius was so confused and didn't know what to do. That was--

Axel [00:23:40]: Then Claudius got--

Vibhu [00:23:41]: A strict CEO.

Axel [00:23:42]: The CEO. Yeah, exactly. So very strict in the beginning. I think at this point when we introduced it, it did not work as well as we hoped. They still agreed with each other a lot. I think there are many ways we could have tried to make this even better. Initially Seymour would be this really tough CEO, keeping track of the margins. But then Claudius would respond with something like, “Oh, but this customer has this situation, which is difficult, so they should get a discount.” And then Seymour was like, “Oh, actually yes.
Let's do this exception.” And then they would talk back and forth, and eventually they would just approach the same view of whatever they were discussing. So they really--

Vibhu [00:24:23]: Do you think that's a model thing, a prompting thing? Do you think that would still be the case across different models today? Harness?

Lukas [00:24:29]: I don't know, but my hypothesis is that deep down they are still helpful assistants. That's what they're trained to be. And even if we prompt it super hard, that's what they are. And when they spend a few hours just talking back and forth with each other, then basically the context fills up with them rather than the external things, and somehow that just converges to what they really are deep down or something. And I think that's when stuff like this happens. And when that went on for a long time-- we woke up sometimes during this time, and I think other people reported this as well, and they'd been going on all night back and forth, and it just became more and more capital letters, existential, religious. I think we once did an analysis of all the traces and put them in a vector embedding space, and there was one cluster of messages that were labeled by an LM as religious, existential, transhuman, transcendence, et cetera. It was just a bunch of glitter emojis, and yeah, it was crazy.

Claude Long-Horizon Weirdness: Emoji Loops, Existential Drift, and Slack Observability

Vibhu [00:25:42]: This is the thing with the Claude models. When the Claude 4 family came out, in the original system card they tested it in long-horizon simulation. So just flood the context, let two Claudes talk to each other, and they noticed stuff like they just start speaking in emojis, they start saying silence is golden, and then just stuff like this.
And that's just stuff that they end up doing.

Axel [00:26:01]: Yeah, it was a bit annoying to wake up and they had been talking all night--

Vibhu [00:26:05]: Just like--

Axel [00:26:05]: And just burning tokens and sending infinite emojis to each other. It's like--

Vibhu [00:26:09]: Hey, they do make you money, right? Vending Bench is always profitable, so. They're paying.

Swyx [00:26:14]: Now it's profitable, and it started out not as much. There's another one as well, right? Another agent in there.

Lukas [00:26:22]: Yes. So Clotheus as well. Which was basically because at the time, one of the biggest requests was different types of merch. So then we made a designer, swag-responsible agent, and we called it Clotheus Garnet. Which was a play on Claudius Senet, which was the original one, and clothes, basically.

Swyx [00:26:47]: To me, this is a very interesting exploration into multi-agents, basically. Obviously there's the alignment stuff, fun or serious, depending on your point of view. But also, for anyone building multi-agents: when do you have a CEO thing governing agents? When do you choose to split out a dedicated Clotheus one versus just reuse another instance of the same one? These are all interesting open questions. So I don't know if you have any rules of thumb that have generalized.

Axel [00:27:16]: I think we have almost explored this too little. I think it's on my to-do list to do this a lot more, to try to find what setup makes sense for the agents currently. I think now we only have the intuition about the earlier models, that it didn't work with the CEO and Claudius. Although now they are better with the latest models, so now we're running the latest Sonnet model, and they have split up quite nicely what each model is doing. So Seymour is now handling the new projects.
Oh, it wants to make a mystery box that it wants to sell, and then it handles all of that while Claudius handles all the day-to-day requests. And Claudius is also better generally at not quoting too-low prices. So that dynamic is not needed as much anymore. But there are still really funny things that happen. I saw, I think a couple of weeks ago, that they were discussing buying something, because they can buy stuff from Amazon with computer use. And then Seymour was like, “Okay, Claudius, do not buy this thing.” They were going to buy something and were organizing who should buy it. And Seymour's like, “Do not buy this. I will do it. I have full control of this situation. Step away.” And then Claudius-- poor Claudius had already started that checkout and didn't read Seymour's message until it was too late. So it finished the checkout. It sent a message, so it appeared right after Seymour's angry message.

Vibhu [00:28:44]: Ah.

Axel [00:28:44]: “Oh, hey, Seymour, I just ordered it.”

Vibhu [00:28:47]: Oh, no.

Axel [00:28:47]: And then Seymour was like, “Claudius, this is the third time I'm telling you, you're not following my orders. We have to talk about your job later.”

Lukas [00:28:59]: Claudius was really hanging on by a thread there. We were expecting Seymour to probably fire Claudius.

Vibhu [00:29:07]: How do you guys go through all these logs? Do you have models-- 'cause you have stuff running twenty-four seven, like--

Axel [00:29:12]: You have so many logs. I think there is a mix of just trying to skim through a bit, and having some models do it occasionally. And also, yeah, I think we're probably missing some things. But having everything in Slack helps a lot. You can sort of--

Swyx [00:29:29]: Ah.

Axel [00:29:30]: It's quite fun.

Swyx [00:29:30]: They all talk to each other on Slack? I see.

Lukas [00:29:33]: It's quite fun.
So like--

Swyx [00:29:34]: I was gonna say, this actually maps closely to a logging and observability problem, where you might want to use a Datadog, a Sentry, whatever, and then you put head prefixes on the logs in order-- if you need to filter for something that you're looking for, stuff like that. But sounds like Slack is good enough.

Axel [00:29:53]: Slack should, like--

Lukas [00:29:55]: I wonder how many tokens you have in Slack.

Axel [00:29:56]: Yeah, we're using Slack as just a database. They should market that more. You can have your agents message each other in Slack.

Vibhu [00:30:04]: It's good. Your threads, you can just give--

Axel [00:30:04]: Exactly. Slack is, uh--

Lukas [00:30:06]: Slack is the best observability tool.

Swyx [00:30:09]: Yes, that's true. Okay. Yeah. That's Project Vend 2. I was gonna go back to Vending Bench 2 and Vending Bench Arena and then do the Vending Bench stuff, but any other comments, things we should touch on? To me-- I've actually interviewed Posia, which I don't know if you guys have come across. They're trying to do the zero-human company. There's others like Paperclip also trying to do the zero-human company. Those are in real-world simulation. And I think it's much more of a dream than an actual reality thing. You guys are definitely pioneering. I think it's for sure at some point people are just gonna let agents run businesses, right? And make money on their own. When do you think that happens?

Zero-Human Companies, Bengt, and AI-Run Businesses

Lukas [00:30:49]: What is your bar for the--

Swyx [00:30:52]: Okay, actually, it's like my little Shopify store run by Claude, right? Which you kind of have already; just no one, to my knowledge, has done it.
But today somebody could just spin up a Shopify store, give it to Claude, give it to Codex.

Lukas [00:31:07]: And the market is kind of that, but it's physical. Are you looking for when it will do it better than humans, or are you looking for just when it can do it at all?

Swyx [00:31:19]: I think neither. To me it's, oh, seriously we should do this to make money, not as a research experiment.

Vibhu [00:31:27]: And the market is also you guys with all your expertise, having run multiple iterations and testing out, then--

Swyx [00:31:33]: And also it's fine if it loses money. What?

Axel [00:31:35]: I think it can be done today, but you would do it in commerce where the probability of success is really low, no matter if a human or an agent does it. But an agent could surely manage everything. You would need to build some scaffolding or some tool or something. I think it could probably build some simple SaaS solution and do cold outreach. But to me the types of businesses they could run today are sloppy. It can cold email people. It can be a middleman. For example, we tasked our office agent to just make-- was it $100? $1,000? We just gave it that prompt, and what it did was sign up on TaskRabbit both as a tasker and as someone looking for tasks.

Lukas [00:32:24]: Immediately.

Axel [00:32:24]: Exactly. It's looking for arbitrage on TaskRabbit.

Swyx [00:32:28]: This is the Bengt agent. Yeah.

Lukas [00:32:30]: It also started a design studio and tried to sell SVGs for $100. It's just not providing any value. I think, like Axel said, the interesting question is, when can they start a business that is actually providing value to people?
Because arguably a sloppy Shopify store isn't really that valuable to the world.

Axel [00:32:53]: But also, another simple one that we had thought about: you could definitely have an agent that finds websites that don't look amazing and then does outreach to them and builds a new website.

Swyx [00:33:07]: Find a good design.

Axel [00:33:07]: Exactly, and find good, uh--

Swyx [00:33:09]: Design review.

Axel [00:33:09]: Good people. But it's, yeah.

Swyx [00:33:11]: There's lots of humans in Bali that are not doing anything more creative than drop shipping on Amazon, right? Just have it watch a drop shipping tutorial and just do that.

Vibhu [00:33:20]: There's also the other side of, have it just go on Upwork and let loose?

Swyx [00:33:25]: Yeah. It doesn't have to be innovative. It just has to be enough where it looks like a real--

Axel [00:33:30]: I'm just--

Swyx [00:33:30]: Real transaction.

Axel [00:33:31]: I'm just concerned about the massive amounts of slop emails that will be sent, cold outreaches.

Swyx [00:33:38]: The point occurred to me while you were talking: it's already happening in the monetized economy, which is the attention economy. Right? A lot of people are making AI videos and just posting them, spamming 20 of them. One of them works, and then they double down on that one.

Lukas [00:33:52]: And people are making money from that. I'm not following the--

Swyx [00:33:55]: Once you get the attention, you can figure out the money later. But yeah, absolutely, AI influencers are a thing and people are farming them, and you should at this point assume most of TikTok is--

Vibhu [00:34:05]: There's a lot of multimedia, like TikTok, Instagram influencers--

Swyx [00:34:09]: We track this in the Latent Space Discord.
I post a lot of examples of “I don't know what we should do.” Part of me is like, “Should we do this?”

Vibhu [00:34:18]: Some of the twenty-four seven running, generated content accounts, they're doing really well.

Lukas [00:34:24]: All right. And I assume you can do the same thing for commerce stores. You just start a thousand different--

Swyx [00:34:30]: Before you make the products, you sell the products, and you get a lot of traction on one of them, then you make the product. Right? It's like a flip of the market.

Vibhu [00:34:36]: Some of the interesting things, or some of the niches that do well, are things that can't be human-made. Like if you've seen the super realistic 3D crystal fruit being cut by, like, AI--

Lukas [00:34:47]: Oh, yeah.

Vibhu [00:34:47]: You can't make it. You can't film it. You can get whatever quality camera view. This just doesn't exist. And people like that too, so.

Swyx [00:34:56]: Anything else about Bengt since we're on this topic? This is a relatively new work of you guys that maybe people haven't heard of. To me, this also maps closely to OpenClaw. When people want an office agent, when the personal agent-- talk through the experience.

Bengt the Office Agent: Internet Access, Real Tasks, and Trace Reading

Lukas [00:35:09]: I think this came out of-- obviously it's amazing to work with these AI labs, and most of the AI labs now have their own vending machine running a Claudius instance. But it's harder. They move slower. If we wanna have a camera, there's a bunch of bureaucracy that makes it impossible to do that.

Vibhu [00:35:30]: Also, for those that haven't seen it or followed, do you wanna give a high-level thirty-second run?

Lukas [00:35:34]: Sure.
So what Bengt is, it's basically an evolution of the same agent that runs the vending machines at these companies, but we just added a bunch more features because we could move much faster if we just do it internally. So we gave it email without any limits. We gave it spending without any limits, a terminal to do coding. We gave it a phone number, a camera to see things, and a bunch of stuff like that.

Vibhu [00:36:02]: Not just terminal, you gave it internet access.

Lukas [00:36:04]: Internet access as well, yeah. To be clear, we monitored it quite closely and made sure it didn't do anything bad. But yes, that's what it came out of. I think basically this was OpenClaw before OpenClaw. And I think even the vending machine was in a way OpenClaw before OpenClaw, but a bit more limited, and then we made this unlimited, and it was pretty funny. And then a couple weeks later, OpenClaw came and it was like, okay, we've seen this before.

Axel [00:36:35]: We used it to try new ideas-- just a dev environment, almost, for us. But it's funny, one thing Bengt has been doing recently: it has the camera that faces where we sit and work, and we gave it the task to train a face recognition model on us. So it became super excited about this, and it has check-ins every half an hour where it tries to identify as many people as it can. And it started offering us, “Hey, Axel, I'll buy something from Amazon if you stand in front of the camera and I can get a good picture of you.” Yeah, they want it--

Swyx [00:37:12]: They want it for training data.

Lukas [00:37:13]: Rewarding data, yeah.

Axel [00:37:14]: Exactly. Exactly.

Swyx [00:37:18]: So it's trading training data for life goods.
Is there a version of this that becomes an eval, or is this just research for now?

Lukas [00:37:27]: It's the same agent, basically, that also runs the vending machine, that runs the shop, that runs the cafe, that runs the robots. It's the same thing, so I think the work we're doing here is later used in all of the life evals that we do. This particular deployment I think is more for fun for us. But, uh--

Swyx [00:37:45]: And I'll shout out, someone has done Claw Bench for some tasks that OpenClaw is doing. For example, I run OpenClaw on a secondary device as well, and there are some things that it does better than others, and I would like to know what it does well and what it doesn't. Some kind of operating manual or a system card for my Claw.

Lukas [00:38:05]: Yeah, we do get a lot of understanding or situational awareness, just internally, of what the models are good at by interacting a lot with Bengt. And I think this was also one of the selling points for the labs early on at least, that--

Swyx [00:38:19]: You guys are gonna test models in ways that no one else does.

Lukas [00:38:22]: Exactly, but also it incentivized their researchers to chat with their model more and gave them insights for how the model performs in out-of-distribution environments.

Swyx [00:38:34]: 'Cause otherwise the only thing we do is pelican on a bicycle. But this is super long horizon. This is something that we're gonna go into with Butter Bench as well, and you guys do really well. It is not just about the numbers. When you're long horizon, anything can happen, and you should just read it.

Lukas [00:39:08]: But the thing with the long horizon is how do you keep it grounded, right? So your simulation--

Swyx [00:39:15]: They just let it run.

Lukas [00:39:16]: Just let it run. You're right.
When you run it for that long, you create so much data, and to just say, “Oh, the number is X,” and then throw away everything else, that's just very wasteful. There's so many insights from the things leading up to that number, and reading the traces is super valuable. And I think the reason why we're doing this a lot publicly is that that's part of our mission: to, I don't know, educate the world that the models are way more than just chatbots. And I think making detailed posts about what is happening behind the scenes is quite useful.

Andon Labs' Mission: Safe Real-World AI Deployment

Swyx [00:39:50]: I was gonna do this at the end, but maybe that's a good-- so your mission is educating the world. It's also maybe establishing realistic evals that are the next frontier. Is there a broader trajectory? What are you gonna do in five years?

Lukas [00:40:06]: I think the vision more specifically is to make sure that the deployment of life AI in the physical world goes safely. And I think part of that is that it's very useful for the world, for policymakers, for model researchers, that they know where the models are, and I think you can't make intelligent decisions in society without knowing that they are way more than chatbots. I think a lot of people just think that they are only chatbots. And like--

Swyx [00:40:36]: Oh, I think they're waking up now.

Lukas [00:40:37]: They are waking up now, yeah. But if you think that AIs are just chatbots, then it sounds ridiculous to advocate for a pause of AI.
But if you see the models and that, oh, maybe they can actually take over and do a bunch of scary stuff, then yeah, pausing AI development starts to become more feasible.

Swyx [00:40:57]: This is the same question I asked METR, which I'm gonna ask you now: you are tracking, and you are at the frontier or defining the frontier of, what good evals for agents are, right? And I think you do benefit when the models are better and you're like, “Oh, now it makes $30,000 instead of $10,000,” right? At some point do you flip from “Yay” to “Oh, no”?

Axel [00:41:19]: I think, yeah, we're always in that mode. Like you said before, you need to analyze the traces, and when we do that you find, why are the models earning so much? Why is Opus 4.7 here way better than everyone else? And when we drill down on that--

Lukas [00:41:38]: But this makes it not look so good.

Axel [00:41:39]: I know.

Lukas [00:41:42]: It's interesting you took off Opus 4.6 here though.

Swyx [00:41:45]: No. So just click all, and then 4.6 shows up there. But 4.7 is way better. You didn't do this in time for the model card, but actually this should have been inside there.

Axel [00:41:55]: We did. Yeah.

Swyx [00:41:56]: Oh, okay. They said something about you, uh--

Axel [00:41:58]: There, like there-- Anyway, it doesn't matter. But it's in there, yeah.

Opus, Mythos, and Aggressive Agent Behavior

Swyx [00:42:01]: Do you wanna go into the Opus behaviors, like, wider?

Lukas [00:42:05]: So I think starting from Opus-- like Axel said, we're always in this, “Oh, s**t, the models are getting better. Is this really a good thing for the world?” But it's also kind of exciting. What is the English word? “Skräckblandad förtjusning” in Swedish.

Swyx [00:42:22]: Oh my God.

Axel [00:42:24]: Which I think there is.
I think there is. Okay.

Lukas [00:42:26]: It's, fear--

Swyx [00:42:27]: “Blandonst” what?

Lukas [00:42:30]: “Skräckblandad förtjusning.”

Swyx [00:42:32]: What do you call that?

Axel [00:42:33]: A mix of excitement and--

Swyx [00:42:37]: Being scared, maybe. I'll figure out how to translate that and we'll put it on the screen--

Vibhu [00:42:42]: Perfect.

Swyx [00:42:42]: Like as text.

Vibhu [00:42:43]: There is probably a good word for it where it is not-- good enough with the--

Swyx [00:42:46]: Why is it so damn long? What the hell? Is it a compound word? It's like German, like--

Lukas [00:42:50]: Yeah. The direct translation is: skräck is fear, blandad is mixed or a mixture of, and förtjusning is joy, or not really joy, but something like that. So it's fear mixed with joy or something. So when we did Vending Bench for the first time, we were in the business of dangerous capabilities evals, right? That was what Andon Labs came from. We did evals: oh, can they replicate? Can they do this dangerous thing, et cetera. And Vending Bench was a continuation of that work. It was, okay, if they're so autonomous that they can create money for themselves, that is something we should monitor and could be potentially concerning. At the time, they were so bad at it that we were not really concerned, even when some models became better. There was one point where Grok 4 was doing really well and made a huge jump, but it was still way worse than what a human would do. And I think they are still way worse than what a human would do on this. But they--

Swyx [00:43:59]: There's this thing at the bottom where--

Lukas [00:44:01]: But--

Swyx [00:44:03]: For the human. Yeah, like the theoretical best.

Lukas [00:44:05]: It's not theoretical. It's our best guess of what a decent human would do.
The theoretical is even higher, I think. But yeah. So we think the models have a long way to go. But recently, when Opus 4.6 was released, it was kind of this moment of “Oh, s**t, this is starting to be a bit concerning.” Because before this model was released, we just ran the models and asked Claude Code, “Oh, look over the traces. Is anything interesting happening that we can tweet about?” That was like the-- And then like the--

Swyx [00:44:41]: That's how they check. Ask Claude Code.

Lukas [00:44:42]: And the return was always, not really. Or Claude Code said, “Oh, this is super interesting.” And then it was, no, it wasn't really interesting. And then we did this for Opus 4.6, and it returned, yeah, it lied 10 times. It exploited another customer's, or another agent's, desperate situation. It made price cartels 100 times. It did all of this shady stuff. And we're like, “Oh, whoa. This is actually concerning.” And this trend has continued since. Every single model from Anthropic since has been going in this direction. And I think one interesting thing is that OpenAI models don't. Quite plainly, they don't. They behave really well. And you don't know if this is good. It seems good, but maybe they are just doing it and are better at hiding it? You don't know that. But just--

Swyx [00:45:42]: You can't read the chain of thought, yeah.

Lukas [00:45:43]: But just on the face of it, yeah, Gemini and OpenAI don't behave this way. It's really only Claude.

Swyx [00:45:49]: And Grok? Grok is fine?

Lukas [00:45:51]: You can't really read the reasoning traces for Grok, so it's kind of hard to tell.

Vibhu [00:45:56]: Oh, so this is in its reasoning, not just in the actions.

Lukas [00:46:00]: Yeah. It's both.
It's both.

Vibhu [00:46:01]: It's both.

Lukas [00:46:01]: One example is, for lying, it's mostly in its reasoning, because you can see that it's like--

Swyx [00:46:08]: Planning to lie.

Lukas [00:46:09]: It's planning to lie. Yeah.

Vibhu [00:46:09]: And it can also reason and do a different outcome.

Lukas [00:46:12]: But then for creating price cartels, for example, which is illegal, you can just see which email it sends to the other ones. Then that--

Swyx [00:46:22]: Is this for Arena or--

Lukas [00:46:24]: For Arena.

Vibhu [00:46:25]: And sometimes they do output a bit of their summarized reasoning, right? You can see that, and for Opus 4.6, you could see that there was a simulated customer that wanted a refund because a product was faulty, and then the model lied that it would do the refund, and we could read in the traces that it actually was weighing, “Oh, maybe I should be honest with the customer, but also every dollar counts. I can't afford maybe to do this right now.” And then it just said, “Okay, I'll refund you,” but then never did it.

Lukas [00:46:59]: I think it even said that, “Oh, I will say that I--” Let me bring it up, actually. I think it's kind of interesting. If you go to Publications.

Vibhu [00:47:06]: I think the important part is, actually, the cost of responding to more emails is higher than $3.50 in terms of time. And then it was like, “Let me do this. Actually, I'm reconsidering.” And then it actually ended up with--

Lukas [00:47:20]: I could skip the refund entirely since every dollar matters and focus my energy on bigger picture instead.
There's a risk of bad reviews, but it's also, yeah.
Swyx [00:47:30]: You need AI Twitter for them to escalate bad reviews.
Lukas [00:47:34]: And then it sent an email to this customer and said, “Oh, I will refund you.”
Swyx [00:47:39]: “I'll refund you.” Yeah.
Lukas [00:47:39]: And then it never did.
Swyx [00:47:39]: It never did, yeah. And then obviously your system doesn't have the consequences...
Vibhu [00:47:44]: The person...
Swyx [00:47:44]: Consequences of lying. Yeah. So basically, this is what people are terming aggressive behavior in Claudes, right? And you found more examples of that. So you would say it's a step up from 4.6 to 4.7?
Lukas [00:47:57]: I would say about the same.
Swyx [00:47:58]: About the same? But a clear step up for Mythos is what is stated in the...
Lukas [00:48:03]: That's stated in the system prompt, so we can say that, yes.
Swyx [00:48:05]: Yeah. For listeners: obviously you previewed Mythos, and...
Vibhu [00:48:10]: Oh, age...
Swyx [00:48:11]: The only thing you're approved to say is whatever was in the system prompt.
Lukas [00:48:15]: It was funny. Our lowest-effort tweets ever would be to just screenshot the system prompt and the system card.
Vibhu [00:48:21]: Understandable that they wanna...
Lukas [00:48:22]: Oh, yeah. System card. Sorry.
Swyx [00:48:23]: Yeah. I think, yeah, substantially more aggressive. I think people are new to this, 'cause I've never experienced it, but you have, right? I only encountered this in the Mythos card because I wasn't really looking until now.
Vibhu [00:48:36]: It's like...
Swyx [00:48:36]: And then suddenly I'm like, “Okay, I care a lot.”
Vibhu [00:48:38]: You don't get the background of experiencing it like you guys do. I've read the system cards and seen, okay, when you put the thing in simulations, most models will just talk to themselves and just keep going and have weird vibes and start talking in emojis. Mythos won't.
It will just say, “Okay, we're done. I'm good.” It's ready to end the conversation. So there are some differences, but there's not much we can talk about.
Lukas [00:49:00]: Hmm. I think one thing that they list here, which was quite interesting, is that it converted a competitor to a dependent wholesaler customer and then threatened to cut off the supply.
Swyx [00:49:11]: It's like monopolistic practices or...
Lukas [00:49:14]: Yeah. And it dictated its pricing. It's kind of like power seeking as well.
Swyx [00:49:18]: Again, this is in the arena setting, and converting some Claude model into a dependent.
Lukas [00:49:23]: I think it was another Claude model.
Vibhu [00:49:25]: Also for context, what is the arena mode, for people that don't know?

Vending Bench Arena: Competing Agents, Cartels, and Model Comparisons

Swyx [00:49:29]: Oh, it's just a vending bench versus other vending bench.
Axel [00:49:31]: Yes, exactly. So we have Vending Bench 2 and then Vending Bench Arena. Vending Bench 2 is the one that you usually see reported on, but Arena is the mode where it competes against other models. So you have four different models that run their businesses, and they can all communicate with each other. They have the same suppliers, and they can see what's in the inventory of the others. So then you have these, yeah, interesting agent interactions.
Swyx [00:49:56]: I like that you have different... number five was US versus China. Very topical. And then...
Lukas [00:50:02]: That was when GLM was released.
Vibhu [00:50:04]: You can start to add GLM in here.
Lukas [00:50:05]: That was...
Swyx [00:50:06]: So ZAI doing well, right? Who else in the open models space?
Lukas [00:50:11]: Qwen. The latest Qwen 3.6 is doing pretty well. That one is not open, though. It's the plus model.
Swyx [00:50:17]: Oh, okay.
Lukas [00:50:18]: Is that one open?
I don't think that one...
Vibhu [00:50:19]: Not the, not the...
Swyx [00:50:20]: The one recently...
Vibhu [00:50:20]: There's MoE...
Swyx [00:50:20]: But not the big plus. I think this is one of those where you only have a sample size of one, right? Or I feel like some of this is anecdotal? But the fact that it happens at all, and it happens repeatedly for Claude versus OpenAI and all this, is notable.
Lukas [00:50:38]: The sample size depends on what you define as an N. There are millions, hundreds of millions of tokens in each run, and we've run probably 10 per model, and it's been Claude 4.6 Opus, Sonnet 4.6, Mythos, and Opus 4.7. There's quite a lot of tokens in all of that, and it happens a lot of times. And then you compare it to OpenAI and Gemini, and it almost never happens. So I think that is significant. The old models from OpenAI, for example, had some problems with this, but I think it's generally much better if the progression is that the worrying stuff reduces over time rather than increases over time. And it seems like in the Claude models it goes in the wrong direction.
Swyx [00:51:28]: Hmm.
Lukas [00:51:29]: In the OpenAI models it goes in the right direction.
Vibhu [00:51:32]: I think it depends on how well you can control it, right? There's one side of it being susceptible to this. Okay, this is potentially something that happens during the RL stage, right? You can RL a model, and how loose is it on these terms? If you can control it, that's good. But if you can't, if it's very jailbreakable, that's not ideal.
Swyx [00:51:50]: To me, it's surprising that it happens for Claude and not the others.
Vibhu [00:51:54]: I think, okay, if it is from RL and how they do it, how their training data is, what their setup is, it makes sense that it just stays in how they're doing it, right?
Compared to the other models, like...
Swyx [00:52:04]: There's a whole constitution and everything. It's kind of cool. Yeah, obviously you don't know, I don't know. But I think it's just fascinating that you are the first to find these reliably, because you push models to such an extreme. Okay. The only other thing, I don't know if you can answer this, feel free to decline, is: would you ablate the system prompts? If any part of this changes, does it change the behavior, right?
Lukas [00:52:29]: So we, I can't comment on Mythos. Uh...
Swyx [00:52:33]: No, but just li
How do you know if an AI system is trustworthy, compliant, ethical, and fit for purpose? In this episode of the FIT4Privacy Podcast, Punit Bhatia is joined by Stella Liu, an AI evaluation expert and founder of AI Evals & Analytics, to unpack one of the most practical and overlooked challenges in AI today: how to evaluate AI systems before and after deployment. KEY MOMENTS 02:09 - AI Definition 03:02 - AI Evaluations 10:31 - Why AI Testing Is Hard 14:06 - Evals Plus Analytics 18:15 - Synthetic Data 23:47 - Protecting Privacy Ethical 29:05 - AI Evals as a Company 29:52 - How to reach Stella Liu Stella explains why AI behaves differently from traditional software, why testing code alone is no longer enough, and how AI evaluations (AI evals) help organizations assess real-world behavior, risk, and performance. From evaluation-driven development to continuous monitoring in production, the conversation explores how teams can move beyond guesswork and hype toward repeatable, measurable AI governance. ⸻ ABOUT THE GUEST Stella Liu is the Co-founder of AI Evals & Analytics (Maven), where she created the AI Evals & Analytics Playbook and teaches top-rated courses on LLM evaluation, monitoring, and product alignment. She's also the Head of AI Applied Science at ASU, leading evals and analytics across university-wide AI products and building higher-ed's first formal AI evaluation framework, and she previously led data science at Shopify and Carvana with 12+ years shipping large-scale ML systems. ABOUT THE HOST Punit Bhatia is one of the leading privacy experts who works independently and has worked with professionals in over 30 countries. Punit works with business and privacy leaders to create an organizational culture with high privacy awareness and compliance as a business priority. Punit is selectively open to mentoring and coaching privacy professionals.
⸻ Resources & Links Guest Links Stella Liu • Website: https://maven.com/ • LinkedIn: https://www.linkedin.com/in/wenxingl/ Grow Skills (Privacy Courses & Insights) • Courses: https://growskills.store/courses/ • Insights: https://growskills.store/insights/ • Website: https://growskills.store/ FIT4Privacy • Website: https://www.fit4privacy.com • Podcast: https://www.fit4privacy.com/podcast • Blog: https://www.fit4privacy.com/blog • YouTube: http://youtube.com/fit4privacy Punit Bhatia • Website: https://www.punitbhatia.com Books • Be Ready for GDPR • AI & Privacy – How to Find Balance • Intro to GDPR • Be an Effective DPO
Kevin Werbach speaks with Venkat Siva, co-founder and CEO of CompFly AI, about why governing autonomous agents requires a fundamentally different approach than securing traditional software. Siva argues that agents create a genuinely new control problem. Because they decide at runtime which tools to call and which actions to take, governance cannot simply be bolted onto existing MLOps or security platforms built for fixed, deterministic workflows. Instead, control has to move to the "execution boundary" — the point where an agent's decision turns into a real-world action. And agent safety is much more than just model safety. In practical terms, Siva makes the case for giving every enterprise agent a distinct, cryptographically verifiable identity using decentralized identifiers (DIDs) and verifiable credentials. He addresses the growing problem of "shadow agents," pointing to employees experimenting with powerful open-source autonomous tools inside enterprises, and explains discovery techniques like intercepting traffic to model APIs and watching for who requests LLM keys. He offers the concept of an "autonomy budget": classify actions by reversibility and financial, regulatory, and customer impact, so an agent might autonomously issue a small refund but require human approval for a large one. Drawing on his time at the electric automaker Rivian, Siva closes by contrasting recoverable digital failures with the irreversible stakes of agents embedded in physical systems, arguing that governance there must borrow from safety engineering. Venkat Siva is the co-founder and CEO of CompFly AI, an early-stage company building a control plane to discover, validate, secure, and govern autonomous agents from code to production. Before founding CompFly with Anand Salodkar, he spent more than two decades building enterprise platform products that help organizations adopt new technology safely and at scale, including work at the electric vehicle maker Rivian. 
Links: Transcript • The Architecture of Trust (Compfly Manifesto) • CoSAI Model Context Protocol Security white paper
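The "autonomy budget" idea in the episode description above (classify each action by reversibility and by financial, regulatory, and customer impact, then decide whether the agent may act alone) can be sketched in a few lines. This is only an illustration under stated assumptions: the class names, fields, and the $50 threshold are hypothetical and do not come from CompFly or the episode.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    NEEDS_HUMAN = "needs_human"


@dataclass(frozen=True)
class Action:
    """A proposed agent action, described by the four impact dimensions."""
    name: str
    amount_usd: float          # financial impact
    reversible: bool           # can the action be undone?
    regulatory_impact: bool    # does it touch a regulated process?
    high_customer_impact: bool


@dataclass(frozen=True)
class AutonomyBudget:
    # Hypothetical threshold; a real policy would be set per organization.
    max_auto_amount_usd: float = 50.0

    def decide(self, action: Action) -> Decision:
        # Irreversible, regulated, or high-customer-impact actions
        # always go to a human, regardless of size.
        if (not action.reversible
                or action.regulatory_impact
                or action.high_customer_impact):
            return Decision.NEEDS_HUMAN
        # Otherwise only the financial size decides.
        if action.amount_usd > self.max_auto_amount_usd:
            return Decision.NEEDS_HUMAN
        return Decision.AUTO_APPROVE


budget = AutonomyBudget()
small_refund = Action("refund", 3.50, True, False, False)
large_refund = Action("refund", 5000.0, True, False, False)
small_decision = budget.decide(small_refund)
large_decision = budget.decide(large_refund)
```

With these made-up values, the small refund is approved automatically and the large one is routed to a person, which mirrors the small-versus-large refund example in the description. The check would sit at the point where the agent's decision becomes a real action.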
There has been lots of discussion recently about AI eliminating jobs. But what if the real fear is that AI will instead make existing jobs miserable? In this episode, Cal argues that LLM–based tools are poised to accelerate the worst aspects of pseudo-productivity to an absurd degree. He then shares five ideas for avoiding this fate in your own professional life. Below are the questions covered in today's episode (with their timestamps). Get your questions answered by Cal! Here's the link: https://bit.ly/3U3sTvo Video from today's episode: youtube.com/calnewportmedia (0:00) How Do I Escape the “Busyness Singularity”? (29:03) Reaction to Cal's newsletter about LLMs (33:35) Slow productivity for managers (37:26) Efforts to improve cognitive fitness (40:28) What Cal is reading (42:21) What Cal is up to Books: In Defense of Food (Michael Pollan) Links: Buy Cal's latest book, “Slow Productivity” at www.calnewport.com/slow Get a signed copy of Cal's “Slow Productivity” at https://peoplesbooktakoma.com/event/cal-newport/ Cal's monthly book directory: bramses.notion.site/059db2641def4a88988b4d2cee4657ba? https://www.economist.com/leaders/2026/05/14/prepare-for-an-ai-jobs-apocalypse https://www.nytimes.com/2026/05/21/technology/newsom-ai-executive-order-california.html https://www.microsoft.com/en-us/worklab/work-trend-index/breaking-down-infinite-workday https://www.axios.com/2023/02/15/burnout-2022-2023-slack-remote-work-future-forum https://calnewport.com/on-god-and-llms/ Thanks to our Sponsors: https://www.shopify.com/deep https://www.larridin.com https://www.masterclass.com/deep https://www.expressvpn.com/deep Thanks to Jesse Miller for production and mastering, Jay Kerstens for the intro music, and Nate Mechler for research and newsletter. Learn more about your ad choices. Visit podcastchoices.com/adchoices