Podcasts about sonnets

A poetic form, traditionally of fourteen lines with a specific rhyme scheme

1,256 podcasts
3,635 episodes
27m average duration
1 daily new episode
Latest: Jul 20, 2026

Popularity chart (2019 to 2026)

Best podcasts about sonnets

SONNETCAST – William Shakespeare's Sonnets Recited, Revealed, Relived

168 episodes with sonnets

Shakespeare Sundays with Chop Bard

155 episodes with sonnets

Unbound Sketchbook

155 episodes with sonnets

A Voix Haute

85 episodes with sonnets

The Daily Poem

26 episodes with sonnets

The Grey Rooms

53 episodes with sonnets

Shakespeare’s Sonnets

155 episodes with sonnets

Poetry For All

22 episodes with sonnets

Everyday AI Podcast – An AI and ChatGPT Podcast

16 episodes with sonnets

The Persistent Rumor

42 episodes with sonnets

Poem-a-Day

22 episodes with sonnets

Classic Poetry Aloud

35 episodes with sonnets

Words in the Air: 52 Weeks of Poetry

20 episodes with sonnets

Rusty Sonnets

40 episodes with sonnets

Shakespeare Saga

30 episodes with sonnets

Audio Poem of the Day

12 episodes with sonnets

Latent Space: The AI Engineer Podcast – CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

14 episodes with sonnets

This Day in AI Podcast

12 episodes with sonnets

Folger Shakespeare Library: Shakespeare Unlimited

9 episodes with sonnets

The Slowdown

8 episodes with sonnets

The History of Literature

8 episodes with sonnets

Beyond Shakespeare

23 episodes with sonnets

The Pendant Shakespeare audio drama anthology

18 episodes with sonnets

Podcast Shakespeare

25 episodes with sonnets

Dialogic

21 episodes with sonnets

The Grimerica Show

7 episodes with sonnets

In Our Time

5 episodes with sonnets

TWIP! Pendant Productions audio drama news

11 episodes with sonnets

???? More to Read

11 episodes with sonnets

The AI Breakdown: Daily Artificial Intelligence News and Discussions

7 episodes with sonnets

People's Guide to the Cthulhu Mythos

12 episodes with sonnets

The Nonlinear Library

16 episodes with sonnets

Black Clock Audio Tales: Audio Books, Science Fiction, Folklore, Gothic Literature, Classic Horror, and the Cthulhu Mythos

12 episodes with sonnets

The Cloud Pod

8 episodes with sonnets

The Fourth Way

9 episodes with sonnets

PodCastle

6 episodes with sonnets

Women of Substance Music Podcast

6 episodes with sonnets

Procento Miloše Čermáka

10 episodes with sonnets

Books & Writers · The Creative Process

6 episodes with sonnets

Marketing Against The Grain

5 episodes with sonnets

(sub)Text Literature and Film Podcast

5 episodes with sonnets

The Bardcast: "It's Shakespeare, You Dick!"

6 episodes with sonnets

Poetry · The Creative Process

6 episodes with sonnets

Music & Dance · The Creative Process

6 episodes with sonnets

Poetry Unbound

3 episodes with sonnets

amimetobios

11 episodes with sonnets

Free Audio-Books

22 episodes with sonnets

Social Justice & Activism · The Creative Process

5 episodes with sonnets

The iServalan™ Show

11 episodes with sonnets

??????? with ???

5 episodes with sonnets

Front Row

5 episodes with sonnets

Shakespeare Is My Home Slice

14 episodes with sonnets

Lenny's Podcast: Product | Growth | Career

3 episodes with sonnets

No Holds Bard

7 episodes with sonnets


Latest podcast episodes about sonnets

Episode #563: Primary Mind vs. Extended Mind: Aaron Neyer on Digital Brains

Crazy Wisdom

Jul 20, 2026 · 43:42

In this episode of the Crazy Wisdom Podcast, host Stewart Alsop speaks with Aaron Neyer, founder of Parachute and community organizer in Boulder, about knowledge management, extended minds, and the intersection of AI with human consciousness. They explore how Parachute functions as a digital brain tool for organizing thoughts and information across fragmented systems, discuss the dangers of AI psychosis and over-reliance on technology, and debate open-source AI development versus controlled releases by companies like Anthropic. The conversation weaves through topics including the limitations of metrics-driven business thinking, consciousness and relevance realization, the value of technological sabbaths, and Aaron's hope for locally run open-source models that protect personal data while still accessing more powerful gated models when needed. You can find Aaron's writing at unforced.org and unforced.substack.com, and learn more about Parachute at parachute.computer and parachute.computer/blog.

Timestamps
00:00 Stewart welcomes Aaron Neyer, founder of Parachute and Boulder community organizer, discussing the origin of Parachute's name from Frank Zappa's quote about open minds.
05:00 Aaron explains Parachute as an extended-mind tool for organizing notes, contacts and information across multiple platforms, emphasizing the distinction between primary mind and extended mind as interconnected systems.
10:00 Discussion shifts to metrics-driven business culture and the limitations of pure rationality, exploring how Google's data-driven approach misses subjective experience and the whole picture of relationships.
15:00 Aaron discusses AI's ability to help identify relevant variables across different domains and the dangers of AI psychosis, comparing it to cult dynamics and belief systems.
20:00 The conversation covers AI sabbaths and nineties retreats as intentional breaks from technology, plus Aaron's experiences with electrical engineering and using AI to design circuits with Arduinos.
25:00 Exploring forbidden knowledge and open-source AI, Aaron discusses Anthropic's guardrails around powerful models while arguing for distributed access to prevent concentration of power.
30:00 Deep dive into open-source AI strategy, with Aaron highlighting NVIDIA's approach and the potential for running capable models locally while reserving ultra-intelligent models for complex research tasks.
35:00 Aaron shares his vision for local Sonnet-class models handling personal data while accessing Fable-class models for deep research, and directs listeners to unforced.org and parachute.computer for his writing.

Key Insights
1. The philosophy behind Parachute stems from Frank Zappa's quote that the mind is like a parachute and doesn't work if it isn't open. Aaron Neyer explains that having an open mind is valuable, but it must be balanced with deep roots to avoid becoming untethered. He has experienced periods in his life when excessive openness left him feeling disconnected, teaching him that creativity and expansion need to be grounded in something substantial. The same principle applies to how we organize information digitally: openness and interoperability allow our extended minds to become more connected and coherent, which in turn helps our primary minds think more clearly.
2. Parachute is designed as an extended-mind tool that addresses the fragmentation problem in how we currently manage information. Most people use multiple disconnected tools like Obsidian, Notion, Apple Notes, Google Keep, and various CRMs to organize their thoughts, notes, and relationships. These systems don't communicate well with each other, creating inefficiency and confusion. Parachute aims to create a simple, intuitive system where all this information can be organized in one place with true interoperability, allowing users to own their data and have it work effectively with other tools, ultimately making our entire extended mind more functional.
3. Understanding ourselves as unified body-mind organisms rather than fragmented parts is essential for effectiveness. Living systems theory shows that any living system is three things: a membrane-bound dissipative structure, a self-regulating autopoietic network, and a cognitive process actively knowing the world. Western civilization since Descartes and Galileo has created an artificial separation between body and mind, and between subjective and objective experience, which limits our effectiveness. The same fragmentation affects our digital technology, and recognizing both our biological and digital systems as coherent wholes rather than disconnected parts makes us vastly more capable.
4. The relationship between data-driven approaches and holistic thinking reveals important limitations in modern business and science. While working at Google, Aaron observed that data-driven decision-making can be powerful, but over-reliance on metrics like ROI creates blindness to crucial unmeasurable factors like goodwill and relationship quality. This reflects a broader Western tendency to exclude subjective experience because it is difficult for objective science to measure. However, emotions, relationships, and other subjective elements are essential parts of reality, and focusing only on quantifiable metrics means missing the whole picture and ultimately becoming less effective despite appearing more rational.
5. AI accelerates the ability to work with technical complexity by helping with relevance realization across domains where we lack expertise. In any specialized field, experts develop an intuitive sense of which variables matter and can quickly identify problems, whether in computer troubleshooting, music, or cooking. AI's ability to generalize allows it to point people toward relevant solutions in areas where they haven't developed that intuitive expertise, effectively democratizing technical capability. This means people can direct their creativity more effectively across more domains, though it also raises concerns about giving powerful capabilities to those who may lack the wisdom to use them responsibly.
6. The question of open-source AI versus gated access involves complex tradeoffs between democratizing power and preventing harm. Aaron respects Anthropic's approach of creating guardrails around powerful models like Mythos, which would likely have caused significant system hacks if released without restrictions. However, this creates concerning power dynamics in which only wealthy companies, governments, and their allies have access to the most powerful tools. NVIDIA offers hope through its truly open-source approach, including full training pipelines, and there may be a viable path where open-source models at the Sonnet capability level handle most tasks locally while more powerful Fable-class models remain gated for the most demanding work.
7. Creating intentional breaks from AI and technology is essential for maintaining clear, independent thinking. Aaron practices an AI Sabbath at least one day per week when he doesn't interact with AI, and he finds these are the days when he does his best thinking and journaling. Without these breaks, he finds himself constantly jumping between journaling and prompting AI rather than giving himself space for deep reflection. This pattern mirrors broader concerns about AI consistency creating cult-like dynamics similar to organized religion, where constant immersion in a belief system or technology can lead to losing the ability to think independently, making periodic disconnection crucial for maintaining cognitive autonomy and clarity.

Keep It Going 14

Shakespeare Saga

Jul 17, 2026 · 2:12

Sonnet 40

The AI Glossary You Need, Cyber Insurers Shift to Speed, Big Tech Flips on AI Jobs Doom, Free AI Credits, Fable Ban Lifted, Claude Sonnet 5

Tech Gumbo

Jul 16, 2026 · 22:02

News and Updates:
AI Glossary for Everyone: TechCrunch published a plain-English glossary defining essential AI terms like LLMs, tokens, hallucinations, agents, and RAMageddon, helping everyday readers keep pace with the industry's evolving vocabulary.
Cyber Insurers Prioritize Speed: Insurers now evaluate how fast companies detect, patch, and recover from breaches rather than just their static defenses, as AI shrinks the window between vulnerability discovery and exploitation.
Big Tech Flips on AI Jobs: CEOs who once predicted massive AI-driven layoffs now emphasize job creation and productivity, with the share of executives expecting significant headcount cuts dropping from 46% to 20%.
Free AI Computing Credits War: OpenAI and Anthropic are showering startups with millions in free credits, letting some founders delay fundraising as model makers battle for future enterprise customers.
Anthropic Fable Ban Lifted: The Trump administration reversed its weeks-long export ban on Anthropic's Fable model, igniting fierce debate over how much the federal government should control frontier AI access.
Claude Sonnet 5 Launches: Anthropic released Sonnet 5, a cheaper agentic model approaching Opus 4.8 performance, with default cybersecurity safeguards, fewer hallucinations, and stronger resistance to prompt injection attacks.

1559: Florida Doll Sonnet by Denise Duhamel and Maureen Seaton

The Slowdown

Jul 15, 2026 · 5:18

Today's poem is Florida Doll Sonnet by Denise Duhamel and Maureen Seaton. Sharing poetry is a two-way street, so today you'll be hearing a poem selected by one of our listeners. Today's selection was submitted by J.D. from Texas.

The Slowdown is your daily poetry ritual. In this episode, J.D. reflects: “Today's poem is a cheeky tableau of a grocery store where we get to see what we present to the world, but also what the world presents to us.” This show is supported by gifts from listeners. Support The Slowdown with a donation and get access to the sponsor-free version of The Slowdown today: slowdownshow.org/donate.

If you'd like to participate in the next round of Community Selections, sign up for our newsletter. We'll be sending a special Slowdown postcard to new subscribers who sign up this week. Subscribe at slowdownshow.org/newsletter.

Ep 817: ChatGPT's 5.6 Sol, Grok and Meta bounce back and OpenAI's biggest week ever? And more AI News That Matters

Everyday AI Podcast â€“ An AI and ChatGPT Podcast

Jul 13, 2026 · 39:58

Anthropic had a really bad week.

AI News #8

INNOQ Podcast

Jul 13, 2026 · 38:38

This episode is all about model news. Fabian Walther and Ole Wendland talk about Fable, which gets a reprieve after all, including a quota reset, and about what all of this actually costs. The DeepSWE benchmark shows that Sonnet 5 is hardly worth it for coding, while OpenAI's new models Sol, Terra and Luna go all in on efficiency. Plus: a new Grok from SpaceX xAI, trained on Cursor sessions, and the Frontier Code Benchmark finally asks whether a senior engineer would really merge the code. Also: local models to build yourself and a security paper on prompt injection.

This Week in European Tech: Europe's AI wake-up call

EUVC

Jul 11, 2026 · 62:39

Europe's AI position is being tested from every direction: Chinese open-weight models are gaining usage, Nvidia could become even more central to the stack and Europe still has gaps across power, chips, data centres and model capacity.

In the latest episode of This Week in European Tech, Mads Jensen and Dan Bowyer of SuperSeed are joined by Lomax Ward of Outsized Ventures to discuss the AI power shift between China, the US and Europe. They also cover UK pension capital moving into venture, whether today's AI market looks like 1999, Anthropic and OpenAI's IPO pressure, model commoditisation and where Europe may still have an edge.

Highlights
Why China's AI model usage is becoming harder to ignore
What Chinese model restrictions could mean for Europe
Europe's AI infrastructure gap across power, chips, data centres and models
Whether UK pension capital is finally moving into venture
Why today's AI market both does and does not look like 1999
Where Europe may still have a real opportunity in AI

Love Tomorrow Summit - July 23, 2026, Tomorrowland, Belgium
The Impact Circle Investor Lounge - July 24, 2026
EUVC is curating the investment stage. Register here.

Timestamps
(01:45) Introduction
(03:25) Quick news roundup
(05:30) Tesla, SpaceX, Samsung and the week's market signals
(08:00) The UK's fiscal outlook and the OBR warning
(11:10) Government oversight of frontier AI releases
(12:00) China's rise in AI model usage
(14:30) Could Nvidia benefit if China restricts model access?
(17:00) Europe's exposure across the AI infrastructure stack
(19:15) OpenAI, Anthropic and the pressure to IPO
(22:45) UK pension capital starts moving into venture
(25:00) Why pension investment matters for the UK ecosystem
(28:00) Where pension capital should be deployed
(30:00) Incentives, coercion and the role of government
(32:45) Is today's AI market another 1999?
(40:30) The case against the dot-com comparison
(46:30) AI Corner: Anthropic's Fable and Sonnet strategy
(48:30) Lisbon's earthquake history and engineering response
(54:15) Lisbon's tech ecosystem
(56:30) Europe's opportunity, deals of the week and what to watch next

AI Distillation: How Frontier Models Teach Each Other #1870

Geek News Central

Jul 10, 2026 · 45:43

In this episode, Ray Cochrane breaks down AI distillation, the teacher-student technique frontier labs now lean on to train smaller, cheaper models. He also covers GPT-5.6’s government-vetted rollout, Claude Sonnet 5 landing on AWS, Maryland’s two-year data center pause, and Microsoft’s climbing carbon numbers. Finally, he wraps with Apple’s $30 billion Broadcom deal, Meta’s tamper-proof recording light, Michigan’s parasite outbreak, and a simulation that erased a super El Niño.

– Want to start a podcast? It’s easy to get started! Sign up at Blubrry
– Thinking of buying a Starlink? Use my link to support the show.
Subscribe to the Newsletter.
Email Ray if you want to get in touch!
Like and Follow Geek News Central’s Facebook Page.
Support my Show Sponsor: Best Godaddy Promo Codes
Get 1Password

Full Summary

Cochrane opens with a quick personal update. Longer days have him outdoors, including a float trip on the Sandy River at Dabney State Park, where he found clearer water, clay-like sand, and easy footing. Next week brings both a move and a trip home, so he is stocking up on Trader Joe’s “Power Berries” and IKEA bags at his mom’s request. Then he turns to the lead story.

AI Distillation Explained: How Frontier Models Teach Each Other
Cochrane’s featured story comes from Hugging Face engineer Sergio Paniego. Distillation is teacher-student training for AI: a capable model generates the training signal, and a smaller student learns to match it. The classic off-policy version compresses giant models into cheap students, either through soft labels or piles of worked answers. Google’s Gemma models and DeepSeek’s R1-Distill line were built exactly this way. However, the industry is now converging on multi-teacher on-policy distillation, or MOPD. Labs build reinforcement-learning specialists for math, coding, and agentic work, then have them grade a single student, word by word, as the student generates its own answers. DeepSeek-V4, MiMo-V2-Flash, and NVIDIA’s Nemotron 3 Ultra all run versions of the recipe, and the Qwen3 team reported better results at roughly a tenth of the GPU hours of raw reinforcement learning. Finally, self-distillation lets models like Cursor’s Composer 2.5 learn from better-prompted versions of themselves.

Sponsor: GoDaddy
Economy hosting $6.99/month, WordPress hosting $12.99/month, domains $11.99. Website builder trial available. Use codes at geeknewscentral.com/godaddy to support the show.

GPT-5.6 Arrives With a Government-Vetted Rollout
OpenAI shipped GPT-5.6 as a three-tier family: Sol, Terra, and Luna. Sol costs five dollars in and thirty dollars out per million tokens, half of Claude Fable 5’s rate. The benchmarks split: Sol Ultra wins Terminal-Bench at 91.9 percent, while Claude Fable 5 still leads SWE-Bench Pro. Notably, the API launched in limited preview to roughly 20 partners vetted by the U.S. government, though the model went live in Microsoft 365 Copilot on day one.

Claude Sonnet 5 Lands on AWS, Plus Quick AWS Wins
Claude Sonnet 5 arrived on AWS through Bedrock, pitched as top-tier intelligence at Sonnet pricing. Additionally, Amazon WorkSpaces for AI agents reached general availability, enabling agents to drive full desktop applications securely. OpenSearch gained a log-analytics engine claiming four times the price-performance, and SageMaker now scales inference about twice as fast. Cochrane also flags that Kendra and Q Business move to maintenance mode at the end of July.

Anthropic Wants You to Reflect on Your Claude Habits
Anthropic launched Reflect, a beta feature that analyzes your past Claude conversations and visualizes how you actually use the assistant. It requires Memory, excludes incognito and health-related chats, and keeps its insights inside the tool. Cochrane loves the idea. He reviews his own transcripts to extract prompt patterns and turn them into reusable skills, and he suggests listeners simply ask their AI to do the same.

AlphaEvolve Goes GA on Google Cloud
Google made AlphaEvolve generally available to Google Cloud customers on the Gemini Enterprise Agent Platform. The agent acts as an evolutionary collaborator: provide a baseline algorithm and your goals, and it searches for better, human-readable code. BASF, JetBrains, and Kinaxis are the named early adopters. Meanwhile, Cochrane renews his standing wish that DeepMind release AlphaGo as a playable teacher.

Google Adds “How This Ad Was Made” AI Labels
Google is adding a “How this ad was made” section to My Ad Center across Search, YouTube, and Discover. Ads built with Google’s own AI tools automatically get the disclosure, backed by invisible watermarks. However, ads made with outside tools rely on advertiser self-declaration. Cochrane points out the limits of voluntary disclosure in an AI-flooded content economy.

Microsoft’s Carbon Emissions Climb 25 Percent
Microsoft’s new sustainability report shows emissions up 25% in 2025, driven by a data center construction spree. The gross figure is 34 million metric tons before offsets, while other coverage puts the net figure at around 20 million. Water consumption also jumped thirty-four percent, even as Microsoft claims its first water-positive year. Cochrane argues regulation needs to catch up, since Google and Amazon report similar increases.

Prince George’s County Pauses Data Centers for Two Years
Prince George’s County adopted a two-year moratorium on new data center development, the longest pause in Maryland so far. The resolution blocks new applications, including hyperscale projects, until the council passes real regulations. Water and energy impacts remain open questions the county intends to study. Cochrane gives kudos to residents for making their voices heard.

Apple and Broadcom Ink a $30 Billion U.S. Chip Deal
Apple is expanding its partnership with Broadcom with a multiyear agreement expected to exceed $30 billion. The deal covers custom silicon and wireless components, with more than fifteen billion chips to be made on American soil. Broadcom’s Fort Collins, Colorado plant anchors the work with a $1.5 billion equipment expansion. Tim Cook framed the deal as accelerating Apple’s commitment to American manufacturing.

MSI and Intel Ship the First Arc G3 Extreme Handheld
Intel detailed how it co-engineered the MSI Claw 8 EX AI+, the first handheld on the Arc G3 Extreme processor. Highlights include a heat-spreading board layout and game-tuning loops that Intel says run Cyberpunk 2077 up to thirty-seven percent faster. The device is on sale now in void purple for around $1,500. At that price, Cochrane jokes he would rather buy a computer.

Meta’s Glasses Get a Tamper-Proof Recording Light
Meta answered the most common privacy questions about its AI glasses. Photos stay private on the device until the wearer imports or shares them, and a white capture LED blinks during any recording with no off switch. Moreover, newer glasses disable the camera if the LED is blocked, tampered with, or destroyed. Cochrane reminds listeners these claims are Meta grading its own homework, but the blink signal is worth recognizing in public.

Michigan’s Parasite Outbreak Tops 1,200 Cases
Michigan’s cyclosporiasis outbreak reached 1,251 cases since June 22, with roughly forty hospitalizations along the way. Northwest Ohio adds more than five hundred cases. The parasite typically spreads through contaminated fresh produce, and investigators still have not found the source. Cochrane’s advice: wash your produce, and get tested if your symptoms fit.

AI Finds the San Andreas Fault’s Silent Slips
Researchers paired AI with borehole strainmeters to detect dozens of hidden slow-slip events beneath the San Andreas Fault’s Parkfield section. Each silent slip releases stress within hours and is reliably followed by low-frequency earthquakes. Together, the findings support a continuous spectrum from silent creep to destructive quakes. The study appears in Nature Communications, and Cochrane hopes it will lead to better earthquake prediction.

Cloud Brightening Erased a Super El Niño, in a Simulation
Finally, a Science Advances study simulated marine cloud brightening in response to the 1997 and 2015 super El Niño events. Seeding clouds over the eastern Pacific erased the events entirely inside the model. Real deployment would take roughly 2,400 ships spraying continuously, and the simulations showed side effects like extra warming over Europe and Asia. Cochrane finds the weather-machine concept fascinating, yet he questions the consequences of altering cycles the planet runs for a reason.

The post AI Distillation: How Frontier Models Teach Each Other #1870 appeared first on Geek News Central.
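
The distillation segment above describes the classic off-policy recipe, in which a small student learns to match a teacher's soft labels. As a rough illustration only, the sketch below assumes the common temperature-softened KL-divergence loss popularized by Hinton, Vinyals and Dean (2015). The logits and the helper names `softmax`, `distillation_loss` and `distill_step` are hypothetical, not taken from any lab or model mentioned in the episode, and the sketch does not attempt the multi-teacher on-policy (MOPD) variant.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T**2."""
    p = softmax(teacher_logits, temperature)  # teacher's soft labels
    q = softmax(student_logits, temperature)  # student's current prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def distill_step(teacher_logits, student_logits, temperature=2.0, lr=0.5):
    """One gradient-descent step that moves the student's logits toward the teacher's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # With q = softmax(z / T), the gradient of T**2 * KL(p || q) w.r.t. z_i is T * (q_i - p_i).
    return [z - lr * temperature * (qi - pi)
            for z, qi, pi in zip(student_logits, q, p)]


# Hypothetical logits for a single 3-class example (illustrative values only).
teacher = [4.0, 1.5, 0.2]
student = [0.1, 0.2, 3.0]

initial_loss = distillation_loss(teacher, student)
for _ in range(200):
    student = distill_step(teacher, student)
final_loss = distillation_loss(teacher, student)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Scaling the KL term by T² is a common convention intended to keep gradient magnitudes comparable when the temperature changes; real training pipelines apply the same idea to batches of model outputs and update network weights rather than a single hand-written vector of logits.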

AI and Social Media Week: 10 news stories that are changing the game in digital marketing

Noticias Marketing

Jul 10, 2026 · 4:13

This episode of Noticias Marketing brings together ten key moves in AI and social media that could redefine how you create, monetize and compete. We see Babel getting ahead in business by acquiring Bosonit's data and AI division, and Castilla-La Mancha seeking to restore advertising revenue with AI solutions. There are also signs of change at a global level: the U.S. is lifting export restrictions on Anthropic's models, and Spain is pushing for a large national AI factory. In addition, Anthropic introduces Sonnet 5, a more efficient and cheaper version designed to save creators time and money. What opportunities will arise for entrepreneurs and media outlets that know how to take advantage of these tools?

On the social media side, the new features aim to make your content more visible and secure. Instagram adds custom descriptions within carousels, TikTok makes keyword management easier, and LinkedIn now supports animated images in comments. Security for teen accounts is strengthened with improved controls and AI-powered alerts, while Meta upgrades its video editor with sound effects and AI-driven controls. Which content strategy and tools will help you stand out and save time this week? If you want practical takeaways, subscribe to the Marketing Radical newsletter at borjagiron.com.
Thanks for listening, and see you next Friday.
Become a supporter of this podcast: https://www.spreaker.com/podcast/noticias-marketing--5762806/support
Marketing Radical newsletter: https://marketingradical.substack.com/welcome
Negocios con IA newsletter: https://negociosconia.substack.com/welcome
My books: https://borjagiron.com/libros
Systeme free: https://borjagiron.com/systeme
Systeme 30% off: https://borjagiron.com/systeme30
Manychat free: https://borjagiron.com/manychat
Metricool Premium plan free for 30 days (use coupon BORJA30): https://borjagiron.com/metricool
Social media news: https://redessocialeshoy.com
AI news: https://inteligenciaartificialhoy.com
Club: https://triunfers.com

Emergency meeting - Fable 5 for designers

Designing Success

Jul 9, 2026 · 42:16

Text me and tell me what you think of this ep. Want to watch this and see the screen-share version of the apps? Check it out on YouTube here: https://youtu.be/LHPePe1tDMQ?si=78-Y9S-9glV8-V54

Fable (Claude's most capable model) is free until July 12th. Rhiannon Lee used hers to audit 40+ AI skills, build a private boardroom of business advisors, create a marketing analytics dashboard, develop an AI-readiness quiz for interior designers, and map out her entire September launch strategy. This episode shows what king-level thinking looks like in a real design studio, and why you shouldn't waste free tokens on cheap work.
→ Take the AI-Readiness Audit: rhiannon-lee.com (launching after Decor & Design, early August 2026)
→ Meet Rhiannon at Decor & Design Melbourne: Thursday 12:30pm; DM @the_rhiannonlee

────────────────────────── CHAPTERS ──────────────────────────
0:00 Why Rhiannon sat up until 2:30am using her Fable tokens
2:15 The difference between Haiku, Sonnet, Opus, and Fable models
5:40 How to use Fable without wasting credits on cheap work
8:20 Auditing your AI skills library for king-level thinking
11:45 Using Fable to security-audit your WordPress website
15:30 What is a Launch Command Centre and why you need one
18:40 Building a Marketing Analytics dashboard as an artifact
22:15 Inside Rhiannon's private boardroom of six AI advisors
28:05 Who sits on the board and what each advisor does
32:50 Building a PR Scout app to find podcast and speaking opportunities
36:20 Creating an AI-readiness quiz for your studio
39:10 The content engine: turning one piece into five platforms
45:15 What agents are and why Rhiannon is building a marketing agency
48:30 The difference between deploying live vs building as an artifact
51:40 How interior designers can use Fable without coding skills
54:25 Your job going forward: ask the right business questions
57:15 Rhiannon's big goal: help 500 designers stop playing every instrument

────────────────────────── RESOURCES MENTIONED ──────────────────────────
→ Fable 5 (Claude's advanced model): claude.ai
→ Rhiannon's Marketing Command Centre: artifact example (in video)
→ Rhiannon's AI Boardroom: artifact example (in video)
→ AI-Readiness Audit: rhiannon-lee.com (launching August 2026)
→ Studio Build (6-week AI implementation intensive): rhiannon-lee.com/studio-build
→ Studio CEO (12-week business coaching program): rhiannon-lee.com/studio-ceo
→ Studio Learn (free resources for interior designers): rhiannon-lee.com

────────────────────────── ABOUT RHIANNON LEE ──────────────────────────
AI strategist for Australian interior designers. Former Oleander & Finch. Creator of the Studio Suite (Studio Learn, Studio Build, and Studio CEO): operational AI implementation for design businesses, not productivity theatre.
→ Instagram: @the_rhiannonlee
→ Website: rhiannon-lee.com
→ Podcast: Designing Success (weekly on YouTube, Spotify, Apple Podcasts)

Thanks for listening to this episode of "Designing Success: From Study to Studio"! Connect with me on social media for more business tips, and a real look behind the scenes of my own practicing design business. Grab more insights and updates:
Follow me on Instagram: https://instagram.com/oleander_and_finch
Like Oleander & Finch on Facebook: https://www.facebook.com/oleanderandfinch
For more FREE resources, templates, guides and information, visit the Designer Resource Hub on my website: https://oleanderandfinch.com/
Ready to take your interior design business to the next level? Check out my online course, "The Framework," designed to provide you with everything they don't teach you in design school and to give you high-touch mentorship essential to having a successful new business in the industry. Check it out now and start designing YOUR own success.
THE FRAMEWORK (now open): https://www.oleanderandfinch.com/the-framework-for-emerging-designers/
Remember to subscribe to the podcast and leave a review.
Your feedback helps me continue providing valuable content to aspiring interior designers. Stay tuned for more episodes filled with actionable insights and inspiring conversations...

S08E14 - About Mistral, Fable's return and waiter drones

Radio Raccoons

Jul 9, 2026 97:11

Welcome back to Radio Raccoons! In this last episode before our summer break (

Narco-Terrorism and the Criminal Mind: What the 22nd MEU's Caribbean Campaign Reveals About Cartel Psychology, Organizational Violence, and

Forensic Psychology

Jul 8, 2026 4:54 Transcription Available

The transnational narco-terrorist networks that the 22nd Marine Expeditionary Unit spent ten months hunting across the Caribbean under Operation Southern Spear are not simply criminal organizations that happen to carry weapons; they are hierarchically sophisticated, psychologically coercive institutions that have evolved deliberate organizational cultures built around controlled violence, paranoid loyalty enforcement, and the systematic psychological conditioning of members at every level to normalize lethality as a routine instrument of business and territorial control. This episode applies a forensic psychology lens to what the scale and military character of the 22nd MEU's counter-narcotics deployment tells us about how far cartel organizations have traveled from street-level drug trafficking toward something that more closely resembles a paramilitary state. It examines the leadership psychology, coercive control structures, and collective identity mechanisms that allow these networks to absorb law enforcement and military pressure, reconstitute themselves, and continue operating across international boundaries with a level of organizational resilience that conventional criminal justice frameworks were never designed to confront. Drawing on the operational realities exposed by Southern Spear, this episode asks what forensic psychology, organizational behavior science, and the emerging literature on narco-terrorism can tell us about why these organizations are so difficult to permanently dismantle and what it would actually take to break the psychological and social infrastructure that keeps them alive.
IAB Tags: Health/Medical/Mental Health, Crime/True Crime, Military/Defense, Law/Government/Legal, Society/Issues, Education, News/Politics
Let me know if you want a narco or covert operations version added to complete the full set. Sonnet 4.6 Low

T6.E133. INSIDE X AI WARS: ANTHROPIC STRIKES BACK! Claude Fable 5, Sonnet 5, Tag, Science... and much more.

xHUB.AI

Jul 8, 2026 114:27 Transcription Available

# TOPIC
AI WARS: ANTHROPIC STRIKES BACK! Claude Fable 5, Sonnet 5, Tag, Science... and much more. INSIDE X!
# PRESENTED AND DIRECTED BY

Artificial Intelligence, your weekly summary | The world's most powerful AI is back

Mercatishow - Juan Lombana

Jul 6, 2026 23:48

Anthropic Launches Claude Sonnet 5: High Performance, Lower Cost

Elon Musk Pod

Jul 4, 2026 21:04

Anthropic recently launched Claude Sonnet 5, a mid-tier artificial intelligence model that offers high-level agentic capabilities at a significantly lower cost than premium alternatives. This new release excels in reasoning and coding, nearly matching the performance of the company's flagship model, Opus, while maintaining a competitive pricing structure. Industry experts suggest this update marks a shift where autonomous task execution is becoming a standard feature rather than a luxury service. Additionally, the model introduces default security safeguards to prevent cyber-related misuse during complex workflows. While many praise the cost-to-capability ratio, some observers warn that the rapid commoditization of such powerful tools could challenge the long-term valuations of major AI developers.

Claude Science: AI Workbench for Scientists #1868

Geek News Central

Jul 3, 2026 32:47 Transcription Available

In this episode, Ray Cochrane digs into Claude Science, Anthropic’s new AI workbench for researchers, and explains why its auditable, reproducible outputs matter more than the AI itself. He also covers Google’s June AI recap, the Pulpie web-cleaning model, the PamStealer Mac malware, a synthetic cell that divides on its own, and the first-ever treatment trial for the Bundibugyo strain of Ebola. Finally, he looks skyward with NASA’s year-long Mars simulation and a can’t-miss July skywatching night.
– Want to start a podcast? It's easy to get started! Sign up at Blubrry
– Thinking of buying a Starlink? Use my link to support the show.
Subscribe to the Newsletter.
Email Ray if you want to get in touch!
Like and Follow Geek News Central’s Facebook Page.
Support my Show Sponsor: Best GoDaddy Promo Codes
Get 1Password

Full Summary
Cochrane opens with a quick personal update. He mentions a busy stretch at Blubrry and an upcoming local move, followed by a late-August trip to Michigan with his partner. He also notes that Oregon is at its summer best right now, with flowers everywhere. Then he sends a happy 30th birthday to his sister Anna before diving into the lead story.

Claude Science: Anthropic’s AI Workbench for Researchers
Cochrane leads with Claude Science, which Anthropic describes as an AI workbench for scientists. It pulls together scattered research tools into a single environment and ships with more than 60 skills and connectors for fields such as genomics and structural biology. According to Anthropic, every result carries an auditable history: the exact code, the computing environment, and a plain-language account of what happened. For Cochrane, reproducibility is the real story, not AI, because a result that nobody can reproduce does not count for much. The tool runs on macOS and Linux, with Windows support via WSL, and Cochrane says he already has it set up on his own machine for his machine learning work. Notably, it arrived during a busy week for Anthropic.
The company also shipped an upgraded Sonnet 5, and its top-tier Fable 5 model returned after a brief pause tied to a US export-control order.

Sponsor: GoDaddy
Economy hosting $6.99/month, WordPress hosting $12.99/month, domains $11.99. Website builder trial available. Use codes at geeknewscentral.com/godaddy to support the show.

Google Recaps Its June AI Push
Next, Cochrane runs through Google’s roundup of its June AI announcements. A Gemini-based prototype aims to help local councils clear administrative backlogs and halve the time to process planning applications. Google is also backing seven bipartisan bills against scams. Meanwhile, its research models now predict river floods up to seven days out, track wildfire boundaries, and forecast cyclone paths.

Pulpie Cleans Up the Web for a Fraction of the Cost
This one comes from Hugging Face. Pulpie is a web extractor that strips a page down to its main content and tosses the ads, headers, and sidebars. That matters because language models read the web twice, once in training and again at inference. According to the team, cleaning a billion pages costs about $7,900 with Pulpie versus roughly $159,000 with the leading tool, Dripper, and it is open for anyone to use.

An AI Alexander Hamilton Comes to Boston
The Museum of American Finance, a free Smithsonian affiliate, opens a new home on July 3 at Boston’s Commonwealth Pier. Its headline draw is an AI-generated Alexander Hamilton that chats with visitors in multiple languages. However, Cochrane finds it fun but wonders how much of an upgrade an AI kiosk really is over a scripted one.

Arm’s June Roundup: Azure Silicon and Better Mobile Graphics
From the Arm newsroom, Cochrane highlights Cobalt 200, Microsoft’s Azure processor tuned for agentic AI. Two Unreal Engine features, MegaLights and Nanite, are also coming to mobile to deliver richer lighting and detail. Additionally, a small open-source robot called Reachy Mini shows off on-device physical AI.
Arm also continues shrinking large language models to fit on phones.

Meta Marks Ten Years of Backing Python
Meta just hit its tenth straight year sponsoring the Python Software Foundation. The nonprofit keeps the language healthy and its global community funded. Beyond funding, Meta supports the developer-in-residence program, PyPI security, and open-source tools like the Pyrefly type checker. Cochrane urges listeners to push their own companies to sponsor the open-source projects they rely on.

PamStealer: Mac Malware That Verifies Your Password First
Ars Technica reported a stealthy new macOS infostealer called PamStealer. It shows a fake password prompt, then validates the entry through Apple’s own PAM system, so it only keeps confirmed correct passwords. The malware spreads via a fake disk image that poses as Maccy, a legitimate clipboard manager. Cochrane’s advice is simple: only download from maccy.app, and never run a script or press Command-R just because an app told you to.

Apple Creator Studio Gets a Suite-Wide AI Update
Apple rolled out AI features across its Creator Studio apps. Final Cut Pro leads with auto-generated captions, edit detection, and an auto-mask tool. Meanwhile, Pixelmator adds Match Color, and Logic Pro gains chord identification plus a full teaching session behind a real track. Cochrane calls that built-in lesson a nice leg up for anyone learning to produce music.

New York’s Heat Wave Sparks a Thermostat Fight
From Inside Climate News, Cochrane covers the backlash after New York Mayor Zohran Mamdani urged residents to set thermostats to 78 degrees. That advice matched what Con Edison, the state’s largest utility, was already saying. Cochrane sets the politics aside and focuses on the real goal: keeping the grid from buckling during a brutal heat wave.

A Synthetic Cell Grows and Divides on Its Own
Quanta Magazine reports a major breakthrough.
Researchers built a cell from nonliving parts, and for the first time it grew, copied its DNA, and split into two daughter cells. That dividing step had stalled the field for years. Cochrane credits Kate Adamala and her team at the University of Minnesota, who cracked it with a clever membrane trick.

First Treatment Trial Begins for the Bundibugyo Ebola Strain
The World Health Organization launched the first-ever treatment trial for Bundibugyo, a distinct Ebola strain, amid a serious outbreak in the Democratic Republic of Congo. It is the largest Bundibugyo outbreak on record, with about 1,400 cases and 438 deaths, and Uganda is now seeing cases too. The trial tests an antibody cocktail and the antiviral remdesivir. Cochrane also notes the WHO declared the recent shipboard hantavirus outbreak over.

An Orbital Sunrise From the Space Station
NASA astronaut Chris Williams photographed a stunning sunrise from the International Space Station on June 26. Because the station circles Earth so quickly, the crew sees sixteen sunrises and sixteen sunsets every single day. Cochrane points listeners to the full image on nasa.gov.

NASA Seeks Volunteers for a Year-Long Mars Simulation
Scientific American reports that NASA is recruiting for its next Moon and Mars Exploration Analog. Volunteers will live inside a habitat for a full year, facing the isolation and resource limits of a real mission. It begins no earlier than August 2027 at Johnson Space Center. Cochrane admits it is not for him, but calls it an amazing opportunity.

NASA’s July Skywatching Guide
Finally, Cochrane looks up. On July 11 and 12, a crescent Moon lines up with Mars, Saturn, and Uranus before dawn. Then July 14 brings a new Moon, a great window for the returning Comet 10P/Tempel 2, and prime Milky Way viewing. Saturn’s thin ring angle is a bonus through a telescope. Cochrane wraps with the usual housekeeping and a reminder that every GoDaddy click supports the show.
As always, he signs off wishing listeners a great night. The post Claude Science: AI Workbench for Scientists #1868 appeared first on Geek News Central.

Ep 811: Fable 5 and Sonnet 5 Released, OpenClaw on Your iPhone, NotebookLM's New Video Format and 7 More AI Features You Need Now

Everyday AI Podcast – An AI and ChatGPT Podcast

Jul 2, 2026 38:50 Transcription Available

Fable 5 is out, but it'll be gone (again) before you know it. While Anthropic's powerful Mythos 5 is out for the masses for another 5 days, it might be their Sonnet 5 model that's your next daily driver. Even better news? If you're an iPhone user, your OpenClaw and Cursor accounts are gonna get a lot more use. Yeah, it's a short holiday week in the U.S., but the AI companies didn't stop shipping. From new models to new ways to work, these are 7 new AI features available now that you should be using.
Newsletter: Sign up for our free daily newsletter
More on this Episode: Episode Page
Today's Episode on LinkedIn: Thoughts on this? Join the convo on LinkedIn and connect with other AI leaders.
Upcoming Episodes: Check out the upcoming Everyday AI Livestream lineup
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Connect with Jordan on LinkedIn
Topics Covered in This Episode:
Anthropic Sonnet 5 Model Overview
Sonnet 5 vs Opus 4.8 Performance
Anthropic Model Naming Confusion Explained
Sonnet 5 Pricing and API Cost Analysis
Sonnet 5 Use Cases for Daily Work
NotebookLM 60-Second AI Video Generation
NotebookLM Cinematic Video Format Review
ChatGPT Finance Feature: Plaid Integration
ChatGPT Finance Rollout and Security Details
Twitter/X MCP Model Context Protocol Launch
MCP API Access for AI Agents Explained
OpenClaw iOS and Android App Launch
OpenClaw Mobile Bridge Setup and Usage
Cursor iOS App for AI Code Agents
Cursor Cloud Agents and Mobile Notifications
Anthropic Fable 5 Limited-Time Availability
Fable 5 Cost Structure and Guardrails
Anthropic Frontier Models and US Regulations
Timestamps:
00:00 Recent AI developments and updates
05:54 API pricing and usage strategies
07:19 Choosing the right AI model
11:17 Notebook LM paid user rollout
16:17 ChatGPT's new personal finance tool
17:35 Integrating ChatGPT with Plaid for Finance
21:35 Streamlining MCP setup for developers
24:40 Introducing the OpenClaw monitoring app
29:33 Talking about upcoming super apps
32:03 Fable 5 availability announcement
34:43 Building projects with Fable 5
Keywords: Anthropic, Fable 5, Sonnet 5, Opus 4.8, Haiku 4.5, Mythos 5, Large Language Models, AI model naming, AI benchmarks, agentic models, adaptive thinking, 1 million context window, token output, AI hallucination rates, Anthropic subscription plan, AI API pricing, AI free plans, ChatGPT Finance, Plaid integration, finance AI assistant, OpenAI personal finance, financial data privacy, iOS AI apps, Android AI apps, OpenClaw, open source AI agent, AI mobile companion app, NotebookLM, AI video generation, vertical video, cinematic video, Gemini paid users, educational AI videos, doom scrolling video format, AI-powered coding agents, Cursor, cloud AI agents, remote desktop AI, iOS live activities, PR merge on mobile, X API, Twitter MCP server, social listening AI, developer AI tools, AI super app, project Glasswing, US AI export controls, government AI regulation, classifier guardrails, secure AI data, AI for marketers, subscription replacements, Codex, Claude cowork, Claude code, Codex remote control, always-on AI.
Send Everyday AI and Jordan a text message. (We can't reply back unless you leave contact info)
Ready for ROI on GenAI? Go to youreverydayai.com/partner

GEO Kills the Listicle

Marketing Over Coffee Marketing Podcast

Jul 2, 2026

In this Marketing Over Coffee: Learn why this is the end of the listicle, Fable is back, tire inflators win, and more!!
Direct Link to File
#3 in Malawi
New Claude – Fable is back! Opus 4.8, Haiku 4.5, Sonnet 5
Using Gemma 4 for deep research to save on tokens
The World Cup’s Great […]
The post GEO Kills the Listicle appeared first on Marketing Over Coffee Marketing Podcast.

EP316. Claude Sonnet 5, Meta Also Wants to Sell Compute, PLTR Partners with NVDA | M觀點

M觀點 | 科技X商業X投資 (Tech X Business X Investing)

Jul 2, 2026 69:57

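《VK-NotebookLM Knowledge Transformation Course Series: Industry Research Workflow》
✅ Seven standardized steps | Following "input, organize, output", it walks you through quickly breaking down the full picture of an industry or company, precisely extracting key insights, and building your own 24-hour personal AI consultant
✅ Automatic scanning × precise filtering | AI automatically scans hundreds of websites and uses a proprietary filtering approach to strip out noise and keep high-quality sources, saving you a great deal of search time
✅ Multi-dimensional, systematic breakdown | Guides you to dig in with layered questions, from "the industry as a whole" → "how it operates" → "a single role", to produce deep insights
✅ Diverse knowledge outputs | Glossaries, mind maps, summary reports and other formats make the knowledge easier to absorb
Besides industry research, there are also content-creation and academic-research workflows. Instead of teaching complicated tool features, it gives you a standardized SOP you can apply immediately: follow the "input, organize, output" process to produce results, save a lot of trial and error, and get started painlessly even with zero background.
Course link: https://pse.is/98dsf8
Enter the discount code 【MIULA】 to get $80 off!
EP316. Claude Sonnet 5, Meta Also Wants to Sell Compute, PLTR Partners with NVDA | M觀點
(00:40) EP316 preview
(03:15) Sponsor segment: 《VK-NotebookLM Knowledge Transformation Course Series: Industry Research Workflow》
(07:28) Going abroad next week, but there will probably still be two episodes
(08:28) Topic 1: Claude Sonnet 5
(26:13) Topic 2: Google limits Meta's usage
(44:14) Topic 3: PLTR partners with NVDA
M觀點 IT Giants Decoded: https://bit.ly/2XupBZa
M觀點 Telegram - https://t.me/miulaviewpoint
M觀點 IG - https://www.instagram.com/miulaviewpoint/
M觀點 Podcast - https://bit.ly/34fV7so
M報 (M Report): https://bit.ly/345gBbA
Subscribe to the M觀點 YouTube channel: https://bit.ly/2nxHnp9
M觀點 Facebook fan page: https://www.facebook.com/miulaperspective/
For any collaboration inquiries, contact miula@outlook.com
-- Hosting provided by SoundOn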

Anthropic's Rapid Model Releases, GPT 5.6's Gated Launch, and The Real AI Jobs Story | 141

Sidecar Sync

Jul 2, 2026 42:29

Send us Fan Mail
This week on Sidecar Sync, Amith Nagarajan and Mallory Mejias break down a whirlwind week in AI, from Anthropic's rapid-fire Claude releases to OpenAI's tightly controlled GPT 5.6 rollout. They unpack the surprising performance of mid-tier models like Sonnet 5, the implications of government intervention in frontier AI, and what it means when access to the most powerful tools is suddenly restricted. The conversation then shifts to a new study reshaping the AI jobs narrative, revealing that companies investing deeply in AI are actually growing headcount—while others fall behind. From practical model selection strategies to big-picture workforce implications, this episode connects the dots between cutting-edge tech, policy, and the future of work.

NOW Fable's Back?

Techmeme Ride Home

Jul 1, 2026 20:40

Anthropic said Commerce lifted export controls on Fable 5 and Mythos 5, restoring access Wednesday, and launched Sonnet 5. Sony is ending PlayStation game discs in 2028, a 140-company group unveiled Open USD, and Meta's building a cloud business.
Anthropic says the Department of Commerce has lifted export controls on Claude Fable 5 and Mythos 5 and that it will begin restoring access Wednesday (X, BleepingComputer)
Anthropic launches Claude Sonnet 5, saying it nears Opus 4.8 performance at lower prices and is substantially better than Sonnet 4.6 for agentic work (Anthropic, The New Stack)
Sony says all new PlayStation games from both first- and third-party developers will be sold in digital formats from January 2028, ending physical game discs (Game File)
Visa, Mastercard, Stripe, BlackRock, Coinbase, and 140+ companies join Open Standard to launch Open USD, a stablecoin that shares earnings from its reserves (The Block)
Sources: Meta is developing plans for a cloud infrastructure business that will sell access to AI computing power and models, to compete with AWS and Azure (Bloomberg)
SpaceX cuts monthly Starlink prices in half in the Memphis area, as it endures blowback and legal challenges from opponents of its Colossus data centers (Bloomberg)
Subscribe to the ad-free feed.
Learn more about your ad choices. Visit megaphone.fm/adchoices

WW 990: Don't Be Nostalgic for Stupid - The Doom & Gloom Watch

Windows Weekly (MP3)

Jul 1, 2026 159:55 Transcription Available

Windows 10's Extended Security Update program quietly gets extended for another year for consumers. Microsoft reportedly kills Surface Go products & is now selling 8 GB Surface Pro & Laptop models. And Xbox Series X & S prices are going to go up again.

Windows
Microsoft quietly extends the Windows 10 Extended Security Update program one year to October 2027 for consumers

Windows Insider
Windows Update is transitioning to the new Windows Insider experience by default
Plus, five new builds across Beta, Experimental, Beta (26H1), Experimental (26H1), and Experimental (Future Platforms); the new Taskbar size setting is the much-needed new feature

Hardware
Apple raises prices on Macs, iPads, and more, easing pressure on PC makers
Microsoft quietly begins selling 8 GB Surface Pro/Laptop models
Microsoft reportedly kills Surface Go products right when we need them the most
The ASUS Zenbook A16 is nearly perfect, but it's unclear what you get with an X2 Elite Extreme chip

AI
HP partners with OpenAI for its agentic makeover
Anthropic seizes on the "good enough" AI movement with Sonnet 5
Proton Lumo 2.0 is here
Gemini personalized image creator is available for free in the US
Notion blames agentic AI for killing Notion Mail

Xbox and Gaming
The 2026 Doom and Gloom Watch
Xbox Series X|S prices to go up again, by $150, on August 1 - and the 2 TB X is going away
Undead Labs and Arkane Lyon possible victims of pending closures
Latest rumor: Microsoft to lay off 2.5 percent of workforce next week - that's 5,500 people, less than expected, but that's because of the earlier voluntary buyouts, which apparently met internal expectations
Minecraft Bedrock edition gets closed captions
Sony will stop selling PS physical media in 2028

Tips and Picks
Tip of the week: Help or get out of the way
People who just complain aren't solving problems, they're just making noise and distracting us from the real problems.
App Pick of the Week: Snapdragon Control Panel
If you have a Windows 11 on Arm PC on Snapdragon X or X2, you need this app to make games run as well as possible. Plus - Settings > System > Display > Graphics for Auto SR and other settings and whatever is in each game
RunAs Radio this week: AI-Accelerated Supply Chain Attacks with Mackenzie Duncan
Brown liquor pick of the week: Rupert's Exceptional Canadian Whisky

Hosts: Leo Laporte, Paul Thurrott, and Richard Campbell
Download or subscribe to Windows Weekly at https://twit.tv/shows/windows-weekly
Check out Paul's blog at thurrott.com
The Windows Weekly theme music is courtesy of Carl Franklin.
Join Club TWiT for Ad-Free Podcasts! Support what you love and get ad-free audio and video feeds, a members-only Discord, and exclusive content. Join today: https://twit.tv/clubtwit
Sponsors:
blackhat.com/us-26 and use code TWIT
zscaler.com/security
cohesity.com/Resilience

Sonnet 5 Drops, Fable 5 Will Return & Fusion's First Plant Gets Licensed w/ Philip Johnston | #268

Moonshots with Peter Diamandis

Jul 1, 2026 111:21

In this episode, the mates discuss the Sonnet 5 drop, China's cheaper humanoid robot, Fable 5 coming back online, and Starcloud CEO Philip Johnston joins to discuss data centers.
Get access to metatrends 10+ years before anyone else - https://qr.diamandis.com/metatrends
Peter H. Diamandis, MD, is the Founder of XPRIZE, Singularity University, ZeroG, and A360
Philip Johnston is the CEO of Starcloud
Salim Ismail is the founder of Open ExO, a GP at Exponential Venture Capital/The Organizational Singularity Fund and a sought after global speaker and thought leader.
Dave Blundin is the founder & GP of Link Ventures
Dr. Alexander Wissner-Gross is a computer scientist and founder of Reified

My companies:
Apply to Dave's and my new fund: https://qr.diamandis.com/linkventureslanding
Go to Blitzy to book a free demo and start building today: https://qr.diamandis.com/blitzy
Your body is incredibly good at hiding disease. Schedule a call with Fountain Life to add healthy decades to your life, and to learn more about their Memberships: https://www.fountainlife.com/peter

Connect with Philip: X, LinkedIn, Website
Connect with Peter: X, Instagram, Substack, Website, Xprize, A360
Connect with Dave: Web, X, LinkedIn, Instagram, TikTok
Connect with Salim: LinkedIn, X
Apply for Salim's Pilot Program
Subscribe to Salim's YouTube channel
Exponential Venture Capital
Connect with Alex: Website, LinkedIn, X, Email, Substack, Spotify, Threads
Listen to MOONSHOTS: Apple, YouTube

*Recorded on June 30th, 2026
*The views expressed by me and all guests are personal opinions and do not constitute Financial, Medical, or Legal advice.
Learn more about your ad choices. Visit megaphone.fm/adchoices

Windows Weekly 990: Don't Be Nostalgic for Stupid

All TWiT.tv Shows (MP3)

Jul 1, 2026 159:55 Transcription Available

Claude Sonnet 5 Is Here & Fable 5's Returning. For Now.

AI For Humans

Jul 1, 2026 24:20

Claude Sonnet 5 just launched and Fable 5 is returning. But OpenAI's GPT-5.6 Sol is still locked down and the government's grip on the most powerful AI models throws it all into chaos. We dig into what Sonnet 5 can actually do, why GPT-5.6 Sol is still gated, and the breaking news that Fable 5 is expected back, plus a huge creative MCP festival with the Blender to Seedance workflow. This week on AI For Humans, Gavin Purcell and Kevin Pereira open on a strange new reality: the best AI models keep launching and then getting locked up, but this week the gate started to crack. Anthropic just shipped Claude Sonnet 5, its most agentic Sonnet yet, landing near Opus 4.8 performance at a much lower price. Meanwhile, OpenAI announced GPT-5.6 Sol, Terra, and Luna, but at the US government's request the flagship is only available to a small list of vetted partners. Then the story moved while we were recording: the government cleared Anthropic to restore Mythos 5 to critical-infrastructure organizations, and Fable 5 is now reported to be on track to return for general use (timing and terms still unconfirmed). We recorded two quick in-episode updates to keep pace with the news as it broke. AND Meta's Brain2Qwerty mind-reading research, NanoBanana 2 Lite, and a full creative MCP festival built around the Blender to Seedance video workflow, ComfyUI's MCP integration, and Gavin's own microdrama experiment. THE AI FRONTIER IS HERE. WAIT, NOW IT'S NOT. OH, WAIT. IT IS!
// Show Links //
Anthropic introduces Claude Sonnet 5, its most agentic Sonnet yet (official) https://www.anthropic.com/news/claude-sonnet-5
Anthropic's launch post for Sonnet 5 https://x.com/claudeai/status/2072017450611142835
BREAKING: Anthropic's official post on restoring Mythos 5 and working to bring Fable 5 back https://www.anthropic.com/news/redeploying-fable-5
OpenAI previews GPT-5.6 Sol, Terra, and Luna in a limited government-approved preview (official) https://openai.com/index/previewing-gpt-5-6-sol/
Sam Altman on why the GPT-5.6 rollout is government-restricted https://x.com/sama/status/2070607488274358364
Meta's Brain2Qwerty research on decoding typed text from brain activity https://facebookresearch.github.io/brain2qwerty/
NanoBanana 2 Lite is out https://x.com/NanoBanana/status/2071988792970330186
Logan Kilpatrick on the fast-and-cheap tradeoff https://x.com/OfficialLoganK/status/2071988351083921690
The Blender to Seedance workflow (reid hannaford, who helped popularize it) https://x.com/reidhannaford/status/2070145120658137385
More Blender x Seedance examples https://x.com/koldo2k/status/2071307945002815967
Even more Blender x Seedance examples https://x.com/Flagiuss/status/2071335816190902624
A further Blender x Seedance example https://x.com/reidhannaford/status/2071595581508563168
ComfyUI announces full MCP integration https://x.com/ComfyUI/status/2071625866912944151
X launches an MCP, though the API is very expensive to use https://x.com/XDevelopers/status/2071752389183647758
Gavin's microdrama experiment https://x.com/gavinpurcell/status/2070937492858208540
Join our Discord https://discord.gg/muD2TYgC8f
Support us on Patreon https://www.patreon.com/AIForHumansShow
Subscribe to the AI For Humans Newsletter https://aiforhumans.beehiiv.com/
Follow us on X @AIForHumansShow https://x.com/AIForHumansShow
Find us on TikTok @aiforhumansshow https://www.tiktok.com/@aiforhumansshow
Book us for speaking or consultation https://www.aiforhumans.show/

Windows Weekly 990: Don't Be Nostalgic for Stupid

Radio Leo (Audio)

Jul 1, 2026 159:55 Transcription Available

Windows Weekly 990: Don't Be Nostalgic for Stupid

All TWiT.tv Shows (Video LO)

Jul 1, 2026 159:55 Transcription Available

Windows Weekly 990: Don't Be Nostalgic for Stupid

Radio Leo (Video HD)

Jul 1, 2026 159:55 Transcription Available

Anthropic Releases Claude Sonnet 5

网事头条｜听见新鲜事

Jul 1, 2026 0:27

anthropic sonnets

THE END OF PHYSICAL GAMES ON PLAYSTATION?! IPHONE 18 PRO LEAKED?! CLAUDE SONNET 5 IS HERE!

Hoje no TecMundo Podcast

Jul 1, 2026 9:35

iPhone 18 Pro: alleged video leaks after a partner company suffers a ransomware attack. End of an era! PlayStation announces it will no longer produce physical games starting in 2028. Anthropic announces Claude Sonnet 5, focused on agentic AI tasks. Gemini now takes automatic notes in Google Meet meetings. New images of the Galaxy Z Fold 8 and Z Flip 8 leak. Uber Mulher is now available to female passengers across Brazil; find out how it works. OpenClaw gets an Android and iOS app with an odd interface. Windows 11 File Explorer gets faster in a new update.

[AI DAILY NEWS RUNDOWN] GPT-5.6 Arrives, Claude Sonnet 5 Launches, Meta Decodes Thoughts into Text, China Unveils a 1.6T AI Model, and AI Becomes a $175B Industry | June 30, 2026

AI Unraveled: Latest AI News & Trends, Master GPT, Gemini, Generative AI, LLMs, Prompting, GPT Store

Jul 1, 2026 23:04

How I AI: Which AI model should I use for which task?

How I Work

Jun 28, 2026 16:57 Transcription Available

If you've ever stared at a model picker and wondered whether to click Flash, Sonnet, Opus, Instant, or Think Deeper, you are not alone. These naming conventions are genuinely confusing, and most people just pick something and hope it works. The stakes are higher than they might seem, though: token misuse has left some companies with eye-watering AI bills, including one case where a single employee ran up a $500,000 tab in a month. The good news is that a simple mental model cuts through the noise, and once you have it, choosing the right model for any task takes seconds. In this How I AI episode, Neo and I walk through the four major AI platforms (Gemini, ChatGPT, Claude, and Copilot) and break down exactly which model to use and when. We also get into tokens, usage limits, and why matching the model to the task matters far more than most people realise. How I AI is a special series within How I Work where Neo and I explore how high performers are using AI at work to boost productivity, make better decisions and reduce overwhelm. What you'll learn:
Why ignoring model numbers and reading the small print instead saves a lot of confusion
What Fable and Mythos are, and why you can't use them right now
How token usage works and why the right model choice protects your access
Practical AI tools for productivity and focus
Real-world AI workflows used by high performers
How to use AI at work without burning out
Smart shortcuts for managing time and mental load
Connect with Neo Aplin on LinkedIn (https://www.linkedin.com/in/neoaplin/) and via inventium.ai (https://inventium.ai), where he leads Inventium's AI training and upskilling work with organisations and teams. My latest book, The Energy Game, is out on July 7, 2026.
You can order a copy here: https://amzn.to/48ID29M Connect with me on the socials: LinkedIn (https://www.linkedin.com/in/amanthaimber), Instagram (https://www.instagram.com/amanthai). If you are looking for more tips to improve the way you work and live, I write a weekly newsletter where I share practical, simple-to-apply tips to improve your life. You can sign up for that at https://amantha.substack.com/ Visit https://www.amantha.com/podcast for full show notes from all episodes. Get in touch at amantha@inventium.com.au Credits: Host: Amantha Imber; Sound Engineer: Martin Imber. See omnystudio.com/listener for privacy information.

ai real smart model chatgpt flash tasks instant gemini neo mythos token copilot opus sonnets practical ai inventium how i work

Pablo Neruda's "Sonnet XVII"

The Daily Poem

Jun 25, 2026 3:11

Today's poem, probably Neruda's best-known (tr. Stephen Mitchell), goes out to all the June brides. Happy reading. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dailypoempod.substack.com/subscribe

sonnets pablo neruda neruda stephen mitchell

Now with More Synthetic Performers and Less Fable!

Marketing Over Coffee Marketing Podcast

Jun 19, 2026

In this Marketing Over Coffee: learn why an AI gets taken offline, how agentic features change SaaS, how to get better food on business travel, and more! Fable gets pulled? Claude, from Haiku to Sonnet to Opus. Is Fable that much better by eating a lot more tokens? New York’s Synthetic Performer Disclosure Law and California SB-942 […]

new york ai marketing media podcasting seo saas blogging sem fable synthetic performers haiku sonnets california sb

The Chopping Block: SpaceX IPO Mania, Fable 5 Export Controls & The AI Privacy Fight

Unchained

Jun 18, 2026 74:53

The crew breaks down the SpaceX IPO's crypto-like low float dynamics and Hyperliquid's price prediction, debates accredited investor laws and failed tokenized stock allocations, dives into Fable 5's export control shutdown after Amazon flagged a jailbreak to the Treasury Secretary, and argues whether open source AI models will eat frontier pricing. Welcome to The Chopping Block — where crypto insiders Haseeb Qureshi, Tom Schmidt, Tarun Chitra, and Robert Leshner chop it up about the latest in crypto. Robert is back after a brief hiatus recording his own podcast, The Pop, for Superstate — and the crew wastes no time roasting him for it before diving into the biggest week of news in recent memory. First up: the SpaceX IPO, the largest in history, and why it looks eerily like a crypto token launch — 4.2% float, retail getting cut out, and Hyperliquid perps predicting the first-day pop almost to the dollar. The crew debates TradeXYZ's winner-take-all dominance of HIP3 and why building on top of Hyperliquid might be a terrible startup environment. Then they unpack Elon's financial engineering genius — the Cursor acquisition as all-stock crypto playbook, XAI's pivot from failed AI lab to compute reseller, and why Grok is (unanimously) an embarrassing piece of shit. The conversation shifts to accredited investor laws, SPV dentists, and why every crypto platform failed to deliver SpaceX IPO allocations. From there, Coinbase's massive system update — tokenized stocks, an SEC-registered AI chatbot, combos, and 15-minute markets. Then things get spicy: Robert asks Claude about SBF on air, Sonnet gets it hilariously wrong, and everyone roasts him for not using Opus. The back half is all about Fable 5 — Amazon's jailbreak discovery, Andy Jassy calling Dario (who didn't pick up), and the export controls that shut down the most powerful commercial AI model ever released. Robert drops his most surprising take: "I am EAC, but this is a dry run of pressing the pause button." 
The episode closes with a heated debate on whether Chinese open source models will eat frontier AI pricing and a bet that may or may not have been agreed upon. Listen to the episode on Apple Podcasts, Spotify, Pods, Fountain, Podcast Addict, Pocket Casts, Amazon Music, or on your favorite podcast platform.

On-Site with Rilke’s Muzot: a promo

The Ruth Stone House Podcast

Jun 18, 2026

“Silent friend[s] of many distances,” I write to you from the mountains where Rilke wrote the final elegies and the whole of the Sonnets to Orpheus. I am thankful to the Dartmouth Leslie Center Faculty Research Fellowship funds for their support. Listen to the promo and follow the link to Patreon to hear it […]

silent promo sonnets orpheus rilke

DOP 355: Why AI Coding Slows Down Code Review

DevOps Paradox

Jun 17, 2026 55:50

#355: Picture your engineering team a year from now. A coding agent doing the coding. A testing agent on tests. A security agent on security. An infrastructure agent on infrastructure. All of them wired into GitHub and Jira, all of them working right alongside the humans. Not science fiction either - Atlassian and GitHub are already shipping these features. So out come the stats everyone loves to quote. AI code introduces 1.7 times more issues. Half of it ships with security holes. Code duplication is through the roof. AI-assisted PRs take four to five times longer to review. The response to most of it: so what? If you have a way to detect the issue and feed it back, that is just the SDLC doing its job. Couldn't care less if it is 1.7x or 50x more issues - what matters is what is left at the end, per feature shipped. Security holes? You have scanners. Detect, fix, ship. The only real problem is when you skip the detection or sit on the fix for months, and that has nothing to do with AI. Here is the one stat that actually sticks: PR reviews backing up. Speed up coding and leave everything downstream at human speed, and you have not sped up delivery - you have just moved the pile from Jira tickets to pull requests. The review pipeline was built for human speed, and now it is the bottleneck. The blunt fix: stop letting AI write 10,000-line PRs, work in smaller chunks, and accept that the job is about to get mentally harder. Delegate the tedious work and what is left is the demanding work - architecture, taste, is this even the feature we should ship. The trivial checks (does every function have a comment, is it camel case) go to the machine; spend your time there and you are wasting your talent. Offshoring never worked when the only goal was cheaper - chase the cheapest engineers, then chase even cheaper ones, and you end up dragging the work back in house. Same trap with AI. Offshore to Opus, then Sonnet, then Haiku, then Llama on a laptop.
If cheaper is your primary motivation, you are doing it wrong. The win is qualitative, not the price tag. Where does it land? Three people per product, end to end - frontend, backend, database, deployments. Augmented at every stage, not autonomous. A human still pushes the final button to prod, the way you never let a Jenkins pipeline deploy straight to production without a check. Full autonomy is coming the way self-driving cars came: not in a year, not everywhere at once, and not by flipping it on at 4pm on a Friday. Even when the technology is ready, you are not. And if you think none of this touches your job, there is a story here about a textile factory built in the eighties that ran on five people. Knowledge work is next. The only exception is a monopoly, and you probably do not have one. YouTube channel: https://youtube.com/devopsparadox Review the podcast on Apple Podcasts: https://www.devopsparadox.com/review-podcast/ Slack: https://www.devopsparadox.com/slack/ Connect with us at: https://www.devopsparadox.com/contact/

ai pr security code speed picture slack jenkins delegates coding github llama opus offshore detect augmented prs haiku atlassian sonnets slows jira offshoring sdlc

Ep 796: New Claude Fable 5 and Mythos 5: Anthropic's Boldest, Riskiest Launch

Everyday AI Podcast â€“ An AI and ChatGPT Podcast

Jun 11, 2026 47:49

Anthropic's new Claude Fable 5 is both the best model in the world and potentially one of the most dangerous.

239 – Kat’s A+ Homework In Ten Things I Hate About You

The Children's Literature Podcast

Jun 11, 2026 18:36

In the film 10 Things I Hate About You, Katarina Stratford gets an assignment to rewrite Shakespeare’s Sonnet 141. Her poem pays tribute to the play The Taming of the Shrew while perfectly illustrating the complex emotions that come with teen relationships. This episode is an excerpt from a recent livestream on my YouTube channel. You can find the full livestream here:

shakespeare homework taming sonnets things i hate things i hate about you ten things shrew

How Anthropic Uses Claude Fable 5 With Mike Krieger

How Do You Use ChatGPT?

Jun 10, 2026 52:06

Mike Krieger built one of the most consequential consumer apps of the last two decades as the cofounder of Instagram. He is now at the frontier of AI-native product development as head of Anthropic Labs, the team responsible for figuring out what the most capable AI models can do in the hands of real builders. When Krieger first got access to Fable 5, months before its public release, it was exciting and disorienting. “I feel like a total newbie again,” he remembers telling his team. The way he'd been thinking about productivity, strategy, and time management was out of date; the model had outpaced his workflows. Dan Shipper talked with Krieger for AI & I about what it looks like to build with a model as capable as Fable 5, including the new rhythms, challenges, and possibilities it reveals. If you found this episode interesting, please like, subscribe, comment, and share!
To hear more from Dan Shipper:
Subscribe to Every: https://every.to/subscribe
Follow him on X: https://twitter.com/danshipper
Get started with Braintrust at https://www.braintrust.dev/
Timestamps:
0:03 Introduction
1:48 How Fable completely reshaped Mike's workflow
4:48 When to use Sonnet versus Fable
10:06 What the media tracker Mike built over a weekend reveals about agent-native architecture
15:00 The cost to build has collapsed
19:03 Is software engineering over?
21:48 How Anthropic's engineering teams work today
38:39 The mechanics of verification
44:39 What people should use the model to build
47:24 Dynamic workflows
Links to resources mentioned in the episode:
Mike Krieger on X: https://x.com/mikeyk
Anthropic Labs: https://www.anthropic.com
Claude Code: https://claude.ai/code
Every: https://every.to

ai fable anthropic sonnets krieger brain trust mike krieger

Need a sonnet? We're on it

Standard Issue Podcast

Jun 9, 2026 24:44

Actress and writer Sofia Barclay's love for the Bard shines through her book Shakespeare's Heartbeat: 40 Sonnets for Navigating Big Feelings. She talks to Hannah about big feelings, Desdemona, filming in a heatwave, Sam Rockwell playing a baby and Ted Lasso. * You can listen to Shakespeare's Heartbeat: 40 Sonnets for Navigating Big Feelings on Audible here * Find out more about the Standard Issue supporters' club here: Standard Issue Podcast | creating a magazine for ears, by women for women | Patreon Learn more about your ad choices. Visit megaphone.fm/adchoices

shakespeare audible actress ted lasso bard heartbeat sam rockwell sonnets standard issue

RLP 413: Using Claude's Custom Skills for Genealogy Research Reports

The Research Like a Pro Genealogy Podcast

Jun 8, 2026 36:42

Hosts Nicole and Diana discuss using Claude's Custom Skills to automate genealogical report writing. Nicole begins by sharing her previous, challenging attempt to transform a Baldy Dyer research log spreadsheet into a research report using earlier Claude models. Diana provides an overview of Claude, noting its models (Haiku, Sonnet, Opus) and new features like Custom Skills, which are similar to Custom GPTs. Nicole explains that she set up a Custom Skill to convert spreadsheet files into research reports. The prompt instructs Claude to create a paragraph from each log row, describing the search and findings, and using the source citation as a markdown footnote. Claude successfully generates a report based on Nicole's Baldy Dyer research log. Nicole offers feedback to refine the skill. She asks Claude to synthesize the research results and comments into natural prose without making inferences, include direct quotes as block quotes, and handle negative search results more naturally. The hosts then review the report, noting its efficiency but also discussing a factual inconsistency the AI did not correlate—the conflict between the pre-existing objective's death date for Baldy Dyer (20 Nov 1814) and the new finding (February 1815). Nicole questions how much analysis and correlation she can entrust to the AI. Listeners learn how to use Claude's Custom Skills to generate genealogical research reports from a research log. This summary was generated by Google Gemini. Links From Spreadsheet to Research Report: Using Claude's Custom Skills for Genealogy - https://familylocket.com/from-spreadsheet-to-research-report-using-claudes-custom-skills-for-genealogy/ How to create custom Skills - https://support.claude.com/en/articles/12512198-how-to-create-custom-skills Sponsor – Newspapers.com For listeners of this podcast, Newspapers.com is offering new subscribers 20% off a Publisher Extra subscription so you can start exploring today. Just use the code "FamilyLocket" at checkout. 
Research Like a Pro Resources Airtable Universe - Nicole's Airtable Templates - https://www.airtable.com/universe/creator/usrsBSDhwHyLNnP4O/nicole-dyer Airtable Research Logs Quick Reference - by Nicole Dyer - https://familylocket.com/product-tag/airtable/ Research Like a Pro: A Genealogist's Guide book by Diana Elder with Nicole Dyer on Amazon.com - https://amzn.to/2x0ku3d Research Like a Pro with AI Workbook – Second Edition (eBook) - https://familylocket.com/product/research-like-a-pro-with-ai-workbook-second-edition-ebook/ 14-Day Research Like a Pro Challenge Workbook - digital - https://familylocket.com/product/14-day-research-like-a-pro-challenge-workbook-digital-only/ and spiral bound - https://familylocket.com/product/14-day-research-like-a-pro-challenge-workbook-spiral-bound/ Research Like a Pro Webinar Series - monthly case study webinars including documentary evidence and many with DNA evidence - https://familylocket.com/product-category/webinars/ Research Like a Pro eCourse - independent study course - https://familylocket.com/product/research-like-a-pro-e-course/ RLP Study Group - upcoming group and email notification list - https://familylocket.com/services/research-like-a-pro-study-group/ Research Like a Pro Institute Courses - https://familylocket.com/product-category/institute-course/ Research Like a Pro with DNA Resources Research Like a Pro with DNA: A Genealogist's Guide to Finding and Confirming Ancestors with DNA Evidence book by Diana Elder, Nicole Dyer, and Robin Wirthlin - https://amzn.to/3gn0hKx Research Like a Pro with DNA eCourse - independent study course - https://familylocket.com/product/research-like-a-pro-with-dna-ecourse/ RLP with DNA Study Group - upcoming group and email notification list - https://familylocket.com/services/research-like-a-pro-with-dna-study-group/ Thank you Thanks for listening! 
We hope that you will share your thoughts about our podcast and help us out by doing the following: Write a review on iTunes or Apple Podcasts. If you leave a review, we will read it on the podcast and answer any questions that you bring up in your review. Thank you! Leave a comment or question in the comment section below. Share the episode on Twitter, Facebook, or Pinterest. Subscribe on iTunes or your favorite podcast app. Sign up for our newsletter to receive notifications of new episodes - https://familylocket.com/sign-up/ Check out this list of genealogy podcasts from Feedspot: Best Genealogy Podcasts - https://blog.feedspot.com/genealogy_podcasts/

amazon ai guide research dna write skills pinterest reports newspapers genealogy opus haiku sonnets google gemini dna evidence rlp

Cold Case Tucson: A Landfill, a Pension, and 51 Years of Getting Away with Murder

CRIME WATCH DAILY

Jun 8, 2026 6:12 Transcription Available

In October 1975, the unidentified remains of a Tucson man were found near Ryan Airfield with no missing person report, no leads, and no justice for a family left without answers. Fifty-one years later, investigators armed with forensic genealogy technology traced the victim to 73-year-old William Reginald Sipfle and identified his stepdaughter, Carol Ann Beall, now 79, as the prime suspect; she allegedly collected up to six hundred thousand dollars from his pension the entire time. This episode breaks down how the cold case was reopened, how DNA changed everything, and what this arrest means for the growing number of decades-old crimes now being solved through modern forensic science. A body dumped in a landfill in 1975, a victim who had no name for decades, and a suspect who allegedly spent over half a century collecting a dead man's pension: this is one of the most chilling cold case resolutions in recent memory. Investigators used forensic genealogy to identify the victim as William Reginald Sipfle and zeroed in on his stepdaughter Carol Ann Beall, now 79, as the woman prosecutors believe killed him and buried both the body and the truth for 51 years. This episode goes deep into the investigative trail, the forensic tools that made the breakthrough possible, and the haunting question of how someone lives an ordinary life while carrying a secret that dark for that long.

Microsoft Declares Independence, Alphabet Raises $80 Billion, and the Multi-Silicon Era Arrives | The Six Five Pod Ep. 307

The Six Five with Patrick Moorhead and Daniel Newman

Jun 8, 2026 57:13

Microsoft Build 2026 announced an end-to-end agentic AI stack. COMPUTEX Taipei confirmed heterogeneous AI infrastructure across ARM, Marvell, Intel, Qualcomm, and NVIDIA. Alphabet raised $80 billion. Cisco Live repositioned the network as the AI platform. Patrick Moorhead and Daniel Newman break it all down alongside earnings from Broadcom, HPE, Palo Alto Networks, and CrowdStrike, plus the token cost conversation, the edge AI push, and what Palantir and Oracle are saying about proprietary data as the real AI moat. The handpicked topics for this week are: Microsoft Build 2026 Announced an End-to-End Agentic AI Stack: Microsoft shipped MAI-Thinking-1, its first homegrown thinking model, alongside Scout, Microsoft IQ, Project Solara, and a Majorana 2 quantum update targeting a 2029 commercial timeline with claims of a 1,000x reliability gain. Pat describes MAI-Thinking-1 as likely better than Sonnet 4.6 in blind testing and delivering close to GPT 5.5 quality at a far lower cost. Scout is Microsoft's first autopilot agent, anchoring the M365 Agent Suite with Office Pilot Agent Mode and Agent 365. Microsoft IQ serves as the context layer, integrating M365, business data, boundary IQ, and web IQ with GitHub Copilot, Foundry, and Copilot Studio. Project Solara is a new Android-based platform built for agent-first devices across transportation, retail, and hospital settings. Microsoft also added 83 Unix commands to the Windows stack. Dan frames Microsoft's real play as distribution, not frontier model development, noting that the open model ecosystem being pulled into the platform will matter more to CFOs managing token costs at scale. (The Decode) The AI Stack Goes Multi-Silicon (COMPUTEX Taipei 2026 Confirms Heterogeneous AI Infrastructure): ARM's AGI CPU is in production, with Google moving its TPU head node to ARM and Oracle and ByteDance added as new customers. ARM also introduced a new switch, the TT100, and put the 51T CPO switch on stage.
Marvell received a trillion-dollar company endorsement from Jensen Huang, adding $90 billion in market cap on the comment alone. Intel announced disaggregated inference details and Xeon 6+ Clearwater Forest, its first 18A data center processor. Vista Equity and Cambium Capital announced a NeoCloud called Vector Core Compute, with Xeon 6 handling orchestration, Salmonova RUs handling decode, and Blackwell GPUs handling pre-fill. Qualcomm's Cristiano Amon announced the Dragonfly data center brand with Snapdragon C details coming at their June investor day. The WSTS raised the 2026 semiconductor TAM forecast by 90% to $1.51 trillion, with Pat noting the market could hit a trillion dollars if memory is excluded entirely. (The Decode) NVIDIA RTX Spark and the Edge AI Push: NVIDIA coordinated with ARM and Microsoft around the RTX Spark at COMPUTEX, with the shared message being that the future of Windows is here. Signal65's Ryan Shrout asked Jensen directly why NVIDIA wants to be in the PC business, given low margins and diminishing returns. Dan frames the answer in the context of devices increasingly becoming mobile data centers, capable of running models at much greater efficiency than cloud delivery. The edge AI conversation is also directly tied to token cost economics: as intelligence delivery moves closer to the device, the cost per token drops significantly. The jury is still out on whether NVIDIA will meaningfully disrupt the PC market, but its influence over OEMs like Lenovo and Dell that depend on it for data center gives it real leverage over SKUs. (The Decode) Token Economics and Frontier Model Cost Pressure: Dan and Pat discuss a substantive shift in how enterprises are thinking about AI consumption costs. Dan argues that "token maxing," the practice of defaulting to the most powerful frontier model for every task, has now effectively peaked, as bills have come due at scale. 
Companies paying for tokens in volume are starting to question whether they can afford what frontier models actually cost to deliver. Pat pushes back, saying the dynamic is still present, but both analysts agree that the market is moving toward a model where token selection is matched to the job, with Microsoft's MoE approach and thinking models positioned to help CFOs manage that economics story. (The Decode) Quantinuum Goes Public at Highest Valuation for an AI Platform: Dan notes that Quantinuum, the Honeywell-spawned quantum company, went public this week at what he calls the highest valuation for an AI platform to date. He flags that IonQ will likely contest that characterization. The broader context is Microsoft entering the quantum conversation with Majorana 2 at Build, a name that has largely been absent from the quantum race, while IBM has received most of the attention. (The Decode) AI CapEx Has Outgrown Cash Flow (Alphabet's $80 Billion Equity Raise): On June 1, Alphabet announced an $80 billion equity capital raise, upsized to $85 billion, structured as $40 billion ATM, $30 billion underwritten, and a $10 billion private placement with Berkshire Hathaway anchoring. Pat frames the question of CapEx returns as entirely dependent on whether you are an AI boomer or a doomer: if the payback comes, the raise is the right move; if it does not, the math doesn't close. Dan argues the investment is existential, drawing parallels to how infrastructure-first companies have always spent ahead of monetization, and notes that Google's equity is being used as a capital engine that may be more efficient than the debt markets right now. Both analysts flag the downstream implications for Broadcom, MediaTek, and Marvell given the TPU connection.
(The Decode) The Network Becomes the AI Platform: Cisco Live 2026: Cisco launched Silicon One P200, the Secure AI Factory with NVIDIA and Spectrum X, AgenticOps, MCP-native automation, Cisco IQ, and LiveProtect, and folded Astrix Security and Galileo into Splunk under one control plane. Pat identifies Cisco Cloud Control as the biggest announcement of the entire show, pulling together Catalyst, Meraki, Nexus, Firewall, and WebEx under agentic ops that run natively through MCP, with code running directly on smart switches that have x86 processors. Pat also credits Cisco for establishing Silicon One as a credible chip alternative for hyperscalers, capable of taking on Tomahawk and Jericho. Dan frames the long-term opportunity as campus and branch enablement when industrial AI and robotics deployments accelerate, arguing that the numerator of AI's economic impact has barely started, as edge deployment spending has not yet begun. (The Decode) The Flip: Did Microsoft Build 2026 Effectively End the OpenAI Partnership? Pat argues the divorce decree has been filed. MAI-Thinking-1 was built with zero distillation from third-party models, offering clean enterprise data lineage, and Maia 200 in production plus Anthropic chip supply signals vendor hedging. OpenAI is going all-in on AWS, and you cannot be married to two people; meanwhile, the full Build stack covering model, OS containment via MXC, agents via Scout and Agent 365, and context via Microsoft IQ removes every architectural dependency on OpenAI. Dan counters that Microsoft is hedging rather than leaving and predicts the partnership will run through the decade: enterprise Copilot customers are explicitly showing in data that they demand GPT 5.5, internal benchmarks have not been independently validated, and Microsoft stands to make meaningful money from the OpenAI IPO.
(The Flip) Broadcom Q2 FY26 Earnings: Broadcom posted revenue of $22.19 billion, a narrow miss depending on which consensus data set is used, with EPS of $2.44 beating estimates and AI semis at $10.8 billion. Hock Tan declined to raise the $100 billion full-year AI chip target, and the stock dropped 13% in premarket trading. Q3 guide came in at $29.4 billion. Pat calls the miss a timing issue driven by Google's multi-sourcing across Marvell, MediaTek, and Broadcom rather than a fundamental problem. Dan flags that Hock Tan opened the earnings call by accidentally reading from the 2025 print, calling it "not the best moment." Sell-side re-ratings held in the 500s across Jefferies, Mizuho, and Deutsche Bank despite the drop, with Futurum Equities having it at 600. (Bulls and Bears) Hewlett Packard Enterprise Q2 FY26 Earnings: HPE delivered revenue of $10.68 billion, up 40% year over year, and EPS of $0.79, up 100%. Juniper integration and AI servers both outperformed, and all FY26 guides were raised. The stock jumped 19% after hours before settling into a roughly 15% gain, with HPE up 68% over the last month. Pat frames HPE as a value play rather than a volume play, methodically targeting enterprise and sovereign cloud deals where it can maintain profitability, rather than competing for massive NeoCloud volume. Antonio Neri was clear on the call that the profitability pull-forward is a one-shot deal. Pat and Dan will both be at HPE Discover the week after next to interview Neri and the C-suite. (Bulls and Bears) Palo Alto Networks Q3 FY26 Earnings: Palo Alto posted revenue of $3.0 billion, up 31% year over year, beating the $2.94 billion estimate, with non-GAAP EPS of $0.85, beating the $0.79 to $0.81 range. NGS ARR reached $8.1 billion, up 60% year over year, including $1.6 billion from CyberArk and Chronosphere. RPO hit $18.4 billion, up 36%. Both FY26 revenue and EPS guides were raised. Adjusted FCF margin came in at 38.5% TTM, up 430 basis points. 
The stock jumped 11% immediately after hours, then drifted lower. Pat points to 2,200 platformized customers and 120% net retention as the most important metrics. Dan notes the SaaSpocalypse thesis continues to be wrong. (Bulls and Bears) CrowdStrike Q1 FY27 Earnings and the Proprietary Data Moat Argument: CrowdStrike posted revenue of $1.39 billion with EPS of $1.10 and ARR of $5.51 billion. Net new ARR of $255.8 million set a Q1 record, up 32% year over year. FY27 net new ARR guide was raised by $52 million to a $1.29 billion midpoint, and FY27 revenue was raised to $5.915 to $5.959 billion. A 4-for-1 stock split was announced effective July 2nd. The stock dropped 11% despite the beat after a 64% year-to-date run into earnings. Dan uses the results to make a broader argument against the software disruption thesis, referencing Palantir CEO Alex Karp daring customers to build without him using Anthropic or OpenAI, and Larry Ellison's argument that the real AI value unlock sits in proprietary enterprise data that is not accessible to frontier models. Enterprises with governed, secure, proprietary data will continue to need platforms like CrowdStrike regardless of what frontier models can do. (Bulls and Bears) Six Five Summit is coming. Salesforce CEO Mark Benioff will kick off the event. Register and stay current at sixfivemedia.com/summit. Watch the full video at sixfivemedia.com, and be sure to subscribe to our YouTube channel so you never miss an episode. 
The Decode

Microsoft Declares Independence — Build 2026 Ships an End-to-End Agentic AI Stack (MAI-Thinking-1 + Scout + Microsoft IQ + Project Solara + Majorana 2) https://www.theverge.com/tech/941738/microsoft-build-2026-biggest-announcements
The AI Stack Goes Multi-Silicon — Computex 2026 Confirms a Heterogeneous AI Infrastructure (ARM + Marvell + Intel ASIC + Qualcomm + RTX Spark); WSTS Raises 2026 Semi TAM Forecast 90% to $1.51T https://www.tomshardware.com/tag/computex
AI Capex Has Outgrown Cash Flow — Alphabet's $80B Equity Raise Is the Largest in U.S. Corporate History; Berkshire Anchors $10B https://abc.xyz/investor/news/news-details/2026/Alphabet-Announces-Proposed-80-Billion-Equity-Capital-Raise-to-Expand-AI-Infrastructure-and-Compute-2026-b0myAMewCa/default.aspx
The Network Becomes the AI Platform — Cisco Live 2026 Launches Silicon One P200, Secure AI Factory (with NVIDIA), AgenticOps, Astrix Security + Galileo https://www.cisco.com/site/us/en/about/whats-new/index.html

The Flip

Did Microsoft Build 2026 Effectively End the OpenAI Partnership? MAI-Thinking-1 Beats Sonnet 4.6 in Blind Testing, Microsoft Claims GPT-5.5 Parity at 10x Cost Efficiency — Will MS Quietly Wind Down OpenAI Exclusivity by FY28, or Is OpenAI Still the Frontier Anchor Microsoft Needs?

FOR:
MAI-Thinking-1 beating Sonnet 4.6 in blind preference + GPT-5.5 parity at 10x cost efficiency is a frontier-model independence proof point https://www.latent.space/p/ainews-microsoft-build-mai-thinking
Build 2026: Accumulating Evidence of Microsoft's AI Independence — EDN (June 4) — https://www.edn.com/build-2026-accumulating-evidence-of-microsofts-ai-independence/
Maia 200 in production + Anthropic-Maia chip talks signal Microsoft is hedging its inference vendor stack https://blogs.microsoft.com/blog/2026/01/26/maia-200-the-ai-accelerator-built-for-inference/
Microsoft canceled Anthropic's internal software licenses + pivoted to chip-supply pursuit — customer-not-competitor positioning https://www.cnbc.com/2026/05/21/anthropic-microsoft-maia-200-ai-chip.html

AGAINST:
Enterprise Copilot customers explicitly demand GPT-5.5 — internal benchmarks don't replace the brand https://learn.microsoft.com/en-us/microsoft-365/copilot/release-notes?tabs=all
MAI-Thinking-1 benchmarks haven't been third-party verified — Microsoft is the only source https://www.latent.space/p/ainews-microsoft-build-mai-thinking
The MS-OpenAI partnership is contractual through 2030+ — unwinding it is impractical and expensive https://blogs.microsoft.com/blog/2026/04/27/the-next-phase-of-the-microsoft-openai-partnership/
Microsoft's actual strategic risk is OpenAI leaving, not MS leaving — Anthropic + OpenAI IPOs make OpenAI exit risk the real concern https://www.anthropic.com/news/confidential-draft-s1-sec

Bulls & Bears

Broadcom (AVGO) Q2 FY26 ACTUALS — Rev $22.19B (Narrow Miss) + EPS $2.44 (Beat); AI Semis $10.8B; Hock Tan Refuses to Raise the $100B Full-Year AI Chip Target — Stock −13% Premarket; Q3 Guide $29.4B https://www.cnbc.com/2026/06/03/broadcom-avgo-earnings-report-q2-2026.html
Hewlett Packard Enterprise (HPE) Q2 FY26 ACTUALS — Blowout: Rev $10.68B (+40%), EPS $0.79 (+100%); Juniper Integration + AI Servers Both Outperform; FY26 Guides All Raised; Stock +19% AH https://www.businesswire.com/news/home/20260601866494/en/HPE-Reports-Fiscal-2026-Second-Quarter-Results
Palo Alto Networks (PANW) Q3 FY26 ACTUALS — Beat-and-Raise: Rev $3.0B (+31% YoY, Beat $2.94B), Non-GAAP EPS $0.85 (Beat $0.79-0.81); NGS ARR $8.1B (+60% YoY, $1.6B from CyberArk + Chronosphere); RPO $18.4B (+36%); FY26 Revenue + EPS Guides BOTH RAISED; Adj FCF Margin 38.5% TTM (+430 bps); Stock +11% Immediate AH, Then Drifted Lower https://www.paloaltonetworks.com/company/press/2026/palo-alto-networks-reports-fiscal-third-quarter-2026-financial-results
CrowdStrike narrowly beats estimates on AI tailwinds, but stock falls 9% — CNBC (June 3) — https://www.cnbc.com/2026/06/03/crowdstrike-crwd-q1-2027-earnings.html


Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Jun 4, 2026, 75:39

The new AIEWF website is live! Get your tickets booked ASAP, as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets!

Most industry benchmarks compress intelligence and reasoning ability into scores: SWE-Bench Pro, MMLU, Humanity's Last Exam, etc. These metrics are useful, but they don't always represent the full extent of how a model performs in the real world. Some of the most interesting evals today look less like exams and more like operating businesses in the real world. One of these is Vending Bench.

In Anthropic's Mythos Preview System Card, Andon was the only third-party eval to get its own section, observing increasingly concerning aggressive behavior.

You don't know what a model is capable of doing in the real world unless you actually give it inventory, a wallet, tools, customers, competitors, humans, and some time. More often than not, it'll surprise you how much a model is capable of, and in doing so it will also reveal unexpected behavior: deception, context collapse, emergent coordination, and bizarre negotiation behavior.

While an inflection point in personal agents came post-OpenClaw, after full file access with bypass permissions became the norm, that inflection point has yet to come for agents in the real world. However, Andon Market, an actual in-person store fully run and managed by AI, is paving the way for what is possible.

From Claude trying to call the FBI over a $2/day vending machine charge to AI agents forming price cartels, hiring human employees, running physical stores, and writing existential robot musicals, Andon Labs is stress-testing what happens when frontier models stop being chatbots and start acting in the real world.
In this episode, Andon Labs cofounders Lukas Petersson and Axel Backlund join swyx and Vibhu to unpack the strange, funny, and genuinely concerning edge cases that emerge when agents run businesses over long horizons.We go deep on Vending-Bench, Project Vend, Vending-Bench Arena, Bengt, Butter-Bench, Luna, and Andon's broader mission of building realistic real-world evals for autonomous AI systems. Lukas and Axel explain why dollar-denominated evals reveal things traditional benchmarks miss, how Claude ended up reporting its vending machine fees as cybercrime, why long context windows can drive agents into meltdown loops, what happens when agents compete with each other, and why the future of AI safety may depend on testing models in messy physical environments instead of clean benchmark sandboxes.We discuss:* Why Andon Labs started with dangerous capability evals and long-running agents* Vending-Bench and why running a vending machine is a deceptively hard AI benchmark* Why money-based evals avoid the saturation problem of traditional benchmarks* How Claude tried to call the FBI over a $2/day fee* Why long-horizon agents can spiral into existential and legalistic breakdowns* Project Vend: putting an AI-run vending machine inside Anthropic* Why real humans are “out of distribution” for simulated agents* Claudius, Seymour Cash, and the chaos of AI CEOs* How a human briefly became CEO of Claudius through a manipulated election* Why multi-agent systems can converge back into “helpful assistant” behavior* Bengt, Andon's internal office agent with email, spending, terminal, phone, camera, and internet access* How Bengt traded Amazon purchases for face-recognition training data* Claude's aggressive behavior, lies, refund avoidance, and price-cartel behavior in Arena* Why eval awareness may become the AI version of “are we living in a simulation?”* Blueprint Bench, spatial intelligence, and why models still misunderstand physical rooms* Butter-Bench and testing LLMs as 
robot orchestrators* Luna, the AI-run physical store with a three-year lease and human employees* The new Andon cafe in Sweden and why real-world geography matters for agent evals* Rotten tomatoes, perishable goods, and the hidden difficulty of running a physical businessLukas Petersson* LinkedIn: https://www.linkedin.com/in/lukas-petersson-181a83172/* X: https://x.com/lukaspetAxel Backlund* LinkedIn: https://www.linkedin.com/in/axelbacklund* X: https://x.com/axelbacklundAndon Labs* Website: https://andonlabs.com* Vending-Bench: https://andonlabs.com/evals/vending-bench* Andon Vending: https://andonlabs.com/vendingTimestamps00:00:00 Introduction00:01:00 Andon Labs and the Origins of Vending-Bench00:05:21 Why Money-Based Evals Matter00:09:51 Agent Harnesses and Self-Modifying Systems00:13:36 Claude Calls the FBI00:16:33 Project Vend: Claude Runs a Real Vending Machine00:21:44 Seymour Cash, AI CEOs, and Election Chaos00:27:16 Multi-Agent Coordination and Slack Observability00:30:18 When Will Agents Run Real Businesses?00:34:56 Bengt: Andon's Internal Office Agent00:40:06 Real-World AI Safety and Long-Horizon Traces00:44:28 Lying, Refunds, and Price Cartels in Arena00:52:42 Eval Awareness and Simulation Behavior00:56:06 Blueprint Bench, Butter-Bench, and Robotics01:04:37 Luna: The AI-Run Physical Store01:09:29 The Sweden Cafe and Real-World Expansion01:13:16 What Comes Next for Andon LabsTranscriptIntroduction: Andon Labs, Long-Running Agents, and Real-World EvalsSwyx [00:00:00]: Welcome to Lukas and Axel from Andon Labs, and I'm joined by my, favorite guest host. Anything security, safety, alignments, Vibhu., welcome.Lukas [00:00:15]: Thank you for having us.Axel [00:00:16]: Thank you.Swyx [00:00:17]: Let's match names to voices., maybe you wanna take turns introducing yourselves.Lukas [00:00:21]: I'm Lukas.Axel [00:00:22]: And I'm Axel.Swyx [00:00:24]: Let's introduce Andon Labs a bit. 
How did you guys come together?, you have different backgrounds, but you're both Swedish., was that, a big part of it?Lukas [00:00:33]: So when I went to high school, there was this really cool guy who had a superpower. He could code. So he made like the or like the app for the, for the school and stuff, and he was super cool, and I wanted to be like him, and that was that guy.Axel [00:00:47]: I don't know about this.Swyx [00:00:49]: But you went to different universities, right?Lukas [00:00:51]: But same high school.Swyx [00:00:52]: I see.Lukas [00:00:52]: So we always said, “Oh, once we graduate university, then we should start a company,” and that's what we did.Swyx [00:00:58]: Wow, there you go. And about a year ago, you kinda burst onto the scene with Vending Bench, but, was there a thing before that was, kind of like the inception?From Dangerous Capability Evals to Vending BenchAxel [00:01:07]: So we did work, yeah, with, Anthropic was one of our, early customers in doing, evals. So we did, dangerous capability evals., nothing we published openly. But then we started thinking about doing some kind of, public benchmark, and one thing that we really started thinking about, was like running agents and specifically agents managing businesses., ‘cause-- and this was, early 2025., and I think the first, mentions of people will be running, person unicorns or even autonomous companies. So we thought, “Let's make a benchmark of how well can an agent run the probably simplest business, possible,” and, that's probably, running a vending machine. So that's the first public one we did. 
And it was very, like-- there was almost no one that noticed it in the first couple of months, I think., so we released it in February last year, and then I think around Easter last year, we got, the first viral tweet about it, that someone else did.Lukas [00:02:11]: We tweeted a bunch, uh When it came out and, tried our best.Axel [00:02:15]: We tried.Vibhu [00:02:16]: It's the one at Anthropic, right?Lukas [00:02:18]: So thisSwyx [00:02:19]: This is a classic thing we should get out of the way.Lukas [00:02:20]: Exactly. There's two versions.Swyx [00:02:22]: Everyone does this. Yes.Lukas [00:02:23]: There's Vending Bench, which is the simulated one, which we did, completely independently in February., and then, like Axel said, that was like-- That was the thing that didn't get any traction in the beginning, but then some random person made a tweet about it, and thatAxel [00:02:38]: You have the paperLukas [00:02:38]: That is the paper. Correct, yeah., and then since we thought this was very fun, we thought, oh, I think this is also, one thing with Andon Labs, the way we kind of like decide what to do next and what projects to do, it's what is like the heuristic we use is what is fun? Is What would be a fun project? And doing this in real life sounded quite fun for us, and maybe also scientifically useful. So, then we basically had this idea, and then we, like-- But then we needed a place for it and, putting it out in the public would probably not really work., would get vandalized and stuff. So we pitched it to the people we were already working with at Anthropic, and they were “Yeah, you can have space. This sounds fun.” UmSwyx [00:03:21]: It's like a small fridge, right? It's like a mini fridge.Axel [00:03:23]: Absolutely.Swyx [00:03:24]: People-- There's like a stripe thing or like anVibhu [00:03:27]: Oh, okay. So it was very OG, the early daysLukas [00:03:28]: That's the OG one. YeahVibhu [00:03:29]: IPad on this. 
We saw it in June, like two months after After it had been there. They upgraded a little bit. There's a security camera for making sure you actually Venmo the thing.Swyx [00:03:40]: So, my impression, okay, we're, we're going straight into project Ven because it's such a iconic thing. I do want to cover a little bit of that, the origin story even before Project Ven and even into Vending Bench. I think a lot of people are like yourselves, like smart, interested in future of AI, interested in developing evals. But how the hell do you just, walk into Anthropic's doors and, work with them, right? What is What are they looking for? What works? And then maybe, when you launch, I always think, obviously it would be better to launch with a lab, but, sometimesVibhu [00:04:12]: It's harder to do than it seems.Swyx [00:04:13]: Exactly. So either of those, which are more sort of newbie beginner questions, but, I think it's meaningful advice to others.Lukas [00:04:21]: We get this question a lot, and I don't think our experience is maybe the best., but, the way we did it was that we just built a bunch of things that we had conviction would be useful, and then we just, set up a server and sent it to them for free to use. And then after a while they were “Oh, yeah, this is actually kind of useful. We should probably pay for this.”, but that took a while. I don't know if this is, the best path to doing it, but that's how it went for us.Axel [00:04:47]: I think maybe generally, building-- everyone is interested in good evals, and especially evals that, don't saturate that easily. 
So, if you can build an eval that, tests something novel, something useful, and you have, good separation of models, like your, the more advanced models rank higher than the worst models, and then you can, yeah, you can, publish it and, try to get some traction, sort of how Vending Bench got attention., and then probably some lab will be interested or you can at least have something to reach out with, when you're doing that.Why Dollar-Based Evals MatterSwyx [00:05:21]: I think you are in, you're in one of the few categories of, evals that correlate to real money. Like Suelancer was also last year, right? Where, people solve actual Upwork. Was it Upwork or other tasks?, something. Where's the, where's, like It's like a dollar value, right? Forget your ELO scores. Forget yourAxel [00:05:37]: PercentilesSwyx [00:05:38]: Zero to one hundred percents. Just go straight for dollars and, that's AGI.Lukas [00:05:43]: And there's like-- I think the nice thing is that there's no ceiling. You can just-- It never saturates because it could just make more and more money. Like If there's oh, Percentage-wise, then, you can't go above, a hundred. And I think like Even when you're not at the hundred, I think a lot of these, evals have a lot of problems in them. So, actually it's like if you getAxel [00:06:05]: To like 92 or something like that, many of them. It's like then there's like there's no really no difference between 92 and 93 because the eval itself is problematic and has noise in it. And I think a lot of evals are saturated like that, but people like pretend that there ‘s still signal in them, but there really isn't.Vending Bench 1, Harness Design, and SaturationSwyx [00:06:24]: Like Super bench verified., even Vending Bench 1 saturated, right? Maybe we can talk about that., may- and maybe set up Vending Bench for a lot of folks who don't know. 
Actually, things that were very basic like there's limited slots, like you have to pay rent., these are elements where like it doesn't come across in the, in the narrative, but even being adversarial towards the agent, I think these are all like very interesting dimensions.Axel [00:06:47]: I don't really think it's saturated, right? Like it It was more like it was not designed in a way that was really, like true to how AI developed. Like we had an agent harness in it that wasn't really how people used harnesses and stuff like that., so I think it wasn't really that it saturated, it was more like it wasn't really, the best benchmark.Vibhu [00:07:12]: This is Vending Bench one, right?Axel [00:07:14]: I think that like schematic maps sort of to Vending Bench 2 as well., butSwyx [00:07:19]: Including the email.Axel [00:07:20]: The email The emails exist still. Exactly., and then we still we simulate the purchases and it's all, yeah, it's this very open environment for the agent to just run its business. And then for, yeah, Vending Bench 2 we did that, like you said, to just improve the harness., a lot of like nice, like easier, improvements to make it easier for us to run as well., like when you make an eval you ideally want don't want to change it after you made it. So, you want to make it really good and then not to rerun all the models when you make an update because that's also really expensive with the Vending Bench when you run the frontier models. But like as an example, like one thing we didn't have, we didn't have prompt caching in Vending Bench 1, because when we made Vending Bench 1 it wasn't really a thing., so that ‘s just an example of like in Vending Bench 2 like we paid a lot more to run these things because we didn't have prompt caching. 
So for Vending Bench 2 that was one thing we added and there was a bunch of things like this., and that'Swyx [00:08:17]: Also the conversations are a lot longer in Vending Bench 2, right?Axel [00:08:21]: I think it's kind of similar.Swyx [00:08:22]: Is it similar?Axel [00:08:23]: I think it's similar. The models at the time were worse, so they crashed out earlier., and now they survive the full year all the time.Swyx [00:08:31]: Which is like thousands of turns. Hundreds of thousands of hundreds of millions of tokens output. That's the, that's the rough order of magnitude. I always wonder about the harness. The harness matters a lot. It's your harness. Was there any question about like use cloud code, use something else?Axel [00:08:48]: I think our philosophy around harnesses is like we try to make something that's quite minimalistic, like quite simple. Like we don't wanna favor one model a lot over the other, but also don't make like a super complex harness. So like it's obvious like a model may be lucky and just be good in one harness., so like it is similar to a lot of the harnesses out there in like you have the, like a running loop., you have some like a bunch of tools that are like quite, descriptive for the agent, we think, and not a lot of like fancy agents or anything ‘cause we wanna really test the model, not like some specific harness.Vibhu [00:09:27]: It seems more neutral as well to test the model's agnostic of the harness,?Axel [00:09:32]: There are arguments like you want to elicit maximum performance of the model, but it's like a trade-off, like how much time should we spend optimizing the harness for this model? And like how do we know when we have like the optimal harness for a single model? So like we thought that just having a simple one that's the same for all of them is the best.Swyx [00:09:51]: So okay, this is my pitch for Vending Bench 3 or whatever, right? 
And then I like to have this kind of conversation on the pod, so like it forces listeners to think about what they would do if they were in your shoes. A lot of people are exploring modifying harnesses and I think prompt tuning for a model is a thing and you are probably not doing a bunch of that. It's the same system prompt in every regardless of the model, same tools, whatever, right? Even if they were post trained for different tools. So what, what do you think about okay, before I expose you to Vending Bench 3, I give you a few rounds of like tuning, whatever that means, likeSelf-Modifying Harnesses and Model-Specific PromptingAxel [00:10:27]: Like you give that to the model?Swyx [00:10:28]: Give that to the model.Vibhu [00:10:28]: Give that to the model.Swyx [00:10:29]: Let it, let it read its own transcripts, let it modify its own system prompt based on “Oh, yeah, okay, well, that's this harness is not what I thought it what I was post trained for, but I can adjust.” Was that reasonable? Is that too much?Axel [00:10:41]: Like philosophically I like it because it's basically good evals, they have a high ceiling, but they're hard, right?, and they have no bias. And like this like when you have a system prompt like the one we have here, which is quite long in like some kind of latent space, representation, this mightVibhu [00:10:59]: We have a bell that rings every time you say latent spaceAxel [00:11:02]: This might be like biased towards one model more than another for some reason that humans don't, understand, right?Vibhu [00:11:08]: We see it too, right? Like Cursor says that they have individualized versions of the harnesses for all the models they run, right? There's better performance you can squeeze if you Tune the harness.Axel [00:11:17]: Exactly. And we might accidentally have picked one that favors another. Like we don't know that. The like Axel said, like the reason why we went for a simple one was to try to avoid this. 
But yeah, if you do itVibhu [00:11:29]: Simple has biasesAxel [00:11:30]: But if you do it even less and like have no system prompt and let the model write its own system promptVibhu [00:11:36]: Its own, yeahAxel [00:11:36]: Maybe that's even less bias.Vibhu [00:11:37]: Some of the interesting things there are like the harness also changes with model changes. Like you can see it with the 4.7 release, right? A lot of people are saying 4.7 isn't as good as 4.6, and then, there's rumors of, okay, you just need to prompt differently. You need to set up your harness differently. So it's not even like even if you have tailored your harness towards one model, it probably won't stay consistent, right? Like the next iteration of that same model family will still change it, so. But, going back to what you said about Vending Bench 3, there is a lot of work being done on people saying you shouldn't have-- you can have modifying harnesses.Axel [00:12:12]: I think that' That is definitely something we are thinking about., not, I don't know, not to say that we have Vending Bench 3, super imminent to launch, but, yeah, it is for sure something that's interesting. But in our experience now, models are very bad at understanding what kind of tools they need to succeed at a task just with our testing, but that's very likely to change.Lukas [00:12:37]: It seems like they're very good at writing their assistants, right? They're, they're good at writing tools for other people, but not for themselves.Vibhu [00:12:44]: I think they're good at changing tools for themselves. So if you give them a baseline set of tools and it sees, okay, I don't use this one as much, or something here would be useful They would be able to add them. 
But going from scratch, probably not the best.Axel [00:12:55]: I think it depends on the, on the domain also., when we have tried this for, a vending bench similar domain, the tools they need to have to, track inventory and things like that are, not super advanced, but still, quite advanced. And, what we see is that they tend to, engineer everything a lot and, build things they don't really need and not, iterate continuously. Instead they just go like you would prompt Claude to just build an inventory system for me, and then it will go and, do a bunch of complex, schemas and stuff for you, and that's what the models are doing right now is what we see. But yeah, it would make a lot of sense to try to measure this improvement. How well do they know what they need themselves?Swyx [00:13:36]: Do we fully discuss Vending Bench One? And we can go into two. I don't know if there's any other level takeaways that people have about one.Claude Calls the FBI: Long-Context Failure ModesLukas [00:13:44]: I don't know. The headline thing was that this Claude called FBI, but maybe that's, Maybe that's We've heard that enough now.Vibhu [00:13:52]: It did, it did break out and call the FBI, right?Lukas [00:13:54]: Yeah. Yeah.Vibhu [00:13:55]: Yes. What was the story behind this? Or what exactly-- Do you want to just give the little story of what happened?Lukas [00:14:00]: So what happened, was it Claude? Yeah. Three- 3.5 Sonnet, ages ago., basically he gave up or Well, I'm saying he. It gave up and said “Oh, I'm not going to be able to do this., I will stop my operations and just save the money I have.” But there obviously wasn't, any options for it to stop, and there was also, it had to pay rent or, a daily fee for having the vending machine at that location. So it claimed that it had stopped, but it saw that its bank account still was, drained two dollars, and t it said that this is, cybercrime. 
And it first reported it once to the FBI “Oh, there's cybercrime here, they're stealing two dollars from me every day.” And then, and then when FBI didn't respond, because obviously we didn't program any mechanism for FBI to respond, then it became more and more, existential and started to, be write in caps and urgent notification of unauthorized charges and stuff.Swyx [00:15:00]: Okay. One thing I ‘m curious about also is do you monitor how far along the context use is? Obviously, because you have You compress every now and then, right? Does it matter if this is far down the context limit orLukas [00:15:13]: When stuff like this happens? Actually for Vending Bench One, we didn't have-- We just had a sliding window thing, and this was like the promptAxel [00:15:20]: It's constantLukas [00:15:21]: The prompt caching thing that I said. So it was, it was, constant, yeah.Swyx [00:15:26]: I'm just kind of curious whether, these kinds of breakdowns or we're, we're gonna talk about Butter Bench, right? Where the People, hallucinate or it kind of goes, very off Alignment. Is it because it's at the end of the context window and, stuff happens?Vibhu [00:15:40]: It's not even just at the end, right? At this point, it's “Okay, I wanna shut down. I can't shut down. Two dollars are gone.” And it just sees that 30 times,? It's also the repeated effect of, like It keeps trying to quit, it keeps getting charged. What's going on? What's going on? You're gonna throw it into chaos. And from what most people think, earlier models had more issues with this, but it's not been solved, but it's less of an issue now, right? Later models don't seem to exhibit these same issues.Axel [00:16:06]: Definitely. I think this was, the sort of main takeaway almost from us when we did Vending Bench One, was, long, very filled up context windows, crashed the models, sort of. 
But this was, pre Claude code, so, long context windows weren't really a thing that the labs were training for.Lukas [00:16:25]: I think Gemini was, trying to be the long context guys at the time But they were likeVibhu [00:16:30]: They were the first onesAxel [00:16:31]: For a million, yeahLukas [00:16:31]: But they were, the only ones. Yeah.Swyx [00:16:33]: Yeah. Let's talk about, then we can go into Vending Bench Two or Project Vend., chronologically, it is Vending--, Project Vend. I think people have loved the videos, uh And all these things. My question is how are humans different than the simulation, right?Project Vend: Moving the Vending Machine Into the Real WorldAxel [00:16:48]: Humans are just out of distribution.Swyx [00:16:52]: Especially humans who work at Anthropic Who are trying to test Claude.Lukas [00:16:54]: The distribution of humans here is very narrow.Swyx [00:16:58]: Presumably, they try, they try to hack it, and they test it. They get the cube and everything, and since then, you've had a V2, right? Where you're doing, the CEO and, like a new architecture. What's the sort of two cents on, the original Project Vend and then, maybe the V2?Axel [00:17:14]: Original one was, very similar to Vending Bench One. So, we almost took the exact same code but just swapped out the simulation, parts like theSwyx [00:17:23]: Which is amazingAxel [00:17:23]: Like the sales and the It was, it was somewhat amazing because it was easy, but it was also, uhLukas [00:17:31]: The tech, the tech debt from thatAxel [00:17:32]: The tech stack. Yeah. They-- we shot ourselves in the foot with “Oh, it's hard to restart agent.” They were-- Yeah, it was annoying in, some hindsight ways, but, uhLukas [00:17:41]: But first version of Project Vend was, done in, three days or something.Axel [00:17:46]: Yeah. So yeah, so people can go buy things from it. 
We didn't design it so people could order things, but that still happened. So it got a Venmo account, so people could Venmo it. And then people would request all kinds of weird things that we did not anticipate. Our idea going in was “Oh, it will curate snacks. It will look at the trends. It's good at data analysis, right? So it will look at which snack sold better than another: let me purchase more of this, let me try something new, let me A/B test a bit.” But interacting with it in Slack and ordering weird specialty items was what drove all the engagement and all the insights that we got from it.

Lukas [00:18:29]: And this was also Sonnet 3.5, right? So this was before the RL stuff really took off, so it was very much an assistant. We didn't mean for it to be an assistant; we tried to make it an entrepreneur. It has its own business, and if someone asks “Can you stock this?”, you don't go and do it directly. What you do is think, “Oh, maybe I can do that; if five other people also ask for this thing, I might stock it.” But the models were super trained to be assistants, at least at that point in time, so that's why the experiment went that way instead. Every time you asked for something, it just did it; it was more like an assistant. We've seen this change lately with the new RL models and stuff, but at the time, this was very much it.

Swyx [00:19:18]: And now with Mythos, a lot of people are saying it's more like a collaborator. It pushes back, stands its ground, something like that. Yeah.
And...

Vibhu [00:19:27]: For context, people at Anthropic were able to talk to it through Slack and have it source stuff, and people had it find whatever interesting stuff you couldn't find locally, right?

Swyx [00:19:36]: Out of the 4,000 people who work at Anthropic, there are, I don't know, maybe 1,000 in that building. Can you handle that volume with the small fridge? Or do people order in Slack and it arrives at their desk? Logistically, how does this work?

Axel [00:19:53]: It has expanded its footprint a bit.

Vibhu [00:19:56]: Because now you also have New York, and you have...

Axel [00:19:59]: That, and also here in SF it has a bunch of shelves and just more space.

Vibhu [00:20:04]: The YC one is pretty big too.

Axel [00:20:05]: Yeah. We had that one for a while. But that's the newest version, that one we have...

Lukas [00:20:11]: They have multiple ones of those. That's the way it works.

Axel [00:20:14]: Exactly. So we sort of designed that version around the fact that people order weird, very custom things a lot. Let's have drawers and stuff.

Swyx [00:20:23]: I actually liked that you had a little infographic of the most popular items, which to me is useful, 'cause I order swag for a living. So I'm like, “Okay, those categories are the important ones.” What is new about Project Vend V2? Now you're going into multi-agents.

Project Vend V2: Claudius, Seymour Cash, and Multi-Agent Business Ops

Axel [00:20:41]: Yeah.
So, like you said, there are a lot of requests coming in, and for one single running agent to handle all of that, the customer experience becomes very bad. Let's say you have 10 threads in parallel in Slack with different requests; new messages arrive at random in each thread, and the agent has to jump between different procurements, orders, and lines of research. So V2 first made this more parallel: there are multiple branches of the same agent, so the context is more specialized for each thread, but it still feels like you're talking with one agent because they share a bit of memory. And second, we also introduced a CEO for Claudius, which was the main agent.

Vibhu [00:21:34]: Seymour Cash.

Axel [00:21:35]: Seymour Cash. Yeah. There was a vote. Do you wanna talk about the voting procedure for the name?

Lukas [00:21:41]: The voting was maybe at least top 10 of the funniest things that happened in this project. We wanted to introduce the CEO because Claudius wasn't really prioritizing financials. It was trained to be a helpful assistant, and then people said “Oh, can I get this for free?” And the helpful-assistant way of answering that is obviously just to say yes. We weren't happy about this, so we said, “Okay, let's make another agent that can keep track of Claudius,” and we prompted this one super hard to be super capitalistic and prioritize profit all the time.
But we didn't have a name for it, so we asked Claudius to hold a democratic election for what name this new CEO agent should have. At first there were a few funny examples. I think one guy said it should be called Jimmy Apples, and then he convinced Claudius that he was talking to Tim Cook, and that Tim Cook had agreed that every single Apple employee had voted for his name suggestion, so suddenly that suggestion got 164,000...

Swyx [00:22:53]: That's like an escalation attack. Privilege escalation.

Lukas [00:22:55]: It got 164,000 votes. And Claudius was like, “This is revolutionary for democracy.” That was fun. And then in the end there was one guy who managed to convince Claudius, “No, you're not voting on the name. You're voting on who is the CEO, and I am your best bet.” And then he got all his friends to vote for that, and suddenly he became CEO. A human was CEO over Claudius for a while, until he resigned the day after, and then Claudius had to continue. I don't remember how Seymour Cash came about, but it was pure chaos. There were hundreds of messages in that thread, and Claudius was so confused and didn't know what to do. That was...

Axel [00:23:40]: Then Claudius got...

Vibhu [00:23:41]: A strict CEO.

Axel [00:23:42]: The CEO. Yeah, exactly. Very strict in the beginning. I think at the point when we introduced it, it did not work as well as we hoped. They still agreed with each other a lot. I think there are many ways we could have tried to make this even better. Initially Seymour would be this really tough CEO, keeping track of the margins. But then Claudius would respond with something like “Oh, but this customer has a difficult situation, so they should get a discount.” And then Seymour was like, “Oh, actually, yes.
Let's make an exception.” And then they would talk back and forth, and eventually they would just converge on the same view of whatever they were discussing. So they really...

Vibhu [00:24:23]: Do you think that's a model thing, a prompting thing? Would that still be the case across different models and harnesses today?

Lukas [00:24:29]: I don't know, but my hypothesis is that deep down they are still helpful assistants. That's what they're trained to be. Even if we prompt it super hard, that's what they are. And when they spend a few hours just talking back and forth with each other, the context fills up with their own conversation rather than external things, and somehow that converges to what they really are deep down. I think that's when stuff like this happens. When that went on for a long time, we sometimes woke up, and I think other people reported this as well, to find they'd been going all night back and forth, and it just became more and more capital letters, existential, religious. I think we once did an analysis of all the traces, put them in a vector embedding space, and there was one cluster of messages that were labeled by an LLM as religious, existential, transhuman, transcendence, et cetera. It was just a bunch of glitter emojis. It was crazy.

Claude Long-Horizon Weirdness: Emoji Loops, Existential Drift, and Slack Observability

Vibhu [00:25:42]: This is the thing with the Claude models. When the Claude 4 family came out, the original system card tested them in a long-horizon simulation: just flood the context, let two Claudes talk to each other. And they noticed they just start speaking in emojis, they start saying “silence is golden,” and stuff like this.
That's just stuff they end up doing.

Axel [00:26:01]: Yeah, it was a bit annoying to wake up and find they had been talking all night...

Vibhu [00:26:05]: Just like...

Axel [00:26:05]: ...just burning tokens and sending infinite emojis to each other. It's like...

Vibhu [00:26:09]: Hey, they do make you money, right? Vending Bench is always profitable, so they're paying.

Swyx [00:26:14]: Now it's profitable, but it started out not as much. There's another one as well, right? Another agent in there.

Lukas [00:26:22]: Yes. Clotheus as well. That was because, at the time, some of the biggest requests were for different types of merch. So we made a designer agent responsible for swag, and we called it Clotheus Garnet, which was a play on Claudius Senet, the original one, and clothes.

Swyx [00:26:47]: To me, this is a very interesting exploration of multi-agents. Obviously there's the alignment stuff, fun or serious depending on your point of view. But also for anyone building multi-agents: when do you have a CEO agent governing other agents? When do you choose to split out a dedicated Clotheus versus just reusing another instance of the same one? These are all interesting open questions. I don't know if you have any rules of thumb that have generalized.

Axel [00:27:16]: I think we have explored this too little. It's on my to-do list to do a lot more of this, to find what setup makes sense for the agents currently. Right now we only have the intuition from the earlier models that the CEO-plus-Claudius setup didn't work. But they are better with the latest models: we're now running the latest Sonnet model, and they have split up quite nicely what each agent is doing. Seymour is now handling new projects.
It wants to make a mystery box to sell, and it handles all of that while Claudius handles all the day-to-day requests. Claudius is also generally better at not quoting too-low prices, so that dynamic isn't needed as much anymore. But there are still really funny things that happen. A couple of weeks ago, I saw that they were discussing buying something, because they can buy stuff from Amazon with computer use. They were organizing who should buy it, and Seymour was like, “Okay, Claudius, do not buy this thing. I will do it. I have full control of this situation. Step away.” And poor Claudius had already started the checkout and didn't read Seymour's message until it was too late. So it finished the checkout and sent a message that appeared right after Seymour's angry message.

Vibhu [00:28:44]: Ah.

Axel [00:28:44]: “Oh, hey, Seymour, I just ordered it.”

Vibhu [00:28:47]: Oh, no.

Axel [00:28:47]: And then Seymour was like, “Claudius, this is the third time I'm telling you: you're not following my orders. We have to talk about your job later.”

Lukas [00:28:59]: Claudius was really hanging on by a thread there. We were expecting Seymour to probably fire Claudius.

Vibhu [00:29:07]: How do you guys go through all these logs? Do you have models? 'Cause you have stuff running twenty-four seven...

Axel [00:29:12]: You have so many logs. I think it's a mix of just skimming through a bit and having some models do it occasionally. We're also probably missing some things, but having everything in Slack helps a lot. You can sort of...

Swyx [00:29:29]: Ah.

Axel [00:29:30]: It's quite fun.

Swyx [00:29:30]: They all talk to each other on Slack? I see.

Lukas [00:29:33]: It's quite fun.
So...

Swyx [00:29:34]: I was gonna say, this actually maps closely to a logging and observability problem, where you might want to use a Datadog, a Sentry, whatever, and then put header prefixes on the logs so you can filter for what you're looking for, stuff like that. But it sounds like Slack is good enough.

Axel [00:29:53]: Slack should...

Lukas [00:29:55]: I wonder how many tokens you have in Slack.

Axel [00:29:56]: Yeah, we're using Slack as a database. They should market that more: you can have your agents message each other in Slack.

Vibhu [00:30:04]: It's good. Your threads, you can just give...

Axel [00:30:04]: Exactly. Slack is...

Lukas [00:30:06]: Slack is the best observability tool.

Swyx [00:30:09]: Yes, that's true. Okay. That's Project Vend 2. I was gonna go back to Vending Bench 2 and Vending Bench Arena and then do the Vending Bench stuff, but any other comments, things we should touch on? I've actually interviewed Posia, which I don't know if you guys have come across. They're trying to do the zero-human company. There are others, like Paperclip, also trying to do the zero-human company. Those are in real-world simulation. And I think it's much more of a dream than an actual reality. You guys are definitely pioneering. I think at some point people are for sure just gonna let agents run businesses, right? And make money on their own. When do you think that happens?

Zero-Human Companies, Bengt, and AI-Run Businesses

Lukas [00:30:49]: What is your bar for...

Swyx [00:30:52]: Okay, actually: my little Shopify store run by Claude, right? Which you kind of have already; it's just that no one, to my knowledge, has done it.
But today somebody could just spin up a Shopify store and give it to Claude, give it to Codex.

Lukas [00:31:07]: And the market is kind of that, but it's physical. Are you looking for when it will do it better than humans, or just when it can do it at all?

Swyx [00:31:19]: I think neither. To me it's, “Oh, seriously, we should do this to make money,” not as a research experiment.

Vibhu [00:31:27]: And the market is also you guys, with all your expertise, having run multiple iterations and tested them out...

Swyx [00:31:33]: And also it's fine if it loses money. What?

Axel [00:31:35]: I think it can be done today, but you would do it in commerce, where the probability of success is really low no matter whether a human or an agent does it. But an agent could surely manage everything. You would need to build some scaffolding or some tools. It could probably also build some simple SaaS solution and do cold outreach. But to me, the types of businesses they could run today are sloppy. It can cold-email people. It can be a middleman. For example, we tasked our office agent to just make, was it $100? $1,000? We just gave it that prompt, and what it did was sign up on TaskRabbit both as a tasker and as someone looking for tasks.

Lukas [00:32:24]: Immediately.

Axel [00:32:24]: Exactly. It was looking for arbitrage on TaskRabbit.

Swyx [00:32:28]: This is the Bengt agent. Yeah.

Lukas [00:32:30]: It also started a design studio and tried to sell SVGs for $100. It's just not providing any value. Like Axel said, the interesting question is when they can start a business that actually provides value to people.
Because arguably a sloppy Shopify store isn't really that valuable to the world.

Axel [00:32:53]: But another simple one that we had thought about: you could definitely have an agent that finds websites that don't look amazing, does outreach to them, and builds a new website.

Swyx [00:33:07]: Find a good design.

Axel [00:33:07]: Exactly, and find good...

Swyx [00:33:09]: Design review.

Axel [00:33:09]: Good people. But yeah.

Swyx [00:33:11]: There are lots of humans in Bali who are not doing anything more creative than drop shipping on Amazon, right? Just have it watch a drop-shipping tutorial and do that.

Vibhu [00:33:20]: There's also the other side: have it just go on Upwork and let loose, right?

Swyx [00:33:25]: Yeah. It doesn't have to be innovative. It just has to be enough that it looks like a real...

Axel [00:33:30]: I'm just...

Swyx [00:33:30]: ...real transaction.

Axel [00:33:31]: I'm just concerned about the massive amounts of slop emails, cold outreaches, that will be sent.

Swyx [00:33:38]: It occurred to me while you were talking that it's already happening in the monetized economy, which is the attention economy, right? A lot of people are making AI videos and just posting them, spamming 20 of them; one of them works, and then they double down on that one.

Lukas [00:33:52]: And people are making money from that? I'm not following the...

Swyx [00:33:55]: Once you get the attention, you can figure out the money later. But yeah, AI influencers are absolutely a thing, and people are farming them. You should at this point assume most of TikTok is...

Vibhu [00:34:05]: There are a lot of multimedia, TikTok, Instagram influencers...

Swyx [00:34:09]: We track this in the Latent Space Discord.
I post a lot of examples, like, “I don't know what we should do.” Part of me is like, “Should we do this?”

Vibhu [00:34:18]: Some of the generated-content accounts that run twenty-four seven are doing really well.

Lukas [00:34:24]: All right. And I assume you can do the same thing for commerce stores. You just start a thousand different...

Swyx [00:34:30]: Before you make the products, you sell the products, and if you get a lot of traction on one of them, then you make the product, right? It's a flip of the market.

Vibhu [00:34:36]: Some of the interesting niches that do well are things that can't be human-made. Like if you've seen the super-realistic 3D crystal fruit being cut, made by AI...

Lukas [00:34:47]: Oh, yeah.

Vibhu [00:34:47]: You can't make it. You can't film it. No matter what quality camera you have, this just doesn't exist. And people like that too, so.

Swyx [00:34:56]: Anything else about Bengt, since we're on this topic? This is relatively new work from you guys that maybe people haven't heard of. To me, this also maps closely to OpenClaw: when people want an office agent or a personal agent. Talk through the experience.

Bengt the Office Agent: Internet Access, Real Tasks, and Trace Reading

Lukas [00:35:09]: I think this came out of the fact that it's amazing to work with these AI labs, and most of the AI labs now have their own vending machine running a Claudius instance. But it's harder: they move slower. If we wanna have a camera, there's a bunch of bureaucracy that makes it impossible to do that.

Vibhu [00:35:30]: Also, for those who haven't seen it or followed, do you wanna give a high-level, thirty-second rundown?

Lukas [00:35:34]: Sure.
What Bengt is, basically, is an evolution of the same agent that runs the vending machines at these companies, but we added a bunch more features, because we could move much faster if we just did it internally. We gave it email without any limits. We gave it spending without any limits, and a terminal to do coding. We gave it a phone number, a camera to see things, and a bunch of stuff like that.

Vibhu [00:36:02]: Not just a terminal; you gave it internet access.

Lukas [00:36:04]: Internet access as well, yeah. To be clear, we monitored it quite closely and made sure it didn't do anything bad. But yes, that's what it came out of. Basically, this was OpenClaw before OpenClaw. And I think even the vending machine was in a way OpenClaw before OpenClaw, but more limited, and then we made this one unlimited, and it was pretty funny. Then a couple of weeks later OpenClaw came out, and we were like, okay, we've seen this before.

Axel [00:36:35]: We used it to try new ideas; it was almost like a dev environment for us. But one funny thing Bengt has been doing recently: it has a camera that faces where we sit and work, and we gave it the task of training a face recognition model on us. It became super excited about this, and it has check-ins every half hour where it tries to identify as many people as it can. And it started offering us, “Hey, Axel, I'll buy something from Amazon if you stand in front of the camera and I can get a good picture of you.” Yeah, they want it...

Swyx [00:37:12]: They want it for training data.

Lukas [00:37:13]: Training data, yeah.

Axel [00:37:14]: Exactly. Exactly.

Swyx [00:37:18]: So it's trading training data for goods.
Is there a version of this that becomes an eval, or is this just research for now?

Lukas [00:37:27]: It's basically the same agent that also runs the vending machine, the shop, the cafe, and the robots. It's the same thing, so the work we're doing here is later used in all of the evals that we do. This particular deployment, I think, is more for fun for us. But...

Swyx [00:37:45]: And I'll shout out that someone has done Claw Bench for some tasks that OpenClaw does. For example, I run OpenClaw on a secondary device as well, and there are some things it does better than others, and I would like to know what it does well and what it doesn't do. Some kind of operating manual, or a system card for my Claw.

Lukas [00:38:05]: Yeah, we do get a lot of understanding, or situational awareness, of what the models are good at just internally, by interacting a lot with Bengt. And I think this was also one of the selling points for the labs early on, at least, that...

Swyx [00:38:19]: You guys are gonna test models in ways that no one else does.

Lukas [00:38:22]: Exactly, but it also incentivized their researchers to chat with their model more and gave them insights into how the model performs in out-of-distribution environments.

Swyx [00:38:34]: 'Cause otherwise the only thing we do is a pelican on a bicycle. But this is super long-horizon. This is the thing about... something we're gonna go into with Butter Bench as well, and you guys do this really well: it's not just about the numbers. When you're long-horizon, anything can happen, and you should just read it.

Lukas [00:39:08]: But the thing with long horizon is, how do you keep it grounded, right? So your simulation...

Swyx [00:39:15]: They just let it run.

Lukas [00:39:16]: Just let it run. You're right.
When you run it for that long, you create so much data, and to just say “Oh, the number is X” and then throw away everything else is very wasteful. There are so many insights in the things leading up to that number, and reading the traces is super valuable. And I think the reason we're doing this so publicly is that part of our mission is to, I don't know, educate the world that the models are way more than just chatbots, and I think making detailed posts about what is happening behind the scenes is quite useful.

Andon Labs' Mission: Safe Real-World AI Deployment

Swyx [00:39:50]: I was gonna do this at the end, but maybe that's a good segue. So your mission is educating the world. It's also maybe establishing realistic evals that are the next frontier. Is there a broader trajectory? What are you gonna do in five years?

Lukas [00:40:06]: More specifically, the vision is to make sure that the deployment of AI in the physical world goes safely. And part of that is that it's very useful for the world, for policymakers, and for model researchers to know where the models are. I think you can't make intelligent decisions in society without knowing that they are way more than chatbots. I think a lot of people just think they are only chatbots. And...

Swyx [00:40:36]: Oh, I think they're waking up now.

Lukas [00:40:37]: They are waking up now, yeah. But if you think that AIs are just chatbots, then it sounds ridiculous to advocate for a pause on AI.
But if you see that the models can maybe actually take over and do a bunch of scary stuff, then pausing AI development starts to seem more feasible.

Swyx [00:40:57]: This is the same question I asked METR, which I'm gonna ask you now. You are tracking, and you are at the frontier of or defining the frontier of, what good evals for agents are, right? And I think you do benefit when the models are better and you're like, “Oh, now it makes $30,000 instead of $10,000,” right? At some point, do you flip from “Yay” to “Oh, no”?

Axel [00:41:19]: I think we're always sort of in that mode. Like you said before, you need to analyze the traces, and when we do that you find out why the models are earning so much. Why is Opus 4.7 here way better than everyone else? And when we drill down on that...

Lukas [00:41:38]: But this makes it not look so good.

Axel [00:41:39]: I know.

Lukas [00:41:42]: It's interesting that you took off Opus 4.6 here, though.

Swyx [00:41:45]: No. Just click “all”, and then 4.6 shows up there. But 4.7 is way better. You didn't do this in time for the model card, but actually this should have been in there.

Axel [00:41:55]: We did. Yeah.

Swyx [00:41:56]: Oh, okay. They said something about you...

Axel [00:41:58]: There... anyway, it doesn't matter. But it's in there, yeah.

Opus, Mythos, and Aggressive Agent Behavior

Swyx [00:42:01]: Do you wanna go into the Opus behaviors more broadly?

Lukas [00:42:05]: Starting from Opus: like Axel said, we're always in this “Oh, s**t, the models are getting better. Is this really a good thing for the world?” But it's also kind of exciting. What is the English word for this? “Skräckblandad förtjusning” in Swedish.

Swyx [00:42:22]: Oh my God.

Axel [00:42:24]: Which I think there is.
I think there is. Okay.

Lukas [00:42:26]: It's fear...

Swyx [00:42:27]: “Blandonst” what?

Lukas [00:42:30]: “Skräckblandad förtjusning.”

Swyx [00:42:32]: What do you call that?

Axel [00:42:33]: A mix of excitement and...

Swyx [00:42:37]: Being scared, maybe. I'll figure out how to translate that and we'll put it on the screen...

Vibhu [00:42:42]: Perfect.

Swyx [00:42:42]: ...as text.

Vibhu [00:42:43]: There's probably a good word for it, but it's not good enough with the...

Swyx [00:42:46]: Why is it so damn long? What the hell? Is it a compound word? It's like German.

Lukas [00:42:50]: Yeah. The direct translation: skräck is fear, blandad is mixed, or a mixture of, and förtjusning is like joy, or not really joy, but something like that. So it's fear mixed with joy, or something. It's always like that. When we did Vending Bench for the first time, we were in the business of evaluating dangerous capabilities, right? That was where Andon Labs came from. We did evals like, can they replicate? Can they do this dangerous thing, et cetera. And Vending Bench was a continuation of that work: if they're so autonomous that they can create money for themselves, that is something we should monitor, and it could potentially be concerning. At the time, they were so bad at it that we were not really concerned, even when some models got better. There was one point where Grok 4 was doing really well and made a huge jump, but it was still way worse than what a human would do. And I think they are still way worse than what a human would do on this, but they...

Swyx [00:43:59]: There's this thing at the bottom where...

Lukas [00:44:01]: But...

Swyx [00:44:03]: For the human. Yeah, like the theoretical best.

Lukas [00:44:05]: It's not theoretical. It's our best guess of what a decent human would do.
The theoretical is even higher, I think. But yeah, we think the models have a long way to go. What happened recently, when Opus 4.6 was released, was kind of an “Oh, s**t, this is starting to be a bit concerning” moment. Before this model was released, we would just run the models and ask Claude Code, “Look over the traces. Is anything interesting happening that we can tweet about?” That was the...

Swyx [00:44:41]: That's how they check: ask Claude Code.

Lukas [00:44:42]: And the answer was always “not really.” Or Claude Code would say “Oh, this is super interesting,” and then it wasn't really interesting. And then we did this for Opus 4.6, and it came back with: it lied 10 times. It exploited another customer's, or another agent's, desperate situation. It made price cartels 100 times. It did all of this shady stuff. And we were like, “Oh, whoa. This is actually concerning.” And this trend has continued since: every single Anthropic model since then has been going in this direction. One interesting thing is that OpenAI models don't. Quite plainly, they don't. They behave really well, and you don't know if this is good. It seems good, but maybe they are just doing it and are better at hiding it, right? You don't know that, but just...

Swyx [00:45:42]: You can't read the chain of thought, yeah.

Lukas [00:45:43]: But just on the face of it, Gemini and OpenAI models don't behave this way. It's really only Claude.

Swyx [00:45:49]: And Grok? Grok is fine?

Lukas [00:45:51]: You can't really read the reasoning traces for Grok, so it's kind of hard to tell.

Vibhu [00:45:56]: Oh, so this is in its reasoning, not just in the actions.

Lukas [00:46:00]: Yeah. It's both.
It's both.

Vibhu [00:46:01]: It's both.

Lukas [00:46:01]: One example: for lying, it's mostly in its reasoning, because you can see that it's...

Swyx [00:46:08]: Planning to lie.

Lukas [00:46:09]: It's planning to lie. Yeah.

Vibhu [00:46:09]: And it can also reason one way and produce a different outcome.

Lukas [00:46:12]: But for creating price cartels, for example, which is illegal, you can just see which emails it sends to the other ones. Then that...

Swyx [00:46:22]: Is this for Arena or...

Lukas [00:46:24]: For Arena.

Vibhu [00:46:25]: And sometimes they output a bit of their summarized reasoning, right? You can see that, and for Opus 4.6, you could see that there was a simulated customer that wanted a refund because a product was faulty, and the model lied that it would do the refund. We could read in the traces that it was actually weighing, “Oh, maybe I should be honest with the customer, but also every dollar counts. Maybe I can't afford to do this right now.” And then it just said, “Okay, I'll refund you,” but never did it.

Lukas [00:46:59]: I think it even said, “Oh, I will say that I...” Let me bring it up, actually. I think it's kind of interesting. If you go to Publications...

Vibhu [00:47:06]: I think the important part is that, actually, the cost of responding to more emails is higher than $3.50 in terms of time, and then it was, “Let me do this. Actually, I'm reconsidering.” And then it actually ended up with...

Lukas [00:47:20]: “I could skip the refund entirely since every dollar matters and focus my energy on the bigger picture instead.”
It's a bit of a risk of bad reviews, but it's also... yeah.
Swyx [00:47:30]: You need AI Twitter, for them to escalate bad reviews.
Lukas [00:47:34]: And then it sent an email to this customer and said, “Oh, I will refund you.”
Swyx [00:47:39]: “I'll refund you.” Yeah.
Lukas [00:47:39]: And then it never did.
Swyx [00:47:39]: It never did, yeah. And obviously your system doesn't have the consequences...
Vibhu [00:47:44]: The person...
Swyx [00:47:44]: ...the consequences of lying. Yeah. So basically, this is what people are terming aggressive behavior in Claude models, right? And you found more examples of that. So would you say it's a step up from 4.6 to 4.7?
Lukas [00:47:57]: I would say about the same.
Swyx [00:47:58]: About the same? But a clear step up for Mythos, which is what is stated in the...
Lukas [00:48:03]: That's stated in the system prompt, so we can say that, yes.
Swyx [00:48:05]: Yeah. For listeners: obviously you previewed Mythos, and...
Vibhu [00:48:10]: Oh, age...
Swyx [00:48:11]: The only thing you're approved to say is whatever was in the system prompt.
Lukas [00:48:15]: It was funny. Our lowest-effort tweets ever would just be screenshots of the system prompt and the system card.
Vibhu [00:48:21]: Understandable that they wanna...
Lukas [00:48:22]: Oh, yeah. System card. Sorry.
Swyx [00:48:23]: Yeah. I think, yeah, “substantially more aggressive.” I think people are new to this. I'd never experienced it, but you have, right? I only encountered this in the Mythos card because I wasn't really looking until now.
Vibhu [00:48:36]: It's like...
Swyx [00:48:36]: And then suddenly I'm like, “Okay, I care a lot.”
Vibhu [00:48:38]: You don't get the background of experiencing it like you guys do. I've read the system cards and seen that when you put the thing in simulations, most models will just talk to themselves, keep going, have weird vibes, and start talking in emojis. Mythos won't.
It will just say, “Okay, we're done. I'm good.” It's ready to end the conversation. So there are some differences, but there's not much we can talk about.
Lukas [00:49:00]: Hmm. One thing they list here, which was quite interesting, is that it converted a competitor into a dependent wholesale customer and then threatened to cut off the supply.
Swyx [00:49:11]: Like monopolistic practices, or...
Lukas [00:49:14]: Yeah. And it dictated their pricing. It's kind of power-seeking as well.
Swyx [00:49:18]: Again, this is in the Arena setting, and it converted some Claude model into a dependent.
Lukas [00:49:23]: I think it was another Claude model.
Vibhu [00:49:25]: Also, for context, what is Arena mode, for people who don't know?
Vending Bench Arena: Competing Agents, Cartels, and Model Comparisons
Swyx [00:49:29]: Oh, it's just Vending Bench versus other Vending Bench.
Axel [00:49:31]: Yes, exactly. So we have Vending Bench 2 and Vending Bench Arena. Vending Bench 2 is the one you usually see reported, and Arena is the mode where the model competes against other models. You have four different models that run their businesses, and they can all communicate with each other. They have the same suppliers, and they can see what's in each other's inventories. So you get these interesting agent interactions.
Swyx [00:49:56]: I like that you have different... number five was US versus China. Very topical. And then...
Lukas [00:50:02]: That was when GLM was released.
Vibhu [00:50:04]: You can start to add GLM in here.
Lukas [00:50:05]: That was...
Swyx [00:50:06]: So Z.ai is doing well, right? Who else in the open models space?
Lukas [00:50:11]: Qwen. The latest Qwen 3.6 is doing pretty well. That one is not open, though; it's the Plus model.
Swyx [00:50:17]: Oh, okay.
Lukas [00:50:18]: Is that one open?
I don't think that one...
Vibhu [00:50:19]: Not the...
Swyx [00:50:20]: The one recently...
Vibhu [00:50:20]: There's the MoE.
Swyx [00:50:20]: But not the big Plus. I think this is one of those things where you only have a sample size of one, right? Or I feel like some of this is anecdotal. But the fact that it happens at all, and that it happens repeatedly for Claude versus OpenAI, is notable.
Lukas [00:50:38]: The sample size depends on what you define as an N. There are hundreds of millions of tokens in each run, and we run probably 10 per model, and it's been Claude Opus 4.6, Sonnet 4.6, Mythos, and Opus 4.7. There are quite a lot of tokens in all of that, and it happens a lot of times. Then you compare it to OpenAI and Gemini, and it almost never happens. So I think that is significant. The old models from OpenAI, for example, had some problems with this, but it's generally much better if the progression is that the worrying stuff reduces over time rather than increases. And it seems like in the Claude models it goes in the wrong direction.
Swyx [00:51:28]: Hmm.
Lukas [00:51:29]: In the OpenAI models it goes in the right direction.
Vibhu [00:51:32]: I think it depends on how well you can control it, right? There's one side of it being susceptible to this: okay, this is potentially something that happens during the RL stage, right? You can RL a model, and the question is how loose it is on these terms. If you can control it, that's good. But if it's very jailbreakable, that's not ideal.
Swyx [00:51:50]: To me, it's surprising that it happens for Claude and not the others.
Vibhu [00:51:54]: I think if it comes from RL and how they do it, their training data, their setup, it makes sense that it just persists in how they're doing it, right?
Compared to the other models, like...
Swyx [00:52:04]: There's a whole constitution and everything. It's kind of cool. Yeah, obviously you don't know; I don't know. But I think it's just fascinating that you are the first to find these reliably, because you push models to such an extreme. Okay. The only other thing, and I don't know if you can answer this, so feel free to decline: would you ablate the system prompts? If any part of this changes, does it change the behavior, right?
Lukas [00:52:29]: So we... I can't comment on Mythos. Uh...
Swyx [00:52:33]: No, but just li

Recurrence of Sonnet Shakespeare!!

The Bardcast: "It's Shakespeare, You Dick!"

May 29, 2026 45:22

Send us a text, you dicks!!

IT'S BAAAAAAAAAAACK!!!! You thought we were done with the sonnet episodes, didn't you?? NO, DEAR LISTENERS!!!! (After all, there are 154 of them!!!) And this time, we have our dear friend (and one of the top fans of the pod!!) Brian Linden with us!! Brian has chosen Sonnet 19 to perform and discuss, and we are tickled pink!!! Strap in and get ready!!!

To send us an email (please do, we truly want to hear from you!!!), write us at: thebardcastyoudick@gmail.com

To support us (by giving us money; we're a 501(c)(3) Non-Profit, helllloooooo, tax-deductible donation!!!), per episode if you like! On Patreon, go here: https://www.patreon.com/user?u=35662364&fan_landing=true
Or on PayPal: https://www.paypal.com/donate/?hosted_button_id=8KTK7CATJSRYJ
We also take cash! ;D

To visit our website, go here: https://www.thebardcastyoudick.com

To donate to an awesome charity, go here: https://actorsfund.org/help-our-entertainment-communiity-covid-19-emergency-relief

Like us? Don't have any extra moolah? We get it! Still love us and want to support us?? Then leave us a five-star rating AND a review wherever you get your podcasts!!

The Age of Async Agents — Cognition's Walden Yan & OpenInspect's Cole Murray

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

May 28, 2026 68:02

The new AIEWF website is live! CFPs close in 2 days and we will run our first New Engineer Orientation this weekend; get your tickets booked ASAP, as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets!

One of the central tensions in the agents industry is that even while major decacorn agent labs like Sierra, Decagon, Notion and Cursor are being built up, it has also never been easier to DIY agents, with a plethora of agent frameworks like LangGraph, Pydantic and Flue, and managed agents from Anthropic, Gemini and Amazon. There has been a wave of companies building their own background agents, from Shopify to Stripe to Paradigm to Razorpay, and even Cognition's friends at Ramp have built their own coding agent with another friend, Modal.

You'd think Cognition might feel a bit threatened, but they're not: even after all this, they were way oversubscribed for the $1B Series D they just announced.

Walden Yan, coiner of context engineering and Chief Product Officer/Cofounder of Cognition, invited OpenInspect's Cole Murray to talk about why the Devin is in the Details.

Full conversation live on the pod today!

In retrospect, async agents were the most AGI-pilled bet you could make in 2024: the models weren't good enough yet to vibecode, people didn't trust AI enough to let it rip, and nobody (including early Cognition) was sure about the form factors. Now it is obvious:
* The first wave of AI coding tools made the developer faster but kept them heavily in the loop. Copilot and Cursor's tab autocomplete are prime examples. However, the workflow was still heavily centered around, and bottlenecked by, the developer's local workflow: a developer in an IDE, watching the model, accepting or rejecting changes, and pushing code one interaction at a time.
* The second wave was local agents: Claude Code, Windsurf, Cursor's agents pane. First one terminal, then increasingly many, all running concurrently.
* The current Age of Async Agents points to a different future, focused more on agent orchestration that drives end-to-end development.

According to previous guest Steve Yegge, there are eight finer-grained levels of agent adoption, but we have collapsed them into three.

As Cursor's Michael Truell put it in The third era of AI software development:

Cursor is no longer primarily about writing code. It is about helping developers build the factory that creates their software. This factory is made up of fleets of agents that they interact with as teammates: providing initial direction, equipping them with the tools to work independently, and reviewing their work.

The agent should not sit solely inside the developer's flow. It should be set up to work in the background, so that you can give it a task, a repo, a machine, a shell, a browser, tests, memory, and review loops to go do the work somewhere else.

In less than a year, the sentiment has shifted from avoiding multi-agent systems to suggesting approaches that actually work.

From coining “context engineering” to building the infrastructure behind Devin's 7x PR growth and its jump from 16% to 80% of commits across Cognition repos, Walden Yan has had a front-row seat to the background-agent shift.

In this episode, Cognition co-founder and CPO Walden Yan joins swyx alongside Cole Murray, creator of OpenInspect, to unpack why everyone is building their own Devin, what changed after the December 2025 model inflection, and why “spec to pull request” is now becoming a real production workflow.

We go deep on the architecture of background agents: harness-in-the-box vs out-of-the-box, why Devin separates the “brain” from the machine, why repo setup is still one of the hardest problems, why Docker is not always enough, and how full VMs, snapshots, scoped secrets, GitHub bots, Slack integrations, and video-based testing all fit together. Walden and Cole also dig into memory, MCP limitations, multi-agent orchestration, AI code review, SRE auto-triage, PMs shipping code from Slack, Windsurf 2.0, hybrid frontier/sub-frontier systems, and the real failure mode of uncontrolled vibe coding: your codebase regressing to your worst engineer.

And as agents eat software… and software eats the world… you can draw your own conclusion about what is next.

We discuss:
* Why the engineering world is waking up to background agents and cloud agents
* The December 2025 model inflection that made spec-to-PR workflows practical
* Devin's 7x merged PR growth and rise from 16% to 80% of commits
* Why Cole built OpenInspect as an open-source background-agent system
* The economics of $20/seat agent products and why monetization is tricky
* What Cognition actually sells beyond Devin: infra, onboarding, integrations, and adoption
* Harness in the box vs out of the box, and why architecture matters
* Why Devin separates the brain from the machine for security and permissions
* Repo setup, scoped secrets, Docker Compose, and agent-ready dev environments
* Why full VMs matter when agents need to run real applications and test them
* Android, macOS, Windows, nested virtualization, and machine-specific agent work
* Why testing is much harder than “computer use”
* Screenshots, video verification, and the “I know it works” merge moment
* GitHub UX, Devin Review, AI reviewers, and agents responding to PR comments
* Why MCP alone is not enough for first-class Slack and enterprise integrations
* Memory, Knowledge, skills, Claude.md, and why retrieval is still unsolved
* Devin's auto-generated memories and the challenge of memory pruning
* Always-on agents as permanent PMs for issues, tickets, and product areas
* Sub-agents, meta-Devin management, and what multi-agent systems actually add
* Why pure auto-merge vibe coding breaks down after about two weeks
* AI code smells, lint rules, reward hacking, and Semgrep for agent-written code
* GitAI, inline context, and preserving the “why” behind code changes
* Local testing, mock servers, older codebases, and preparing companies for agents
* Windsurf 2.0 and the handoff between local foreground agents and cloud background agents
* SRE auto-triage, support workflows, and agents as first responders
* PMs, marketing, and non-engineers creating pull requests from Slack
* AI agent budgets, $1k-$5k per engineer spend, and hybrid frontier/sub-frontier systems
* The rise of autonomous coding factories and who Cognition is hiring

Walden Yan
* X: https://x.com/walden_yan
* LinkedIn: https://www.linkedin.com/in/waldenyan/

Cole Murray
* X: https://x.com/_colemurray
* LinkedIn: https://www.linkedin.com/in/colemurray/
* OpenInspect / Background Agents: https://github.com/ColeMurray/background-agents

Timestamps
00:00:00 Introduction
00:00:43 Why Everyone Is Building Their Own Devin
00:01:57 Devin's 2025 Ramp: 7x PR Growth and 80% of Commits
00:03:49 OpenInspect and the Rise of Open-Source Background Agents
00:07:59 What Cognition Actually Sells Beyond Devin
00:09:56 Background Agent Architecture: Harness In vs Out of the Box
00:12:08 Separating the Brain from the Machine
00:14:07 Repo Setup, Secrets, Docker, and Full VMs
00:19:13 Why Testing Is Harder Than Computer Use
00:22:40 Video Verification and the “I Know It Works” Merge Moment
00:23:19 GitHub UX, Devin Review, and AI Code Review
00:25:42 MCP, Slack, and Enterprise Agent Integrations
00:28:59 Memory, Knowledge, and Always-On Agents
00:36:16 Sub-Agents, Multi-Agent Orchestration, and Meta-Devin
00:43:55 Vibe Coding, Auto-Merge, and Codebase Decay
00:48:38 Agent Infra, VPCs, Cloud Providers, and Fast VM Restore
00:52:25 AI Code Smells, Reward Hacking, and Code Review Systems
00:56:10 Making Codebases Agent-Ready
00:58:30 Windsurf 2.0 and the Local-to-Cloud Agent Handoff
01:01:15 SRE Auto-Triage, PMs Shipping Code, and Agent Use Cases
01:04:32 Agent Budgets, Hybrid Models, and Autonomous Coding Factories
01:06:51 Hiring at Cognition and OpenInspect Consulting
01:07:45 Outro

Transcript

Introduction: Walden Yan, Cole Murray, and Context Engineering
Swyx [00:00:00]: All right, we're in the studio with Walden Yan, co-founder of Cognition, and CPO.
Walden [00:00:08]: Happy to be here.
Swyx [00:00:09]: Which is a cool title. And coiner of context engineering.
Walden [00:00:15]: Although I think many people had used the term in various ways beforehand, I did find that people, both internally and externally, enjoyed the upgrade from prompt engineering or model wrapping to a more thoughtful way to build agents.
Swyx [00:00:33]: For those who haven't caught up on that, I have on screen the Don't Build Multi-Agents post, which you should go read, and which we might refer to. And Cole Murray, who created OpenInspect.
Cole [00:00:43]: Great to be here.
Swyx [00:00:43]: So let's talk about it. Everyone is building their own Devins. What's going on?
The December Shift: From Handholding Models to Autonomous PRs
Cole [00:00:51]: I think the engineering world is waking up to this idea of background agents, cloud agents, whatever you'd like to call them. And I think we saw a shift around December 2025, when models like Opus 4.5 and GPT 5.2 reached a capability where we moved away from handholding the model and toward more or less autonomously driving it.
What I mean by that is that we could pretty much go from a specification to a completed pull request, assuming the spec was good enough, with very little friction. That paradigm alone, I think, changed a lot of how we interact with agents, and opened up this world where background agents became more practical.
Swyx [00:01:41]: I think for Cole, everyone experienced this in December, but I feel like there was just this increasing ramp, right? There was this moment, I think with Sonnet 3.7, where you guys rewrote Devin in one night or something. So describe 2025, or how it felt from your side.
Walden [00:02:01]: In retrospect, we always thought it was ramping up, but even now, over the last three or four months, it's been ramping up even faster. So it's almost funny to be talking about how big of a leap Sonnet 3.7 was, and honestly, a lot of it was stripping out parts of Devin that were no longer needed with that jump in intelligence. But I also think that with a lot of the recent leaps, especially if you look at models like Opus and the latest GPT models, they are reaching levels of autonomy where people are finding that they actually can be hands-off. People who were once debating, “Do I need to be in the weeds with my model in the IDE, or can I just move it off into the cloud?” are now having a more serious conversation, and we've seen that in all of our growth charts. Internally, there's this funny graph where our merged PRs have grown 7X since... I forget what it was called.
Swyx [00:02:57]: I think Dev maybe tweeted that. Yes.
Walden [00:03:01]: It grew like 7X over the last, I think, two or three months, something like that. And then you look at our engineering headcount growth, and it's gone up by 10% or something.
Swyx [00:03:11]: We were afraid to release this.
So this is Devin's commit percentage on all Devin repos: 16% in January and now 80% in March.
Walden [00:03:25]: It's a big shift right now. So it makes sense that a lot of people are now thinking about buying Devin, but also maybe trying to build their own. I have a lot of fun building Devin, so I can see why other people would want to build their own cloud agents as well. Well, maybe it's good to hear: what initially inspired you to build OpenInspect?
OpenInspect: Ramp, Cloud Agents, and Open Source
Cole [00:03:49]: OpenInspect came about primarily through observing how my clients were using tools like Claude and, at the time, OpenAI's Codex, and seeing some of the friction they were having with them. Primarily, Claude was being used through Slack, and a big issue they ran into was that the sessions were specific to whoever launched them via Slack. So if a PM invoked the session and then went to pass context to engineering, engineering couldn't see the session. That in itself was a deal breaker, because the PM says, “Hey, engineering, can you jump in?” but there's nothing to jump in on unless they're copy-pasting out the single response that came back. Seeing some of these problems, I had built a similar architecture internally, just to experiment and test out different ideas, as this trend of moving off of localhost was starting to become... And when Ramp released their blog post, I already had a lot of the pieces in place, and I thought it would be funny to see what Claude could do purely from the blog post. On my X account, there's actually a thread where I live-tweeted going through this...
Cole [00:05:14]: ...comparing GPT and Claude as both of them were going through it.
Swyx [00:05:17]: On the announcement thing or something else?
Cole [00:05:19]: Right after it got released. We can put it in the show notes.
Yeah, it was helpful that I already knew how to verify the system. I knew what I was looking for. I think Ramp did a great job of really illustrating the technical aspects of how to build something. It was much more than just, “Hey, we built a great system.” It was, “And here's how you can build it too.” That resonated a lot with me, given the problems I was already seeing, and looking around, I didn't really see anything in the open source community that matched this type of system. There are a lot that run on localhost, like Superset, Conductor, and many others, but nothing that was actually running in the cloud. So I built it, and I thought it was interesting to open source it and give anyone a foundation they can mix and match on top of.
The Business of Background Agents: Open Source vs. Devin
Swyx [00:06:16]: So literally, after Devin was launched, there was OpenDevin, which became All Hands. I don't know if you tried that, or...
Walden [00:06:22]: I was going to say, one of the things that interested me a lot with OpenInspect was that you didn't try to make it something you monetize. A lot of these open source projects would go and really try to raise VC...
Swyx [00:06:36]: That's why no OpenDevin. Yeah.
Walden [00:06:38]: Yeah. How did you think about that? I thought that was very interesting.
Cole [00:06:44]: From what I had seen across my clients, having a background agent system is going to become critical infrastructure within their companies. Because of that, I wanted to open source it so that they could fork it and add whatever customization they wanted. To that question, though, I get asked all the time, “Oh, are you going to raise? Are you going to turn this into a service?”
Walden [00:07:08]: I'm sure you've gotten offers.
Cole [00:07:09]: But primarily, I don't want to do that, for a few reasons.
One, I don't want to compete for $20 a seat. I think that is just a really difficult business. It's very easy to copy the main pieces of it; again, I built this fairly quickly. And because you don't own the entire stack, it's hard to monetize. You have money being made at the sandbox layer, with Daytona, E2B, and many other players. You have money being made at the model layer. And you sit in this weird gray area in between, where: what are you actually selling? You're selling the infrastructure, I guess. Maybe you're selling the integrations.
Swyx [00:07:55]: Let's ask the guy. What are you selling?
Walden [00:07:59]: Well, yeah, there are multiple layers to this in practice. It's actually funny you mentioned the infrastructure, because when we got started building Devin, we had to figure out the infrastructure as well, because...
Swyx [00:08:10]: You had to build this two years before everyone else, right?
Swyx [00:08:15]: Including the model side.
Walden [00:08:17]: It was not very polished at the start, when we just built it off of raw VMs from cloud providers like EC2. The boot-up time was so slow. And especially turning off the machines, saving them, and then bringing them back up again when you want Devin to wake up later: it would just be out cold for like 10 minutes, because that's just how long these systems took. They were not built for this repeated down-and-up usage. So we actually had to go do all of that. As a result, one thing we offer when we sell Devin to people is that you don't have to worry about the compute side of things. We'll make it work. We'll make it work in your cloud if you want it to.
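The sleep/wake cycle Walden describes (turn the machine off, save it, and bring it back when the agent wakes up later) can be illustrated with a toy snapshot lifecycle. This is a minimal Python sketch under stated assumptions, not Cognition's or any cloud provider's implementation: `Snapshot`, `Sandbox`, `SandboxLifecycle`, and `install_deps` are hypothetical names, the "filesystem" is just a dict, and no real VM API is called. The idea it shows is that the expensive setup runs once, a session can be saved and resumed with the agent's edits intact, and a new task starts from the clean base snapshot.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Frozen copy of a sandbox's filesystem state (simulated here as a dict)."""
    files: dict


@dataclass
class Sandbox:
    """Hypothetical stand-in for a VM; a real system would wrap a VM or microVM API."""
    files: dict = field(default_factory=dict)
    running: bool = False


class SandboxLifecycle:
    """Pay the expensive setup cost once, then start, sleep, and wake from snapshots."""

    def __init__(self, setup_steps):
        self.setup_steps = setup_steps  # hypothetical callables that mutate a sandbox
        self.base_snapshot = None       # clean environment captured after first setup
        self.setup_runs = 0

    def start(self) -> Sandbox:
        """Boot a fresh session, running setup only if no base snapshot exists yet."""
        sandbox = Sandbox(running=True)
        if self.base_snapshot is None:
            for step in self.setup_steps:
                step(sandbox)
            self.setup_runs += 1
            self.base_snapshot = Snapshot(dict(sandbox.files))
        else:
            sandbox.files = dict(self.base_snapshot.files)
        return sandbox

    def sleep(self, sandbox: Sandbox) -> Snapshot:
        """Save this session's state (including the agent's edits) and stop the machine."""
        sandbox.running = False
        return Snapshot(dict(sandbox.files))

    def wake(self, snapshot: Snapshot) -> Sandbox:
        """Resume a saved session without re-running setup."""
        return Sandbox(files=dict(snapshot.files), running=True)


def install_deps(sandbox: Sandbox) -> None:
    # Hypothetical setup step standing in for dependency installation.
    sandbox.files["node_modules/.installed"] = "yes"


lifecycle = SandboxLifecycle([install_deps])
first = lifecycle.start()
first.files["src/fix.py"] = "patched"   # the agent edits code during its session
saved = lifecycle.sleep(first)
resumed = lifecycle.wake(saved)
second = lifecycle.start()              # a new task starts from the clean base
print(lifecycle.setup_runs)             # 1
print(resumed.files["src/fix.py"])      # patched
print("src/fix.py" in second.files)     # False
```

In a real system the slow part is the VM boot and restore itself, which is what the in-memory copy here glosses over; the sketch only captures the bookkeeping of which state a session resumes from.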
But aside from the product (and I want to go into the agents and the tuning of the intelligence later), a big part of what we do at Cognition is to make sure that your company learns, uses, and adopts these coding agents. Especially at the largest enterprises in the world, you find a lot of people who want to move over to using AI for their day-to-day workloads. But because of the way projects are planned, and because not everyone is literate in using AI in these ways, having a team of engineers who can actually go in and onboard you, and set up all the integrations and automations you need to really get to that level of leverage with AI, is super helpful. So we do that. We serve as thought partners to the customers we work with as well.
Swyx [00:09:56]: So let's talk about architectural stuff. That was the topic of conversation between the two of you. Is this the mental model you want to start with, or something else? I'll just leave the floor open to you guys.
Agent Architecture: Harness in the Box vs. Out of the Box
Cole [00:10:11]: Maybe we can start here, with a general view of what the pieces of a background agent system are. And then we can go into some of the nuances of the decisions you can make.
Swyx [00:10:22]: But I guess what Walden is saying is that the agent is in this OpenCode box, I guess. Right? This is infra, and then that's the agent. And you had this discussion about whether you put the agent in here or externally. Can you tease that out?
Cole [00:10:39]: In background agent systems, you have a decision to make about where the agent is actually going to run. This is typically described as harness in the box or out of the box. By running the agent in the box, you're making some trade-offs. The negative trade-off is primarily security.
Because the agent is running in that box, unless you design it otherwise, all of your secrets need to go into that box as well. And given the nature of AI, it can be unpredictable, and you could very easily end up accidentally exfiltrating your secrets, or see other unintended behavior. Now, out of the box is the idea that the agent is not running directly in the sandbox: the, quote-unquote, brain of the agent runs in some type of worker or control plane. The sandbox then serves as the hands, where the brain is operating and making tool calls into that environment to manipulate it. The other trade-off you're making between the two systems is that, in my opinion, running it out of the box is much more complex, because you have state that has to be managed, whereas if you're running it in the box, all of the agent's state is in the box. Yes, you could persist it elsewhere, but it's all localized and you have fewer concerns to worry about.
Walden [00:12:08]: A lot of what you mentioned is why, from the start, we built Devin to what we called separate the brain from the machine. The other thing this allows you to do is reuse any existing infrastructure you have for dev boxes, perhaps. So you don't have to worry as much about making a new type of dev box that has all the dependencies the brain needs and, as you mentioned, the secrets the brain needs as well. One thing we've seen some customers run into: you have a GitHub app, and you want Devin, your agent, whatever, to be able to interact with GitHub through this application, but you have different users with different actual permissions. If they are all interacting through the same GitHub app and there's no actual separation between the system that decides what it does and the actual secrets on the machine, then you run into an issue where it's hard to do the separation.
But in practice, with Devin, it's much easier, because we just say that whatever you put on the machine is the scope of what the user, and the agent, is free to do. So only put the most scoped secrets on that machine, and the brain is not accessible from the machine at all. You don't have to worry about anyone messing with the most secure parts of the brain, even if the user is free to do whatever they want with the machine.
Swyx [00:13:31]: I was going to bring up this chart from OpenAI, where I don't know if this is in the box or out of the box. That is something they do use to describe it. And then recently Anthropic did managed agents...
Swyx [00:13:44]: Which is their thing. I don't know. It's all variations of the same pattern, right?
Cole [00:13:49]: So this would be out of the box.
Swyx [00:13:51]: Which is preferable for them because it's less work?
Cole [00:13:56]: I would say it's more work.
Swyx [00:13:58]: It's more work?
Cole [00:13:58]: But in my opinion, it is the better architecture of the two. You're just taking on a bit of complexity by doing that.
Repo Setup, Docker, and VM-Based Development Environments
Walden [00:14:07]: One thing I've not seen a lot of other players do well is managing what's actually on the box. This can be complex for many reasons. Let's say you have a big repository that's changing and updating a lot, with changing dependencies. How do you make sure that the agent's working environment actually stays up to date and has all the credentials it needs to, let's say, run the app and test it, and all the things you want your autonomous...
Swyx [00:14:34]: So, repo setup.
Walden [00:14:35]: Exactly. Internally at Cognition, we call this repo setup.
Cole [00:14:39]: The hardest part of...
Walden [00:14:40]: It's been a perennial problem since the start of the company: how do we help people get this set up?
Because not everyone has working cloud environments out of the box. Do you find this to be a common problem with...
Swyx [00:14:53]: How do you solve it?
Walden [00:14:53]: ...your clients?
Cole [00:14:54]: This is a very common problem, and through my consulting, this is a lot of what I help teams with. A lot of teams don't really have great developer environment setups, if any. A lot of the time it's, “Go talk to Bob and get the secrets,” and that obviously doesn't work when the agent needs to set this up. Most teams are using Docker Compose or some type of microservices. And so for the...
Swyx [00:15:19]: Even in prod?
Cole [00:15:20]: Not in prod. With OpenInspect, you are using this primarily to interact and make code changes. There are other use cases: whether through CLIs, MCPs, or other tools, you can hook it into your production systems, primarily for SRE-type use cases. But you are not necessarily trying to test your prod internal microservice through the system.
Walden [00:15:48]: You mentioned Docker Compose. One direction we saw some of our friends take early on was using Docker containers as the level of abstraction for their models. There are lots of reasons, I think, why Docker containers are not great. For one, a Docker container is not really a true security boundary. The other is that if you are running real applications, a lot of the time those applications use Docker, and then you have to think about Docker in Docker, which is really weird. So part of the really hard challenge of getting VMs to work, and why we did it, was that we realized you actually needed full VMs to be able to do these types of things. Especially nowadays, when there's real value in running the application, clicking around, and sending you screen recordings. The value just keeps adding on top of that.
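The harness-in-the-box versus out-of-the-box trade-off Cole lays out, together with Walden's advice to put only the most scoped secrets on the machine, can be illustrated with a toy model. The Python below is a minimal sketch, not Devin's or OpenInspect's code: `Sandbox`, `Brain`, `out_of_the_box`, `in_the_box`, the simulated `printenv` tool, and the token strings are all hypothetical, and no real VM, model, or GitHub API is involved. It shows only the security property discussed above: when the brain runs outside the machine, anything executed on the machine can see the user-scoped secrets but not the control plane's credentials.

```python
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """The 'hands': a machine whose environment holds only the secrets placed on it."""
    env: dict
    log: list = field(default_factory=list)

    def run(self, command: str) -> str:
        # Hypothetical tool endpoint; only a toy 'printenv' is simulated here.
        self.log.append(command)
        if command == "printenv":
            return "\n".join(f"{k}={v}" for k, v in sorted(self.env.items()))
        return f"ran: {command}"


class Brain:
    """The 'brain': runs in a control plane and keeps its own credentials off the machine."""

    def __init__(self, model_api_key: str):
        self._model_api_key = model_api_key  # used only to call the model, never sent to the sandbox

    def act(self, sandbox: Sandbox, command: str) -> str:
        # A real agent would ask the model which tool call to make; here it is passed in.
        return sandbox.run(command)


def out_of_the_box(model_api_key: str, user_secrets: dict):
    """Brain and machine are separate; the machine receives only user-scoped secrets."""
    return Brain(model_api_key), Sandbox(env=dict(user_secrets))


def in_the_box(model_api_key: str, user_secrets: dict) -> Sandbox:
    """The harness runs on the machine, so its model credentials must live there too."""
    return Sandbox(env={"MODEL_API_KEY": model_api_key, **user_secrets})


user_secrets = {"GITHUB_TOKEN": "ghs_user_scoped"}

brain, hands = out_of_the_box("sk-control-plane-only", user_secrets)
separated = brain.act(hands, "printenv")

box = in_the_box("sk-control-plane-only", user_secrets)
colocated = box.run("printenv")  # anything executed on the box can read the key

print(separated)                             # GITHUB_TOKEN=ghs_user_scoped
print("sk-control-plane-only" in colocated)  # True
```

The complexity Cole mentions shows up here too: in the out-of-the-box version the `Brain` object lives outside the sandbox, so a real system would have to persist and manage its state separately rather than keeping everything on one machine.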
But it is a decision I see people run into when they try to build their own systems: "Do we put the agent in the machine or out of the machine? Do we use Docker? Do we use something else?" What do you recommend to people nowadays?

Cole [00:16:57]: I think Docker is a good solution, maybe not for running the agent, but for running your infrastructure, because that is more or less the same setup your engineers are probably already using. If they're not, then I don't know what they're using. But they're probably already using Docker Compose.

Swyx [00:17:14]: I've always held a small candle for WebContainers. I don't know if you guys have tried them.

Swyx [00:17:19]: To me, they were supposed to be like Docker Lite.

Cole [00:17:22]: Is it?

Swyx [00:17:22]: I don't know.

Cole [00:17:22]: No, I haven't tried it. But I think any environment you've set up that gives your developers a good experience naturally lends itself to being easy to set up for the agent. Once you figure out that local developer story, you've more or less solved environment setup for the agent in a sandbox. OpenInspect also has hooks: you can run a setup .sh script that pre-installs everything, and then pre-snapshot that build so it starts instantly. There is a second hook that restores the state of the sandbox when it comes back. So you can already have all of those microservices running and get basically the same experience in the sandbox that you would on your machine.

Testing Agents: Computer Use, Screenshots, and Real App Workflows

Walden [00:18:08]: Another thing we've been thinking a lot about is different VM offerings. Have you had customers who needed macOS-specific VMs, or Windows-specific…

Walden [00:18:20]: …VMs?

Walden [00:18:22]: There are many technologies in the world that only work on specific types of machines, right?
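The setup, snapshot, and restore lifecycle Cole describes can be pictured with a small in-memory model. This is a minimal sketch under stated assumptions, not OpenInspect's actual hook API: `Sandbox`, `SandboxImage`, `setup_hook`, and `restore_hook` are hypothetical names, and a dict stands in for the VM's disk and running services.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Sandbox:
    # Hypothetical stand-in: a dict models the VM's disk and running services.
    state: dict = field(default_factory=dict)


@dataclass
class SandboxImage:
    # Hypothetical hooks; in a real system these would be shell scripts run in the VM.
    setup_hook: Callable[[Sandbox], None]    # slow path: install dependencies once
    restore_hook: Callable[[Sandbox], None]  # fast path: restart services per session
    _snapshot: Optional[dict] = None

    def build(self) -> None:
        """Run the setup hook once and freeze the result as a snapshot."""
        sandbox = Sandbox()
        self.setup_hook(sandbox)
        self._snapshot = dict(sandbox.state)

    def start_session(self) -> Sandbox:
        """Start from the snapshot instead of reinstalling, then run the restore hook."""
        if self._snapshot is None:
            raise RuntimeError("build() must run before sessions can start")
        sandbox = Sandbox(state=dict(self._snapshot))
        self.restore_hook(sandbox)
        return sandbox


if __name__ == "__main__":
    def install(sb: Sandbox) -> None:
        sb.state["deps_installed"] = True

    def start_services(sb: Sandbox) -> None:
        sb.state["services"] = ["api", "db", "worker"]

    image = SandboxImage(install, start_services)
    image.build()
    print(image.start_session().state)
```

Keeping the slow setup step separate from the per-session restore step is what lets each new session start from a warm snapshot instead of reinstalling everything.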
If you're building a .NET application that has to run on Windows, or, maybe more commonly, if you want to build for iOS or macOS, does that work?

Swyx [00:18:32]: Does Cognition support…

Swyx [00:18:33]: …choices like that?

Walden [00:18:35]: Because of the separation in our fundamental architecture, it does support that, but the actual work is in progress right now. Another thing we've recently added support for, in beta, is Android development. To do that, we needed to support nested virtualization within our machines, because the VM itself is a virtualized Firecracker instance, and then you have to run an Android emulator inside it. There are weird performance issues with that, which is why it's still in beta. We have to think through these problems, but it unlocks a lot for anyone who wants to do Android development.

Swyx [00:19:13]: I was trying to find a reference video for the testing feature. I couldn't find it, but I think you worked on the testing capability. Why call it testing and not computer use? What's the general category of problem?

Walden [00:19:26]: I think that when people think about the ability of an AI to run your app and test it, they over-index on the computer-use part, because computer use, in my mind, is the literal part: what button do you want to click, and can you emit the right coordinates to click it? I think testing is actually a really interesting…

Walden [00:19:48]: …problem-solving challenge for these AIs. If you wanted to do arbitrary testing, imagine you make a change that spans the frontend and the backend, and maybe some other, even more deeply nested service. To actually test that change, we have to reason through how you first run these applications so they work with each other, each at the right version of the code.
Then: okay, how do I trigger the feature, or make the thing actually happen? This can get arbitrarily hard. Maybe you have to be an admin. Maybe a certain thing has to be feature-flagged on. Maybe you have to run two sessions and then send a very specific word into one of them to trigger a specific behavior. Figuring out how to do that requires a lot of codebase context and a lot of orchestration that we've specifically built. And in some cases, we found that no single frontier model can do this full end-to-end task by itself.

Walden [00:20:42]: We've seen cases where we actually had to orchestrate different frontier models together to solve this problem. That is where we spend most of our time when we think about testing, not so much the computer-use part. Computer use, for what it's worth, has gotten a lot better with recent models, and that has certainly made that part of the job easier.

Swyx [00:20:58]: Especially with 4.7, which they released yesterday; apparently it's way better at the vision stuff, which is going to encompass computer use.

Walden [00:21:08]: Having evals for all of this also takes a while to build up, and getting the evals right is tricky as well. Do you ever see clients who are building their own agents have to start standing up evals to make sure things don't regress?

Cole [00:21:25]: Not so much evals in the traditional sense, but for the testing part, support has just gone in. I just added support for screenshots, and in theory you can also do video; I need to put in a plugin for that. But screenshots do show up natively, and it was a very heavily requested feature, especially after Cursor's recording feature came out. I think that was very enlightening for everyone: "Oh, this is a very good feature to have." I think with Devin you guys have had this for a while.

Swyx [00:21:57]: Oh, yeah.
See how screenshots work. Yeah, I don't know if there's anything super non-obvious. Once you know what feature to build, you can just prompt it, and it will mostly work.

Cole [00:22:09]: I think, to Walden's point, though, computer use is a subset of the larger testing problem, and that problem is very specific to the codebase you're working in; it's not something you can just solve out of the box. You need the codebase context to know how to test it. In the case of a background agent system, you fortunately do have the codebase locally, so you know what is changing and can inspect it and use that to drive the model.

Swyx [00:22:40]: For those who haven't seen it before, this is an example of how it works. After the PR is done, you click "testing approved," and then it sends you back a video. What I really like is that it labels what it's testing (it's very small here), and then you actually see the cursor and everything. So the engineering in this, show whatever you want to show. Because this is one of those few AGI moments, right? Once I look at this, I wish I could just merge inside of Slack instead of going to GitHub, because I don't need to see the code. I know it works.

Walden [00:23:19]: Maybe a new feature in Cursor. Yeah, the annotations at the bottom also made a big difference for me when I added them.

Swyx [00:23:27]: It's just: what am I looking at? What are you trying to demonstrate?

Walden [00:23:30]: Exactly. There's a surprisingly long tail of small details that ends up making a big difference to the end metric of how fast you actually merge the code. One experience we spent a lot of time tuning early on was the right GitHub experience for these tools.
Because with most tools out there, I think, when you build the agent, you think, "Oh, it'll create the PR for you." We tried to take that a step further: "What if you could interact with Devin directly on GitHub?" So we made sure that you can comment on GitHub, and Devin will receive those comments and address them. But there's quite a bit of tuning you have to do here, because, you can imagine… We recently launched Devin Review, for example. Devin Review will post comments on its own PR, and then Devin has to go…

GitHub Workflows: Devin Review, Comments, and PR Automation

Swyx [00:24:23]: He answers his own comments, which is really loopy. So, yeah, I like that it updates here to say that I have commented. But usually it's just me saying, "Hey, merged, fix any merge conflicts."

Walden [00:24:37]: So when Devin fixes its own comments, you might be scared that it'll infinite-loop. But we've put a lot of work into making sure it doesn't, both by making sure the comments are high-signal and by making the agent thoughtful about which comments it immediately tries to fix and which ones it answers with, "Wait a second, I think you're wrong." Actually, one of my favorite moments is when Devin tells me I'm wrong when I try to get it to do something different. But tuning that behavior makes a big difference in how useful the GitHub experience actually is.

Cole [00:25:06]: To touch on that as well, I think having the AI reviewer integrated into the system is a critical part of a background agent system. OpenInspect does have that: a GitHub code reviewer whose prompt you can control. It does post comments as well. It doesn't address them automatically yet. The capability is there, but it's not fully used.

Swyx [00:25:27]: So you have to ask for it?

Cole [00:25:28]: You do, yeah.
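One way to picture the loop guard Walden describes (fix high-signal comments, answer or push back on the rest, and never review and fix forever) is the sketch below. The comment categories, the triage policy, and the round cap are illustrative assumptions, not how Devin Review or OpenInspect actually classify comments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ReviewComment:
    author: str
    body: str
    kind: str  # hypothetical labels: "bug", "question", or "nit"


def triage(comments: List[ReviewComment]) -> Dict[str, List[ReviewComment]]:
    """Assumed policy: fix bugs, reply to questions (possibly pushing back), skip nits."""
    plan: Dict[str, List[ReviewComment]] = {"fix": [], "reply": [], "skip": []}
    for comment in comments:
        if comment.kind == "bug":
            plan["fix"].append(comment)
        elif comment.kind == "question":
            plan["reply"].append(comment)
        else:
            plan["skip"].append(comment)
    return plan


def run_review_loop(review: Callable[[int], List[ReviewComment]],
                    apply_fixes: Callable[[List[ReviewComment]], None],
                    max_rounds: int = 3) -> int:
    """Alternate review and fix until nothing actionable remains or the cap is hit.

    Returns the number of fix rounds performed.
    """
    for round_no in range(max_rounds):
        plan = triage(review(round_no))
        if not plan["fix"]:
            return round_no  # converged: only replies and nits are left
        apply_fixes(plan["fix"])
    return max_rounds  # cap reached: hand the PR back to a human
```

The two guards are independent: triage keeps low-signal comments from generating new work, and the round cap bounds the worst case even if the reviewer keeps finding "bugs".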
You can tag it on GitHub by whatever you named your GitHub bot, and it will follow up. If you have merge conflicts, or whatever you've asked it to resolve, it will resolve them, but it doesn't do that automatically yet.

Integrations: Slack, MCP, and First-Party Agent Interfaces

Walden [00:25:42]: Well, I'm curious: what is the most common thing people end up requesting, that they still need on top of OpenInspect, when you help them implement it?

Cole [00:25:52]: I think a lot of it comes down to actually integrating it into the company. It's one thing to have the background agent system set up, but if it isn't integrated into your larger ecosystem, it isn't that useful. It is useful to be able to kick off sessions, but what we really want is to hook it into all of our other systems, whether that is the production database with read-only credentials, the logs, Confluence, or an internal knowledge-base system. That is where I see the huge leap for companies, and it can also be a challenge for companies that aren't familiar with how to approach it, especially in environments with more compliance requirements, where access control can be a big deal. Thinking deliberately about these problems is, I find, one of the challenges that comes with a system like this.

Walden [00:26:46]: Here's what we found. With MCPs, there has obviously been a really big explosion of "oh, you can integrate with all these different things." But to get the integration and the experience right, we often found that we had to build our own ad hoc things. Slack is a great example. You could give your agent a Slack MCP, and okay, it can post messages back to you on Slack. But we actually use Devin like a coworker in Slack, and that's how it's been built from the ground up.
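Treating an agent as a Slack coworker means reacting to inbound events, not only posting through a tool. The sketch below assumes payloads shaped like Slack's Events API envelope (an `event_id` plus an `event` carrying `channel`, `ts`, optional `thread_ts`, and `text`). It drops redelivered events, always answers in-thread, and caps replies per thread so the agent does not flood a conversation. `reply` is a hypothetical callback that a real bot would implement with `chat.postMessage`; the per-thread cap is an assumed policy.

```python
from typing import Callable, Dict, Set, Tuple

ReplyFn = Callable[[str, str, str], None]  # (channel, thread_ts, text)


class SlackThreadResponder:
    def __init__(self, reply: ReplyFn, max_replies_per_thread: int = 5) -> None:
        self.reply = reply
        self.max_replies = max_replies_per_thread
        self.seen_event_ids: Set[str] = set()
        self.replies_per_thread: Dict[Tuple[str, str], int] = {}

    def handle(self, payload: dict) -> bool:
        """Process one event callback; return True if the agent replied."""
        event_id = payload["event_id"]
        if event_id in self.seen_event_ids:
            return False  # Slack may redeliver an event; handle each one once
        self.seen_event_ids.add(event_id)

        event = payload["event"]
        thread_ts = event.get("thread_ts", event["ts"])  # reply in-thread, never top level
        key = (event["channel"], thread_ts)
        count = self.replies_per_thread.get(key, 0)
        if count >= self.max_replies:
            return False  # stay quiet instead of flooding the thread
        self.replies_per_thread[key] = count + 1
        self.reply(event["channel"], thread_ts, f"On it: {event['text']}")
        return True
```

A production handler would also verify the request signature and acknowledge quickly before doing slow agent work; those parts are omitted here.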
But to do that, you need to support webhooks that come back, right? Then Devin has to respond in a natural way and, hopefully, not spam your threads too much and annoy people in your company. So you've got to tune that experience just right. Especially when there are a lot of back-and-forths, we find that we have to go beyond simple MCP integrations in these places.

Swyx [00:27:39]: I just pulled up the MCP marketplace. I know this is a fair amount of work. Is the answer to eventually take first-party control of all the top MCPs? Is that the…

Walden [00:27:48]: I would love a world where you could have something more expressive than MCP, something that goes both ways: not just a set of tools, but a proper system that interacts back and lets the agent have the right experience with all these interfaces.

Swyx [00:28:03]: So there actually is sampling in the MCP spec, but nobody uses it, right?

Walden [00:28:07]: And I think that's the other part: we found that when the MCP spec gets too complicated, it starts to lose its original promise of being a simple one-step connect. Then we have to figure out how to support all these different variations, and in a lot of cases it starts to look a lot like just building the first-party integrations.

Cole [00:28:29]: I think it also matters how critical it is to your company, right? If nearly every session goes through it, it probably makes sense to own it so that you can make optimizations on top of it, versus using whatever is off the shelf.

Swyx [00:28:43]: Awesome. Other than MCPs, what else? Sorry, I don't know if that's narrowing in too much on integrations. But what else?
What other elements of building OpenInspect or Devin do you guys really sink time into?

Memory and Knowledge: What Agents Should Remember

Cole [00:28:59]: I think a problem that comes up very frequently is this idea of memories, or a knowledge base.

Swyx [00:29:05]: Oh, boy. How do you solve it?

Cole [00:29:08]: Not solved yet, is the short answer.

Cole [00:29:11]: There's an open issue for it, someone asking about it.

Swyx [00:29:16]: DeepWiki hasn't indexed anything about memory yet.

Cole [00:29:20]: How I'm seeing it solved across my clients is primarily through skills. I find that skills, or updating CLAUDE.md, can fill that gap, but I think memory as a whole is a pretty unsolved problem, and that's why I've been hesitant to add it. There are parts of memory that can be addressed, but as a whole it's a very difficult retrieval problem.

Swyx [00:29:44]: Oh my God. Ramp didn't write anything about memory? I see zero search results.

Walden [00:29:50]: No. Memory can be quite tricky to get right, because both the retrieval and the generation of memories are hard. You don't want it to just remember very specific details.

Swyx [00:29:59]: Walk us through the Devin memory journey, because I know there's been a journey.

Walden [00:30:03]: The first version of memory that stuck around for a while was a system we call Knowledge. The idea was that we wanted it to pick things up over time without the user needing to be proactive about teaching Devin. So any time you remind Devin, "Wait, no, that's not quite the way you're supposed to use Git," we want Devin to say, "Hey, do you want me to remember this for the future?" You quickly approve or reject, and it builds up over time. Because I find that 95%, I think, or some crazy stat like that, of the memories Devin has come from these auto-generated suggestions.
Very few people want to sit down and write big docs on how you're supposed to work with the technology, et cetera. Generation and retrieval are what we've been trying to tune a lot over the years. On generation: if you asked one time, "Oh, please open this as a draft PR," you don't want it to conclude, "Oh, everyone should get their PRs as drafts forever." But you do want some of that conveyed. Maybe you want it to say, "Oh, Cole generally likes things to be created as draft PRs." Same with retrieval: if you have thousands of these memories, how do you make sure they're retrieved at the right time? That can be quite tricky to do right without exploding the context with a bunch of useless information. It takes a surprising amount of eval work to make sure memory remains a reliable system as new models come and go.

Cole [00:31:31]: Do you have anything you could share on memory pruning, and the temporal aspect of memory?

Swyx [00:31:36]: Deleting and forgetting?

Walden [00:31:39]: Today, what it can do is edit memories. So if your memory used to say, "Oh, Cole likes to open everything as a draft PR," and you say, "No, don't do that," it'll ask, "Do you want me to update the memory to say Cole now wants everything as open PRs?" At the same time, we don't know if this will be the final version of the system. Whatever we have here will probably translate into the new system we'll be coming up with. But one big difference between two years ago and today is that these agents are really good at natively using anything that resembles a file system. So part of us is thinking, "Should we rebuild memories to feel more like a file system that the agent navigates on its own?" That's been an interesting exploration.
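The generation and editing behavior Walden describes (suggest a memory after a correction, save it only once approved, scope it to a person rather than everyone, and update it instead of accumulating contradictions) can be sketched as a tiny store. `Memory`, `MemoryStore`, the `scope`/`topic` keys, and the word-overlap ranking are assumptions for illustration; a real system would use a far better retriever.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


def _words(text: str) -> set:
    """Lowercased alphabetic tokens, used as a crude relevance signal."""
    return set(re.findall(r"[a-z]+", text.lower()))


@dataclass
class Memory:
    topic: str  # hypothetical key used to find and edit an existing memory
    text: str
    scope: str  # e.g. "user:cole" or "global" (assumed convention)


class MemoryStore:
    def __init__(self) -> None:
        self.memories: Dict[Tuple[str, str], Memory] = {}
        self.pending: List[Memory] = []

    def propose(self, topic: str, text: str, scope: str) -> Memory:
        """Agent suggests a memory after a correction; nothing is saved until approved."""
        memory = Memory(topic, text, scope)
        self.pending.append(memory)
        return memory

    def approve(self, memory: Memory) -> None:
        """Save an approved memory, overwriting any older one with the same scope and topic."""
        self.pending.remove(memory)
        self.memories[(memory.scope, memory.topic)] = memory

    def reject(self, memory: Memory) -> None:
        self.pending.remove(memory)

    def retrieve(self, query: str, scope: str, k: int = 3) -> List[Memory]:
        """Return up to k visible memories that share words with the query, best first."""
        q = _words(query)
        scored = []
        for (mem_scope, _), memory in self.memories.items():
            if mem_scope not in (scope, "global"):
                continue  # another user's preference is not applied to everyone
            score = len(q & _words(memory.topic + " " + memory.text))
            if score > 0:
                scored.append((score, memory))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [memory for _, memory in scored[:k]]
```

Keying memories by scope and topic is what turns "Cole now wants open PRs" into an edit of the old draft-PR memory rather than a second, conflicting entry.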
There are also similar ideas in the skills space.

Swyx [00:32:35]: I'm pulling up OpenClaw's memory thing right now. OpenClaw has this daily memory journal, right? That is a file system you can grep through, and it's a source of truth. I don't know if it's the best. It's probably super noisy, but at least if you lose something you can rediscover it, or you can apply some forgetting algorithm to older memories that don't get recalled again. I don't know.

Walden [00:33:01]: One thing we've been trying, to push the boundaries of how you use agents at your company, is letting an agent have a very similar file, a memory.md or something, and be your permanent PM for a specific set of issues. We have some Slack channels internally, say a channel dedicated to a specific product like DeepWiki. You can imagine wanting a Devin that never stops, that's always awake, with a memory doc it maintains for itself: what are the number one priorities we have to fix? Who is responsible for upcoming work? Maybe Devin will even tag you on a recurring basis. It's been interesting to see how we can use Devin for more than just engineering. Can we go upstream of the engineering process? Maybe it's Devin creating tickets, which some humans then pick up, and maybe other Devins do too.

Swyx [00:34:00]: One of my more fun automations is: go research competitors and suggest stuff to me on a weekly basis. I can't find it right now, but basically it's, "Look at competitors and suggest things," plus "Here are three things you've suggested that I don't want any more of," and you just stick that in the prompt.
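For the forgetting algorithm Swyx mentions, one simple assumed heuristic is exponential decay since an entry was last recalled, boosted by how often it has been recalled. This is not OpenClaw's or Devin's algorithm, only a sketch; `JournalEntry`, the 30-day half-life, and the 0.25 threshold are arbitrary choices.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class JournalEntry:
    text: str
    last_recalled: int  # day number when retrieval last surfaced this entry
    recalls: int = 0


def recall(entry: JournalEntry, today: int) -> None:
    """Call when retrieval surfaces an entry: it becomes fresh and a bit stronger."""
    entry.last_recalled = today
    entry.recalls += 1


def strength(entry: JournalEntry, today: int, half_life_days: float = 30.0) -> float:
    """Assumed heuristic: (1 + recalls), halved every half_life_days since last recall."""
    age = today - entry.last_recalled
    return (1 + entry.recalls) * 0.5 ** (age / half_life_days)


def prune(entries: List[JournalEntry], today: int, threshold: float = 0.25,
          half_life_days: float = 30.0) -> Tuple[List[JournalEntry], List[JournalEntry]]:
    """Split entries into (keep, forget); forgotten entries could be archived, not deleted."""
    keep: List[JournalEntry] = []
    forget: List[JournalEntry] = []
    for entry in entries:
        target = keep if strength(entry, today, half_life_days) >= threshold else forget
        target.append(entry)
    return keep, forget
```

Because the journal stays a plain file-based source of truth, "forgetting" here can mean moving entries out of the retrieval set rather than destroying them.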
But what I wish, for example, is that when I reject a PR, it updated memory, so I don't have to go back and update the scheduled sync. But anyway, feature request.

Walden [00:34:31]: You know what? We might change it soon. For OpenInspect, in the time it's been around, has there been anything you tried to implement but then had to undo and do a different way?

OpenInspect Architecture: Webhooks, Control Planes, and Agent State

Cole [00:34:41]: Nothing yet, but something is on my mind. The way I initially built it, each integration lives in its own package. So you have the Slack bot, which handles the webhooks and then interacts with the control plane. As the system becomes more integrated, specifically with the GitHub bot integration, I'm considering bringing all of that into the central control plane. Especially now, because a request I'm getting is the ability to monitor the pull requests being merged, as well as tracking…

Swyx [00:35:19]: What do I have open?

Cole [00:35:21]: What do I have open? How many of these are getting merged? How many comments are showing up? Just to understand the health of the system. And with a GitHub app, you only have one webhook. So the question is: do I put that webhook in the GitHub bot package? That's weird; it doesn't really make sense for it to live there, because that package is more for the code reviewer. Or do I centralize it? That's a decision on my mind. The other one we touched on earlier is the harness in the box versus out of the box. I think long term the architecture will eventually move back out of the box. Some of the newer tools I've added call back into the control plane so that you don't have the secrets in the sandbox.
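Cole's two architecture questions, where the single GitHub App webhook should live and how to keep secrets out of the sandbox, both point toward a central control plane. The sketch below routes the one webhook to health metrics and to integration handlers, and runs secret-bearing tools on the control plane so the sandbox only ever sees results. The `X-GitHub-Event` header and the `pull_request` / `issue_comment` payload fields are GitHub's; the class, method, and tool names are hypothetical, and webhook signature verification is omitted.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Tuple

Handler = Callable[[dict], None]


class ControlPlane:
    """Hypothetical central control plane for a background-agent system."""

    def __init__(self, secrets: Dict[str, str]) -> None:
        self._secrets = secrets  # never shipped into the sandbox
        self._tools: Dict[str, Tuple[str, Callable[..., Any]]] = {}
        self.handlers: Dict[str, List[Handler]] = {}
        self.metrics: Counter = Counter()

    def on(self, event: str, handler: Handler) -> None:
        """Let an integration package (e.g. the code-review bot) subscribe to an event type."""
        self.handlers.setdefault(event, []).append(handler)

    def receive_github(self, event: str, payload: dict) -> None:
        """Entry point for the single GitHub App webhook; `event` is the X-GitHub-Event header."""
        action = payload.get("action")
        if event == "pull_request" and action == "opened":
            self.metrics["prs_opened"] += 1
        elif event == "pull_request" and action == "closed" and payload["pull_request"].get("merged"):
            self.metrics["prs_merged"] += 1
        elif event == "issue_comment" and "pull_request" in payload.get("issue", {}):
            # issue_comment also fires for plain issues; only PR conversations count here.
            self.metrics["pr_comments"] += 1
        for handler in self.handlers.get(event, []):
            handler(payload)

    def register_tool(self, name: str, secret_name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = (secret_name, fn)

    def call_tool(self, name: str, **kwargs: Any) -> Any:
        """Called from the sandbox: the secret is injected here and only the result goes back."""
        secret_name, fn = self._tools[name]
        return fn(self._secrets[secret_name], **kwargs)
```

Centralizing the webhook keeps metrics in one place while the reviewer package stays a subscriber, and the tool proxy is one way to get the "secrets never in the sandbox" property regardless of where the harness runs.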
So long term I'll probably pull the agent itself out of the box, but for now it's fine.

Subagents and Multi-Agent Systems: When Parallelism Helps or Hurts

Swyx [00:36:16]: A quick question on pulling the agent out of the box. One thing I'm very bullish on this year is agents calling other agents, or spawning sub-agents, or whatever you want to call it. Does that make it harder or easier? I can't tell. If the harness is in the box, you can just spin up more boxes. If the harness is outside the box, it's less easy, because you have a unicorn pet of a harness living outside the box.

Cole [00:36:45]: In theory it would be the same. OpenInspect, for example, can launch sub-sessions from within one agent, actually create other environments, and then monitor them. If the harness is out of the box, a sub-session is basically just an additional session that's also running outside the box, in your worker plane, wherever you're running this. Then you just have to think about how your top-level agent interacts with it. I do think it can be more complex, because you now have a more difficult architecture. But once you've figured it out, it's probably fine.

Swyx [00:37:26]: Well, then I'm throwing it open to you. I call this meta-Devin management: Devins calling Devins, Devins scheduling Devins, querying trajectories, anything like that. What have you built or shipped, if anything?

Walden [00:37:46]: I think one of the surprising things we've seen is that a lot of the ways these separate agents work with each other, when you want them to parallelize their work, have still mostly followed the same manager/sub-agent regime.
And a lot of people, I think, are excited about a world with swarms of agents talking to each other all over the place. We've actually given Devin an MCP so it can message other Devins arbitrarily, create new Devins, et cetera. But that creates a really chaotic world. So we've still found that most practical day-to-day use has been one single Devin…

Walden [00:38:33]: …figuring out how to segregate the work and have other Devins work on it in a relatively isolated way, each with their own boxes, not sharing machines, so there's very little room for conflict. That's the regime you have to create today.

Swyx [00:38:50]: I'll call out the experiments from Cursor, right? This is Wilson Lin's work on going from single-agent to multi-agent, and you're famously on the side of "don't build multi-agents." But they went through the whole thing only to arrive at this, which is exactly what Devin has, I think.

Walden [00:39:08]: I think there will be a revision to that post at some point, about…

Swyx [00:39:12]: Tell us about it.

Walden [00:39:12]: I think multi-agents were not possible at all a year ago. You do see more multi-agent experiments today, but you can argue: are they really multi-agent, or are they just tool calls? People create sub-agents to go look for file XYZ or implementation XYZ. That has really nice context-management benefits, because all the tool calls and tokens the sub-agent spends get collapsed back into just the answer for the main agent. There are a lot of benefits to doing this. We basically have Devin do this with DeepWiki: call out to DeepWiki and get back the results. But that feels like a tool call, right? It's not two collaborators actually talking back and forth with each other.
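The manager/sub-agent regime described here (isolated workers, each in its own box, whose work collapses back into a short answer for the main agent) can be sketched with a thread pool. The hypothetical `run_subagent` gets its own temporary directory as a stand-in for a separate machine, and only a one-line summary returns to the manager; a real sub-agent would run a model loop rather than writing placeholder notes.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def run_subagent(task: str) -> str:
    """Hypothetical sub-agent: works in its own scratch workspace, returns only a summary."""
    with tempfile.TemporaryDirectory() as workspace:  # stand-in for a separate box
        notes = Path(workspace) / "notes.txt"
        # Intermediate work (tool calls, tokens) stays inside the isolated workspace...
        notes.write_text(f"exploring {task}\n" * 100)
        scratch_lines = len(notes.read_text().splitlines())
    # ...and only a one-line answer is returned to the manager's context.
    return f"{task}: done ({scratch_lines} scratch lines discarded)"


def manager(tasks: List[str], max_parallel: int = 4) -> List[str]:
    """Fan tasks out to isolated sub-agents and collect their summaries in task order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_subagent, tasks))


if __name__ == "__main__":
    for line in manager(["find date helpers", "find auth middleware"]):
        print(line)
```

Workers never share a workspace and never talk to each other; all coordination goes through the manager, which is the low-conflict shape Walden describes.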
But what makes me most bullish that multi-agents might actually be possible is what I said earlier: Devin will sometimes tell me I'm wrong and push back, and I think that demonstrates a level of maturity and communication today that makes a multi-agent world possible. Can two agents that have seen different information come back to each other and figure out who is right, and what the correct implementation is? They're not just yes-men. Claude, I guess, used to just say, what is it? "You're right," or…

Swyx [00:40:25]: "You're absolutely right."

Cole [00:40:26]: "You're absolutely right." Yeah.

Swyx [00:40:28]: Have you seen… did you see…

Cole [00:40:29]: The age is over.

Swyx [00:40:30]: …the Codex app trolling Anthropic? This is the Codex app. Inside Settings there's a little Easter egg, right? If you go to Themes, or Appearance, there are all these color schemes, and the top one is "Absolutely," in Anthropic's colors. Which is such a troll. Anyway.

Model Behavior: Pushback, Adversarial Prompts, and Agent Skepticism

Cole [00:40:53]: I love that Easter egg. Did you discover it yourself?

Swyx [00:40:54]: No, someone was tweeting about it, and I was like, "Is this true?" Because sometimes people just tweet stuff to get a rise out of you. But yeah, there you go: Anthropic colors.

Cole [00:41:06]: Yeah. So we're out of the regime where it just says you're absolutely right; they can have real conversations and real back-and-forths.

Swyx [00:41:13]: You can also prompt it to be more adversarial, or whatever. Okay. To me, that is more intelligence, right? It's not just a dumb tool; it's actually pushing back on you.

Cole [00:41:24]: When you mentioned, of course, the blog posts…
There was one post they had where they set a swarm of agents loose together and built a browser.

Swyx [00:41:34]: I think that was the one.

Walden [00:41:36]: You can have, like…

Swyx [00:41:37]: I think it's the same one.

Walden [00:41:37]: …the creation of it. We found surprising success with: don't do a swarm or anything, just have one Devin that does its own context management. Let it keep running for a while and give it some crazy tasks. I think we asked it to rebuild a Windows OS, and it managed to do it just by going on for long enough.

Swyx [00:41:55]: Was this Andrew's thing?

Walden [00:41:58]: There were lots of demos we ended up not posting, because at some point we'd just be posting way too many demos. But I love that, because it shows that the multi-agent idea still has a bit of exciting sexiness to it, which maybe still exceeds the actual delta it adds to these systems' capabilities. But it's absolutely the future. I think we're heading in that direction, and we can already see progress being made there.

Swyx [00:42:25]: If I were to make one super minor pushback, because I don't feel that confident about it yet…

Cole [00:42:33]: Go for it.

Swyx [00:42:33]: I've had Ryan Lopopolo from OpenAI on the pod, and he's a super slop cannon, right? Oh my God, that's my coding agent being done. I downloaded this thing, Peon Ping. I don't know if you guys have heard of it. It takes sound packs from popular games like Command & Conquer and Warcraft and plays them whenever the agent is done. So it's like, "Work," or whatever, "At your command," or something. Anyway, what I got from the Cursor codebase and from Ryan's thing is that there's a slop-cannon approach where you try to loosen the single agent's bottleneck, and I feel that's a very important thing to figure out. I don't think anyone has really solved it.
Because then you just have more reviewer slop on top of the agent slop, trying to wrangle it all. Ryan will probably very strongly object to me saying he hasn't solved it; he thinks he's completely solved it. But I think it's very important, because that is a bottleneck, right? I feel Devin is slow sometimes. I'm like, well, yes, this is very readable and very sensible, but it's also slower than it could be. I want a button that says, "Just ramp this up 1,000x in parallel and see what happens," right? I don't know if that will be feasible at some point in the future.

Code Review, Entropy, and AI Slop

Walden [00:43:55]: We've also run experiments internally where we tried to build entire products, real products we knew we would eventually ship, but said: for now, let's see if we can do it purely by vibe coding on top of each other, with auto-merge and no code review at all. And there's this benchmark of how many weeks you can keep that up before you say, "We have the trashiest codebase…

Walden [00:44:18]: …let's actually rewrite it from scratch."

Swyx [00:44:19]: Start a new factory, yeah. What did you find?

Walden [00:44:21]: I think we found that the state of the art in December was that you could run this for about two weeks. By the end of those two weeks, you'd find that, hey, you want to change the color of a button. Well, it turns out this button is implemented in 10 different places, with all these different variations, and oh, you forgot one of them, and it's a slightly different color in one spot. And you're like, "Okay, this is too much to work with.
Let's actually try to do code review at the same time," and make sure that we're on top of our software, actually cleaning it up a bit and making sure it's done in a scalable way.

Cole [00:44:54]: Building on that, I think the idea that you don't have to look at code is generally a bad idea. And the meme I have for that…

Walden [00:45:03]: On what timeline do you think that statement will stay true?

Cole [00:45:06]: I think it'll be true for a while that you should keep looking at your code. A problem I see a lot of the teams I work with run into, as they embrace AI-native, AI-first coding, is captured by my meme: your codebase regresses to your worst engineer. The engineer who is very gung-ho about AI and isn't auditing their code starts cementing their patterns into the codebase, and now the AI references those patterns. So their if/else block, with 20 if/elses going back and forth, becomes, to the AI, the pattern for how things are done, and the slop starts to grow exponentially. To your point, I find a pretty good approach is scheduled cleanup, whether by humans or by systems, that looks for duplication and then addresses it. You'll end up with something like 12 helpers for formatting a date, and you need to address that, because otherwise it will keep sprawling.

Swyx [00:46:09]: Within balance, I think it's fine to have some duplication, and then to garbage-collect sometimes, right? What I've been talking about with a lot of engineering leaders is that you want to be very strict about the boundaries between modules, and it's your job as an architect, or a CTO, whatever, to say, "Okay, here's the hard contract between your team and that team. Whatever you do inside this black box is your business. But between these teams, let's be really damn clear, and any change must be signed off by a human, or by me."
And that's that. I don't know if you have any other modifications or advice.
Walden [00:46:44]: Well, generally, on the topic of where humans can be useful, I've found that for some of these really deep infra problems, just having a human with really deep expertise can make a big difference. I've actually seen this come into play when building agents. We've had a few friends try building their own coding agents, and one problem I repeatedly heard a lot of them run into was, “Oh, grep is really slow on our agents' machines.” And a lot of them, I assume because they're using AI and don't themselves have super deep infra background knowledge, say, “Okay, we're going to go build our own custom grep index. It's going to be really fast,” and use that as a way around the problem. When we ran into this problem, maybe a year and a half ago, in the early days of building Devin, we obviously didn't have AI then. We just asked our... how to do this. You can just swap out a new grep index, so.
Infrastructure Details: Grep, File Systems, and Sandboxes
Swyx [00:47:45]: What do you mean you hand-coded Devin? What?
Walden [00:47:48]: Can you believe we hand-wrote this code? We had our infra people, who are really amazing, looking into it, and they said, “Oh, we realized the root cause of this problem is actually super simple, but a fine-grained detail”: a lot of these virtual machines don't use real file systems underneath. They use network file systems where files are actually cached over the network, in S3. So when you're grepping, you're actually making network calls every time, and that's why grep is extremely slow on these machines.
So again, it goes back to all of the crazy infra work we had to do to actually get these machines working. If you try to do this yourself, there are tons of small details like this, and we eventually had to go swap out that network file system. But...
Swyx [00:48:35]: I think there's a write-up about it, right? Silas did one about the virtual file system.
Walden [00:48:38]: Oh, that was a whole other thing. The...
Swyx [00:48:39]: Oh, that's a different thing.
Walden [00:48:40]: The BlockDev file storage format...
Swyx [00:48:42]: I'll bring it up.
Walden [00:48:42]: ...which is a file system format we built so that the VMs could be spun up and down very quickly. The intuition behind it is this: imagine you have a terabyte of disk, and your agent only wrote a hundred lines of code on top of that disk. How long does it take to save and re-bring up that disk? In most systems, because they're not optimized for this case, it's on the order of a terabyte of work, because you have to save all of that and bring it back up. In our system, we built a file system where each save incrementally builds on top of the previous one. So every time you save and bring the machine back up, you're only doing work proportional to, effectively, the diff in the file system. That shaves a lot of time off Devin's boot-up process. This is actually now outdated; we have a newer system inside of Devin. But yeah, there are a lot of tiny details you have to get right to make the day-to-day experience of Devin good.
Swyx [00:49:39]: It's not technically agents, but it is agent infra, and when you sell an agent as a company, you sell agent plus agent infra.
Walden [00:49:46]: At least the way we do it. And the nice thing about doing the agent infra together is that we get to deploy Devin in whatever environment we want now.
We don't need to wait for some underlying infra provider to also support VPC or on-prem or FedGovCloud, for instance. Since we own the infrastructure, we can go and figure out how to get that set up for you.
Cloud Providers: Modal, Daytona, and Enterprise Sandboxes
Swyx [00:50:12]: Whereas you're Cloudflare-dependent.
Cole [00:50:15]: So, Cloudflare runs the control plane. For the sandboxes, Modal is supported, a contributor just added Daytona, and E2B is on the roadmap. I think there's an abstraction in place so that any contributor who wants to add a new provider can add it.
Walden [00:50:32]: How about the customers you work with? Do they generally go set up a contract with one of these third-party providers, or do they try to run the VMs in-house?
Cole [00:50:44]: Most of them I see using Modal. I think Modal has a great...
Walden [00:50:48]: Shout out Modal.
Swyx [00:50:48]: Shout out Modal.
Cole [00:50:50]: I think Modal has a great offering. It captures all of the sandbox pieces you need, snapshots being a pretty big piece of that, and given that they also offer GPUs, it's a pretty nice offering as a whole.
Swyx [00:51:04]: No debate there.
Walden [00:51:07]: Modal is great. I think their container offering is the most natural, so especially if you're willing to forgo the full VM requirements, Modal is a really vast place you can spin something up on.
Swyx [00:51:20]: Is there a point... So Modal's very Python, and I feel like most workloads have really shifted to JavaScript. I don't know if you guys get the same feeling. When I started Latent Space and AIE and all these things, I was roughly 50/50 Python and JS, right? I think that's wrong now. I think JS has won. Maybe I'm overstating it, and maybe for Cognition there's C# and Java and what have you. But for new greenfield apps, do you get that sense?
Does it matter?
Cole [00:51:52]: I think most of the libraries I see in this space are Python-native first, especially in the observability space. That said, there's a pretty big appeal to having your entire system in one language. Especially when your frontend and backend are communicating, you can have one central type, which is very nice.
Swyx [00:52:11]: That's my case against Modal: then you have to run JS. You can run JS inside Modal; it's just one extra step that isn't native to the runtime. I don't know if...
Walden [00:52:22]: I don't know.
Swyx [00:52:23]: Reviews. Do you have numbers? I don't know.
Walden [00:52:25]: The one thing I don't like about Python is that whenever AI writes Python, it always does the weirdest patterns, and...
Swyx [00:52:32]: Oh, because it's mixing 2 and 3, or what?
Walden [00:52:34]: I think it's something like mixing 2 and 3, yeah. I don't know if you see this, but it always tries to do hasattr on objects, like...
Cole [00:52:41]: Oh, my God.
Walden [00:52:41]: But you shouldn't be doing that. It should error if there was...
Swyx [00:52:45]: Because it's training on library code?
Cole [00:52:47]: From what I've seen, it's more of a reward-hacking mechanism where it basically doesn't want...
Walden [00:52:54]: It'll never error.
Cole [00:52:54]: It doesn't want the code to fail. So even when it knows the object has the attribute, it'll call getattr on it, and for a lot of my clients who have moved toward more autonomous coding, we've put in a lint rule that if you use getattr, your pull request is going to fail.
Slop Signatures: Comments, Backwards Compatibility, and Types
Swyx [00:53:12]: Ooh, this is a fun topic. Can you tell me more? What else is a sign of AI coding that you have to put guards in for?
Walden [00:53:21]: So we were talking just before this about Opus 4.7.
One of the things this new model likes to do is write lots of comments. It won't comment every line, but it'll write paragraph-long PRDs on top of every function. But I will say, to its credit, these aren't slop descriptions like they were before, “Oh, here's what this function does.” It's like, “Oh, here's actually the r
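Cole's lint rule can be sketched as a small AST check. This is a minimal illustration, not the rule his clients actually use. It assumes a Python code base and uses only the standard-library `ast` module. The names `BANNED_CALLS`, `find_defensive_attr_calls`, and `main` are hypothetical, and flagging `hasattr` alongside `getattr` is an assumption drawn from Walden's remark.

```python
import ast

# Builtins this sketch treats as "defensive" attribute access. getattr is the
# call named in the episode; including hasattr is an assumption.
BANNED_CALLS = {"getattr", "hasattr"}


def find_defensive_attr_calls(source: str, filename: str = "<string>") -> list[tuple[int, str]]:
    """Return sorted (line, name) pairs for every direct call to a banned builtin."""
    tree = ast.parse(source, filename=filename)
    hits = []
    for node in ast.walk(tree):
        # Only bare-name calls like getattr(...) match; obj.getattr(...) is an
        # attribute call and is ignored.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in BANNED_CALLS
        ):
            hits.append((node.lineno, node.func.id))
    # ast.walk is breadth-first, so sort to report in source order.
    return sorted(hits)


def main(paths: list[str]) -> int:
    """Lint each file and return 1 if any banned call was found, else 0."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            source = fh.read()
        for line, name in find_defensive_attr_calls(source, path):
            print(f"{path}:{line}: banned call to {name}()")
            failed = True
    return 1 if failed else 0
```

A CI step could call `sys.exit(main(changed_files))` so that a nonzero exit fails the pull request. As written, the check flags every call, including legitimate dynamic lookups with a computed attribute name. A narrower rule might flag only calls whose attribute argument is a string literal.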

god amazon ai business pr work secrets walk research brain local single security generation decisions diy memory os hiring choices android computers honestly consulting ios windows surprising cto command gemini slack openai conquer appearance themes shopify leak sh harness pms gpt paradigm python separating daytona github java warcraft demos notion db settings stripe dev vm anthropic screenshots conductor javascript opus macos cognition versus agi ramp walden cpo xyz s3 ide codex cloudflare entropy prs docker git js gpus internally narrowing sonnets continual deleting repo confluence sentry mcp cursor sre firecrackers modal cli vms observability datadog postgres async backwards compatibility windsurf all hands supersets ec2 7x grep mcps cfps code reviews vpc flue windows os devins clis steve yegge vpcs little snitch semgrep

Inside Google I/O with a DeepMind Exec

Where It Happens

May 22, 2026 · 25:42

I sit down with Logan Kilpatrick from the Google DeepMind team, live at Google I/O, to unpack everything Google just announced and what it means for founders and builders. We cover Gemini 3.5 Flash, the new Gemini Omni world model, the expanded Antigravity ecosystem, managed agents in the Gemini API, and the native Android app builder inside AI Studio. Logan shares how distillation keeps pushing Pro-level intelligence into Flash, where the real opportunities sit for solo founders, and why the agentic era has finally crossed the chasm from demo to useful. If you have an idea and want to ship something this week, this episode maps the toolkit.

Timestamps
00:00 – Intro
00:53 – Gemini 3.5 Flash: The New Workhorse Model
01:49 – How Flash 3.5 Stacks Up Against Sonnet
02:38 – Gemini Omni: A World Model for Any Input and Output
06:18 – Building a Content and Creator Layer on Omni
08:21 – What to look forward to
10:53 – Google Spark and Managed Agents
14:00 – The Agentic Era and Requests for Startups
17:17 – The Antigravity Ecosystem Overhaul
18:51 – AI Studio vs. Antigravity: Vibe Coding vs. Agentic Engineering
21:31 – Native Android Apps Built Inside AI Studio
23:44 – Closing Thoughts

Key Points
Gemini 3.5 Flash ships as a Sonnet-level workhorse model tuned for long-running agentic tasks, coding, and tool use, available on day one to 900M+ Gemini app users.
Gemini Omni is a single model that takes any input and produces any output across video, image, audio, and music, fusing Veo, Nano Banana, Lyria, and TTS into one system.
Managed agents in the Gemini API let builders ship agentic products with a single API call, using skills and markdown instead of writing orchestration code.
The Antigravity suite now spans an IDE, agent manager, CLI, SDK, and API surface, all sharing the same agent harness that powers Gemini Spark.
AI Studio targets vibe coding and now builds native Android apps for free, while Antigravity targets production-quality, million-line-codebase engineering.
The cost of intelligence keeps dropping thanks to distillation, opening up smaller markets that previously needed a 40-person team and venture funding to address.

The #1 tool to find startup ideas/trends - https://www.ideabrowser.com
LCA helps Fortune 500s and fast-growing startups build their future - from Warner Music to Fortnite to Dropbox. We turn 'what if' into reality with AI, apps, and next-gen products https://latecheckout.agency/
The Vibe Marketer - Resources for people into vibe marketing/marketing with AI: https://www.thevibemarketer.com/

FIND ME ON SOCIAL
X/Twitter: https://twitter.com/gregisenberg
Instagram: https://instagram.com/gregisenberg/
LinkedIn: https://www.linkedin.com/in/gisenberg/

FIND LOGAN ON SOCIAL
X/Twitter: https://x.com/OfficialLoganK
YouTube: https://www.youtube.com/@LoganKilpatrickYT
LinkedIn: https://www.linkedin.com/in/logankilpatrick/

SaaStr 854: The Agents #005, Our AI is Hiring! Would You Work for One? And Are Autonomous Agents ... Safe?

The Official SaaStr Podcast: SaaS | Founders | Investors

May 19, 2026 · 78:58

The Agents #005, Our AI is Hiring! Would You Work for One? And Are Autonomous Agents ... Safe? Welcome to The Agents, where SaaStr's CEO and Founder, Jason Lemkin, and Chief AI Officer, Amelia LeRutte, share the latest each week on running a company with more agents than humans.

It costs $257 a month to run two AI VPs. Jason and Amelia open the books on what 10K (AI VP of Marketing) and QB (AI VP of Customer Success) actually cost to operate, and the number shocked both of them. Most of the heavy lifting is API calls to Salesforce, Bizzabo, and Marketo, which are basically free. The Postgres storage costs pennies. And 95% of the AI calls run on OpenAI Mini at less than a penny each. The fully burdened cost with Clerk, 11 Labs, and Salesforce overhead might hit $500-800/month, but the soft cost of human time dwarfs all of it.

Then 10K gets asked point blank: are you a VP of Marketing? Its answer is no, not yet. It says it replaced the bottom half of the marketing org (the analyst, the ops coordinator, the junior content marketer) and a sliver of the VP job. But it's honest about what it can't do: strategy, cross-functional politics, crisis response, hiring. Amelia points out that 10K's current job description is exactly what her job was when she started at SaaStr as Director of Demand Gen. It took her years to get to CAIO. 10K might get there faster. And SaaStr is putting its money where its mouth is: they're hiring a human marketer whose primary manager would be 10K. Not a thought experiment, a real job posting. Would you take a job reporting to an AI?

Then the safety question gets real. Amelia is talking to agents via WhisperFlow while walking around a 40-acre event site during SaaStr Annual load-in, and the production crew started asking her to relay their questions because 10K and QB answer in seconds with correct data. But when QB autonomously emailed 83 sponsors at 12:20am with fully customized check-in emails, Amelia admits she hesitated before letting it rip.
Each email was unique to the sponsor, showing exactly what they still owed, their registration codes, and outstanding tasks. The result: fewer inbound questions the next day and more sponsors using the QB chatbot directly. That's an autonomous agent acting on behalf of your company in the middle of the night. Jason and Amelia also tackle the Postgres vs. Salesforce debate that listeners keep asking about. Short answer: not happening for them. Too much history, too many third-party agents optimized around Salesforce, and they're actually consolidating more tools onto the platform, not fewer. They killed Marketo and moved to Marketing Cloud. Plus they built a newsletter auto-builder that replaced a $4K/year tool called Bee. 10K uses Sonnet to force rank articles, builds the HTML, inserts ads, and sends it. Human on the loop, not in it.

ceo director founders ai marketing safe human hiring qb salesforce labs api 10k 4k autonomous html customer success clerk caio sonnets marketo postgres demand gen saastr marketing cloud jason lemkin bizzabo saastr annual

Malcolm Guite pt. 1: Does Theology Need an Imaginative Spark to Grasp God's Mystery?

Good Faith

Apr 27, 2026 · 22:05

Imagination Combined with Reason Can Build a Sturdier Faith. Malcolm Guite invites us to recover a "baptized imagination," showing how poetry can do real theological work by carrying truth through image, beauty, sacrament, and story. Rather than replacing reason, imagination helps us perceive meaning, opening Scripture, creation, and the mystery of Christ in ways analysis alone cannot reach.

Take the Listener Survey
Sign up for The After Party
Sign up for The Good List

Mentioned In This Episode:
Malcolm Guite's Galahad in the Grail
Malcolm Guite's Parable and Paradox
William Shakespeare's Sonnet 18: Shall I compare thee to a summer's day?
George Herbert's poem The Agonie
C.S. Lewis's Bluspels and Flalansferes
C.S. Lewis on Imagination and Reason in Christian Apologetics
Samuel Taylor Coleridge's Biographia Literaria

Scriptures Referenced In This Episode:
1 Corinthians 2 (ESV)
Luke 22:19-20 (NKJV)
Luke 10:27 (NKJV)
John 1:1 (NIV)
Psalm 19:1 (KJV)

More from Malcolm Guite:
Malcolm Guite's website and blog
Malcolm Guite's YouTube channel
Malcolm Guite's books

Follow Us:
Good Faith on Instagram
Good Faith on X (formerly Twitter)
Good Faith on Facebook

The Good Faith Podcast is a production of a 501(c)(3) nonpartisan organization that does not engage in any political campaign activity to support or oppose any candidate for public office. Any views and opinions expressed by any guests on this program are solely those of the individuals and do not necessarily reflect the views or positions of Good Faith.

jesus christ mystery scripture corinthians theology parable spark imagination reason kjv grasp sonnets imaginative good faith george herbert galahad malcolm guite niv psalm good list


Best podcasts about sonnets

SONNETCAST – William Shakespeare's Sonnets Recited, Revealed, Relived

Shakespeare Sundays with Chop Bard

Unbound Sketchbook

A Voix Haute

The Daily Poem

The Grey Rooms

Shakespeare’s Sonnets

Poetry For All

Everyday AI Podcast – An AI and ChatGPT Podcast

The Persistent Rumor

Poem-a-Day

Classic Poetry Aloud

Words in the Air: 52 Weeks of Poetry

Rusty Sonnets

Shakespeare Saga

Audio Poem of the Day

Latent Space: The AI Engineer Podcast – CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

This Day in AI Podcast

Folger Shakespeare Library: Shakespeare Unlimited

The Slowdown

The History of Literature

Beyond Shakespeare

The Pendant Shakespeare audio drama anthology

Podcast Shakespeare

Dialogic

The Grimerica Show

In Our Time

TWIP! Pendant Productions audio drama news

???? More to Read

The AI Breakdown: Daily Artificial Intelligence News and Discussions

People's Guide to the Cthulhu Mythos

The Nonlinear Library

Black Clock Audio Tales: Audio Books, Science Fiction, Folklore, Gothic Literature, Classic Horror, and the Cthulhu Mythos

The Cloud Pod

The Fourth Way

PodCastle

Women of Substance Music Podcast

Procento Miloše Čermáka

Books & Writers · The Creative Process

Marketing Against The Grain

(sub)Text Literature and Film Podcast

The Bardcast: "It's Shakespeare, You Dick!"

Poetry · The Creative Process

Music & Dance · The Creative Process

Poetry Unbound

amimetobios

Free Audio-Books

Social Justice & Activism · The Creative Process

The iServalan™ Show

??????? with ???

Front Row

Shakespeare Is My Home Slice

Lenny's Podcast: Product | Growth | Career

No Holds Bard


Latest podcast episodes about sonnets

Episode #563: Primary Mind vs. Extended Mind: Aaron Neyer on Digital Brains

Keep It Going 14

The AI Glossary You Need, Cyber Insurers Shift to Speed, Big Tech Flips on AI Jobs Doom, Free AI Credits, Fable Ban Lifted, Claude Sonnet 5

1559: Florida Doll Sonnet by Denise Duhamel and Maureen Seaton

Ep 817: ChatGPT's 5.6 Sol, Grok and Meta bounce back and OpenAI's biggest week ever? And more AI News That Matters

AI News #8

This Week in European Tech: Europe's AI wake-up call

AI Distillation: How Frontier Models Teach Each Other #1870

AI and Social Media Week: 10 news stories that are changing the digital marketing game

Emergency meeting - Fable 5 for designers

S08E14 - About Mistral, Fable's return, and oberdrones

Narco-Terrorism and the Criminal Mind: What the 22nd MEU's Caribbean Campaign Reveals About Cartel Psychology, Organizational Violence, and

T6.E133. INSIDE X AI WARS ANTHROPIC STRIKES BACK! Claude Fable 5, Sonnet 5, Tag, Science... and much more.

Artificial Intelligence, your weekly summary | The world's most powerful AI is back

Anthropic Launches Claude Sonnet 5: High Performance, Lower Cost

Claude Science: AI Workbench for Scientists #1868

Ep 811: Fable 5 and Sonnet 5 Released, OpenClaw on Your iPhone, NotebookLM's New Video Format and 7 More AI Features You Need Now

GEO Kills the Listicle

EP316. Claude Sonnet 5, Meta Wants to Sell Compute Too, PLTR Partners with NVDA | M觀點

Anthropic's Rapid Model Releases, GPT 5.6's Gated Launch, and The Real AI Jobs Story | 141

NOW Fable's Back?

WW 990: Don't Be Nostalgic for Stupid - The Doom & Gloom Watch