Podcasts about AI engineers

  • 112 podcasts
  • 211 episodes
  • 46m average duration
  • 5 new episodes weekly
  • Latest episode: Feb 27, 2026

Popularity trend: 2019–2026


Best podcasts about AI engineers

Latest podcast episodes about AI engineers

The Clip Out
Hollywood Comes Calling for Cody Rigsby

Feb 27, 2026 · 46:44


In this episode, we cover the latest updates from the world of Peloton and the broader fitness industry. We break down new app features, instructor news, and legal updates impacting the platform. Here is what we discuss in this week's lineup:
  • Peloton introduces a new "Chat" feature within the app to enhance social connectivity.
  • Club Peloton adds a new exclusive live class to its roster (plus, we discuss the new display for Legend status).
  • Booking limits for in-person Peloton studio classes have been expanded for members.
  • The Peloton Run Club officially kicks off Race Week festivities in Melbourne.
  • Peloton secures another legal victory regarding the Bike+ trademark lawsuit.
  • The company is actively hiring for a new AI Engineer role to support future tech developments.
  • Instructor Cody Rigsby takes his talents to the big screen with a new movie role.
  • Nico Sarani is finally making her return to the United States.
  • Jon Hosking is scheduled for an appearance at PSNY in 2026.
  • Becs Gentry is featured in People Magazine discussing motherhood and ultra-marathon training.
  • Robin Arzón launches her "Project Swagger" podcast, featuring an "old school" radio vibe with a call-in hotline for listener questions.
  • Kristen McGee releases a new book focused on prioritizing self-care without guilt.
  • We provide an overview of the upcoming Artist Series slated for March.
  • Reality star Snooki reveals a cervical cancer diagnosis.
  • SoulCycle announces a new partnership with Loop Earplugs.
  • TCO Top 5: We share this week's listener-recommended fitness classes.
  • This Week at Peloton: A recap of the week's biggest highlights and scenic meditation updates.
  • TCO Radar: Our curated list of classes you should be paying attention to right now.
  • The Stranger Things content is leaving the platform soon—take it while you can.
  • Peloton launches a new Pilates Challenge for members.
  • A new "Road To Recovery: ACL" program is now available for rehabilitation support.
  • Alex K and Ash Pryor team up for a special 2-for-1 row.
  • Assal Arian and Mayla Wedekind host a 2-for-1 session for Women's History Month.
  • We take a closer look at the new collection of Progression Rides.
Thanks for listening! If you enjoyed this update, please subscribe and leave a review.
See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

New Books Network
Kola Tytler: Sneakerhead, Entrepreneur, and Medical Doctor

Feb 25, 2026 · 65:00


In this conversation we hear about Kola's journey as a self-taught coder, his business school education, learning by doing, and how he is self-funding a one-person AI company for doctors. Kola Tytler's parallel journey as an NHS doctor while building a pioneering and potentially world-changing business is inspiring. Listen in on a remarkable conversation between host Richard Lucas and Kola Tytler, now a qualified doctor who taught himself to code. We explore:
  • The roots of his entrepreneurial activity, despite knowing he wanted to be a doctor from a young age.
  • The influence and opportunities of being an immigrant from a different background as he went to medical school in London.
  • His first venture selling event tickets via a Facebook platform, scaling a fashion blog with millions of followers, and launching and exiting the successful Dropout retail business in Milan.
  • Lessons of having investors who were not always aligned.
  • How he dealt with realising that he might have a bigger financial opportunity through dropping out of his studies.
  • The benefits and limitations of bootstrapping when you have the resources to put together a great team.
  • The impact of both his formal business school education and self-tuition via online resources like Y Combinator, and prominent Silicon Valley figures like Steve Jobs and Bill Gates.
  • The ambition and vision for his self-funded AI platform for doctors, iatroX, which provides clinical guidance to over 20,000 users.
Kola's journey is a masterclass in calculated risk and relentless drive, and he shares the critical lessons he has learned from his triumphs and challenges. Through insightful questions, Richard draws out key takeaways on finding balance, the importance of a strong team, understanding domain expertise, and the necessity of continuous business education. This episode is packed with inspiration for anyone looking to bridge diverse passions and build a high-impact venture.
About Kola
Dr Kola Tytler – doctor/MBA & full-stack developer
  • MBBS @ King's College London
  • Certificates in Law & Business (LSE & Imperial)
  • MBA (with merit) @ University of Birmingham
  • MSt Entrepreneurship @ University of Cambridge '26
  • Forbes 30 Under 30 Europe, Forbes 100 Under 30 Italy
  • IBM-certified AI Engineer & MENSA member
  • Founder of YEEZY Mafia, Dropout, & HypeAnalyzer
Links
iatroX is a UKCA-marked, MHRA-registered medical device. It acts as an AI-driven assistant that centralizes clinical guidelines, offering: (1) quick Q&A, (2) structured brainstorming, and (3) an adaptive quiz engine for medical students.
Kola Tytler's LinkedIn | Kola Tytler's personal website | Drop Out Milano | Hype Analyzer | CAMentrepreneurs
Learn more about your ad choices. Visit megaphone.fm/adchoices
Support our show by becoming a premium member! https://newbooksnetwork.supportingcast.fm/new-books-network

The New Stack Podcast
NanoClaw's answer to OpenClaw is minimal code, maximum isolation

Feb 20, 2026 · 51:54


On The New Stack Agents, Gavriel Cohen discusses why he built NanoClaw, a minimalist alternative to OpenClaw, after discovering security and architectural flaws in the rapidly growing agentic framework. Cohen, co-founder of AI marketing agency Qwibit, had been running agents across operations, sales, and research using Claude Code. When Clawdbot (later OpenClaw) launched, it initially seemed ideal. But Cohen grew concerned after noticing questionable dependencies—including his own outdated GitHub package—excessive WhatsApp data storage, a massive AI-generated codebase nearing 400,000 lines, and a lack of OS-level isolation between agents. In response, he created NanoClaw with radical minimalism: only a few hundred core lines, minimal dependencies, and containerized agents. Built around Claude Code "skills," NanoClaw enables modular, build-time integrations while keeping the runtime small enough to audit easily. Cohen argues AI changes coding norms—favoring duplication over DRY, relaxing strict file limits, and treating code as disposable. His goal is simple, secure infrastructure that enterprises can fully understand and trust.
Learn more from The New Stack about the latest around personal AI agents:
  • Anthropic: You can still use your Claude accounts to run OpenClaw, NanoClaw and Co.
  • It took a researcher fewer than 2 hours to hijack OpenClaw
  • OpenClaw is being called a security "Dumpster fire," but there is a way to stay safe
Join our community of newsletter subscribers to stay on top of the news and at the top of your game.

The New Stack Podcast
The developer as conductor: Leading an orchestra of AI agents with the feature flag baton

Feb 19, 2026 · 19:32


A few weeks after Dynatrace acquired DevCycle, Michael Beemer and Andrew Norris discussed on The New Stack Makers podcast how feature flagging is becoming a critical safeguard in the AI era. By integrating DevCycle's feature flagging into the Dynatrace observability platform, the combined solution delivers a "360-degree view" of software performance at the feature level. This closes a key visibility gap, enabling teams to see exactly how individual features affect systems in production. As "agentic development" accelerates—where AI agents rapidly generate code—feature flags act as a safety net. They allow teams to test, control, and roll back AI-generated changes in live environments, keeping a human in the loop before full releases (see the sketch below). This reduces risk while speeding enterprise adoption of AI tools. The discussion also highlighted support for the Cloud Native Computing Foundation's OpenFeature standard to avoid vendor lock-in. Ultimately, developers are evolving into "conductors," orchestrating AI agents with feature flags as their baton.
Learn more from The New Stack about the latest around AI enterprise development:
  • Why You Can't Build AI Without Progressive Delivery
  • Beyond automation: Dynatrace unveils agentic AI that fixes problems on its own
Join our community of newsletter subscribers to stay on top of the news and at the top of your game.
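To make the mechanism concrete, here is a minimal TypeScript sketch (not from the episode) of gating an AI-generated code path behind an OpenFeature flag, so a human can turn it off in production without a redeploy. The flag name and both ranker functions are hypothetical; a real setup would also register a provider (DevCycle ships OpenFeature providers) via OpenFeature.setProvider.

```typescript
import { OpenFeature } from '@openfeature/server-sdk';

// With no provider registered, the SDK falls back to the default value
// passed below (false), i.e. the human-written path stays live.
const client = OpenFeature.getClient();

// Hypothetical stand-ins for real business logic.
function aiGeneratedRanker(input: string): string { return `ai:${input}`; }
function legacyRanker(input: string): string { return `legacy:${input}`; }

export async function handleRequest(input: string): Promise<string> {
  // Flip this flag off to roll back the AI-generated change instantly.
  const useAiPath = await client.getBooleanValue('ai-generated-ranker', false);
  return useAiPath ? aiGeneratedRanker(input) : legacyRanker(input);
}
```

The point of the pattern is that the rollback lever lives in the flag system rather than in a new release, which is what keeps a human in the loop after AI agents merge code.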

The Kapeel Gupta Career Podshow
AI Engineer Career Guide: Scope, Salary, Skills & Future Opportunities in India and Abroad

Feb 14, 2026 · 11:02


Artificial Intelligence is no longer the future — it is the present. From Netflix recommendations and fraud detection to medical diagnosis and self-driving cars, AI Engineers are quietly building the intelligent systems shaping our world. But what does it really take to become an AI Engineer? Is it only for IIT toppers, or can any focused student build a career in this powerful field? In this episode of The Kapeel Gupta Career PodShow, we decode the complete roadmap to becoming an AI Engineer — including scope in India and abroad, required skills, educational pathways, real-world work life, and salary potential. If you are a student curious about technology, coding, and future-ready careers, this episode will help you see AI Engineering not just as a job… but as a life path you can consciously design.

The New Stack Podcast
You can't fire a bot: The blunt truth about AI slop and your job

Feb 11, 2026 · 57:18


Matan-Paul Shetrit, Director of Product Management at Writer, argues that people must take responsibility for how they use AI. If someone produces poor-quality output, he says, the blame lies with the user—not the tool. He believes many misunderstand AI's role, confusing its ability to accelerate work with an abdication of accountability. Speaking on The New Stack Agents podcast, Shetrit emphasized that "we're all becoming editors," meaning professionals increasingly review and refine AI-generated content rather than create everything from scratch. However, ultimate responsibility remains human. If an AI-generated presentation contains errors, the presenter—not the AI—is accountable. Shetrit also discussed the evolving AI landscape, contrasting massive general-purpose models from companies like OpenAI and Google with smaller, specialized models. At Writer, the focus is on enabling enterprise-scale AI adoption by reducing costs, improving accuracy, and increasing speed. He argues that bespoke, narrowly focused models tailored to specific use cases are essential for delivering reliable, cost-effective AI solutions at scale.
Learn more from The New Stack about the latest around enterprise development:
  • Why Pure AI Coding Won't Work for Enterprise Software
  • How To Use Vibe Coding Safely in the Enterprise
Join our community of newsletter subscribers to stay on top of the news and at the top of your game.
Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

The New Stack Podcast
GitLab CEO on why AI isn't helping enterprise ship code faster

Feb 10, 2026 · 57:18


AI coding assistants are boosting developer productivity, but most enterprises aren't shipping software any faster. GitLab CEO Bill Staples says the reason is simple: coding was never the main bottleneck. After speaking with more than 60 customers, Staples found that developers spend only 10–20% of their time writing code. The remaining 80–90% is consumed by reviews, CI/CD pipelines, security scans, compliance checks, and deployment—areas that remain largely unautomated. Faster code generation only worsens downstream queues. GitLab's response is its newly GA'ed Duo Agent Platform, designed to automate the full software development lifecycle. The platform introduces "agent flows," multi-step orchestrations that can take work from issue creation through merge requests, testing, and validation. Staples argues that context is the key differentiator. Unlike standalone coding tools that only see local code, GitLab's all-in-one platform gives agents access to issues, epics, pipeline history, security data, and more through a unified knowledge graph. Staples believes this platform approach, rather than fragmented point solutions, is what will finally unlock enterprise software delivery at scale.
Learn more from The New Stack about the latest around GitLab and AI:
  • GitLab Launches Its AI Agent Platform in Public Beta
  • GitLab's Field CTO Predicts: When DevSecOps Meets AI
Join our community of newsletter subscribers to stay on top of the news and at the top of your game.
Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

DOU Podcast
A domain for $70M | Starlink "white list" | HBO's Baldur's Gate 3 series — DOU News #236

Feb 9, 2026 · 40:59


In this fresh DOU News digest, we break down a new Data Science salary report and discuss Elon Musk's mega-deal. Also in this episode: mandatory Starlink registration in Ukraine via Diia and administrative service centers (TsNAPs), Ukrainian startup Swarmer heading for an IPO, and the announcement of a series based on Baldur's Gate 3 from the creator of The Last of Us. Watch these and other news stories from the Ukrainian and global tech sectors.
Timecodes
00:00 Intro
00:21 Data specialists' salaries: falling for AI Engineers, rising for Product Analysts
04:32 Mandatory Starlink verification
09:13 Ukrainian defense-tech startup Swarmer prepares for an IPO
11:02 Electric height-adjustable desks, comfortable chairs, and accessories from STIYSTIL
12:15 VAT for FOP sole proprietors: the minimum income threshold may be raised
13:18 DOU and KOLO fundraiser for the National Guard of Ukraine
14:33 SpaceX buys xAI for $1.25 trillion
17:46 Notepad++ breach: how hackers hijacked the popular editor's update system
19:31 OpenClaw and security "horrors": why AI skills can steal your data
23:42 The EU vs. TikTok: infinite-feed algorithms ruled unlawful
26:16 Record sale of the AI.com domain for $70M
27:58 Super Bowl showdown: OpenAI Codex vs. Anthropic Claude
30:14 Releases of the week: Claude Opus 4.6 and OpenAI GPT-5.3-Codex
34:06 Xcode 26.3: Apple adds "vibe coding" support
35:53 Valve delays the Steam Machine due to memory shortages
37:23 HBO announced a Baldur's Gate 3 series
38:44 Zhenia's recommendations: AI 2027 and the video "I shipped code I don't understand"

The Edtech Podcast
#313 Bett: Lightening the Admin Load, Strengthening Human Connection

Feb 6, 2026 · 60:36


In this episode, we explore the intersection of technology, scalability, and human connection. We begin with Peter Lancet from My Trencher, who discusses the "numbers game" university career services face. He explains how his platform provides an "operational layer" that allows small teams to support thousands of students, moving from reactive advice to strategically identifying and assisting those most at risk. The conversation then deepens with Doug Delpha, a Professor at the University of Strasbourg and AI Engineer. Doug advocates for a return to the "Aristotle and Plato" style of mentorship, arguing that while teachers must become AI-proficient, their primary role is to be a protected human figure who can handle sensitive student crises that chatbots simply cannot. Crucially, Doug also issues a wake-up call regarding digital sovereignty, challenging international institutions to build "elasticity" into their systems rather than relying blindly on American infrastructure that could be impacted by geopolitical shifts or outages. We then turn our focus to the administrative side of inclusion with Andy Bell from Trellis. A winner of the CivTech award in Scotland, Andy shares how his team is using AI to alleviate the paperwork burden in Special Educational Needs and Disabilities (SEND) departments. By transcribing meetings and automatically creating first drafts of child plans, Trellis is helping councils and teachers reclaim valuable time, allowing them to focus less on administration and more on the students who need them most. Ready to learn more about the future of learning? This episode is proudly sponsored by Everway and Edmentum. Discover how they are transforming education by visiting https://www.everway.com/ and https://www.edmentum.com/.

The New Stack Podcast
The enterprise is not ready for "the rise of the developer"

Feb 5, 2026 · 25:50


Sean O'Dell of Dynatrace argues that enterprises are unprepared for a major shift brought on by AI: the rise of the developer. Speaking at Dynatrace Perform in Las Vegas, O'Dell explains that AI-assisted and "vibe" coding are collapsing traditional boundaries in software development. Developers, once insulated from production by layers of operations and governance, are now regaining end-to-end ownership of the entire software lifecycle — from development and testing to deployment and security. This shift challenges long-standing enterprise structures built around separation of duties and risk mitigation. At the same time, the definition of "developer" is expanding. With AI lowering technical barriers, software creation is becoming more about creative intent than mastery of specialized tools, opening the door to nontraditional developers. Experimentation is also moving into production environments, a change that would have seemed reckless just 18 months ago. According to O'Dell, enterprises now understand AI well enough to experiment confidently, but many are not ready for the cultural, operational, and security implications of developers — broadly defined — taking full control again.
Learn more from The New Stack about the latest around enterprise developers and AI:
  • Retool's New AI-Powered App Builder Lets Non-Developers Build Enterprise Apps
  • Solving 3 Enterprise AI Problems Developers Face
  • Enterprise Platform Teams Are Stuck in Day 2 Hell
Join our community of newsletter subscribers to stay on top of the news and at the top of your game.
Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

The New Stack Podcast
Meet Gravitino, a geo-distributed, federated metadata lake

Jan 29, 2026 · 29:27


In the era of agentic AI, attention has largely focused on data itself, while metadata has remained a neglected concern. Junping (JP) Du, founder and CEO of Datastrato, argues that this must change as AI fundamentally alters how data and metadata are consumed, governed, and understood. To address this gap, Datastrato created Apache Gravitino, an open source, high-performance, geo-distributed, federated metadata lake designed to act as a neutral control plane for metadata and governance across multi-modal, multi-engine AI workloads. Gravitino achieved major milestones in 2025, including graduation as an Apache Top Level Project, a stable 1.1.0 release, and membership in the new Agentic AI Foundation. Du describes Gravitino as a "catalog of catalogs" that unifies metadata across engines like Spark, Trino, Ray, and PyTorch, eliminating silos and inconsistencies. Built to support both structured and unstructured data, Gravitino enables secure, consistent, and AI-friendly data access across clouds and regions, helping enterprises manage governance, access control, and scalability in increasingly complex AI environments.
Learn more from The New Stack about how the latest data and metadata are consumed, governed, and understood:
  • Is Agentic Metadata the Next Infrastructure Layer?
  • Why AI Loves Object Storage
  • The Real Bottleneck in Enterprise AI Isn't the Model, It's Context
Join our community of newsletter subscribers to stay on top of the news and at the top of your game.
Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
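As a rough illustration of the "catalog of catalogs" idea (not taken from the episode), the sketch below asks a Gravitino server which catalogs are federated under one metalake via its REST API. The host, port, metalake name, and response handling are assumptions; check the Apache Gravitino REST documentation for the exact paths and response shapes in your version.

```typescript
// Minimal sketch: list the catalogs Gravitino federates under a metalake.
// Assumes a local server; the endpoint path follows the pattern documented
// for Apache Gravitino's REST API, but verify it against your release.
async function listCatalogs(metalake: string): Promise<void> {
  const res = await fetch(`http://localhost:8090/api/metalakes/${metalake}/catalogs`);
  if (!res.ok) throw new Error(`Gravitino request failed: ${res.status}`);
  // Each catalog entry maps onto an underlying engine or store
  // (Hive, Iceberg, a message bus, ...) unified behind one namespace.
  console.log(JSON.stringify(await res.json(), null, 2));
}

listCatalogs('demo_metalake').catch(console.error);
```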

Tea Time with Talent Acquisition
Going beyond ChatGPT and Prompts

Jan 26, 2026 · 50:46


Tea Time with Talent Acquisition is proudly sponsored by Peritus Partners - Peritus Partners - Next Generation Recruitment
------------------------------
Over the last few years we have had a few talks about AI within recruitment and talent acquisition, but Senior Global Talent Advisor Hari Sarangam takes it a step further... not talking about theory and ideas, but rather real, practical AI examples that can become the starting block for building your own AI agents for personal or work life! Hari doesn't claim to be an AI Engineer or an expert in developing agents; he is simply someone in our field with enthusiasm about what is happening in tech in general, and as a result he has taken it upon himself to explore AI beyond articles and videos with his own AI agent, which he gracefully talks us through during this packed episode. If you are unsure how to start building AI agents, then this is for you.
Feel free to find Hari here - Dr Hari Sarangam | LinkedIn
------------------------------
If you find this topic interesting, or have an idea for a future topic that you would like to speak on yourself or would like us to cover, please reach out to the host, Eden, here - Eden Whitcomb | LinkedIn
Thank you for supporting our project; if you liked what you heard then we would appreciate it if you could drop us a review or rating!
#podcast #recruitment #talentacquisition #hiring #HR #HRBP #AI #ChatGPT #promptengineering

The New Stack Podcast
CTO Chris Aniszczyk on the CNCF push for AI interoperability

Jan 22, 2026 · 23:33


Chris Aniszczyk, co-founder and CTO of the Cloud Native Computing Foundation (CNCF), argues that AI agents resemble microservices at a surface level, though they differ in how they are scaled and managed. In an interview ahead of KubeCon/CloudNativeCon Europe, he emphasized that being "AI native" requires being cloud native by default. Cloud-native technologies such as containers, microservices, Kubernetes, gRPC, Prometheus, and OpenTelemetry provide the scalability, resilience, and observability needed to support AI systems at scale. Aniszczyk noted that major AI platforms like ChatGPT and Claude already rely on Kubernetes and other CNCF projects. To address growing complexity in running generative and agentic AI workloads, the CNCF has launched efforts to extend its conformance programs to AI. New requirements—such as dynamic resource allocation for GPUs and TPUs and specialized networking for inference workloads—are being handled inconsistently across the industry. CNCF aims to establish a baseline of compatibility to ensure vendor neutrality. Aniszczyk also highlighted CNCF incubation projects like Metal³ for bare-metal Kubernetes and OpenYurt for managing edge-based Kubernetes deployments.
Learn more from The New Stack about CNCF and what to expect in 2026:
  • Why the CNCF's New Executive Director Is Obsessed With Inference
  • CNCF Dragonfly Speeds Container, Model Sharing with P2P
Join our community of newsletter subscribers to stay on top of the news and at the top of your game.
Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

The Linus Tech Podcast
AI Engineer's Power Role

Jan 21, 2026 · 11:09


In this episode, we discuss the AI engineer's role in addressing energy demands, highlighting the Trump administration's $15 billion power plant proposal. We explore the technical challenges and solutions for powering AI development.
Chapters
00:00 AI's Power Hunger
01:48 Trump Admin's Plan
10:07 Energy Source Debate
13:30 Accountability for Consumption
In this episode, we explore the Trump administration's proposal for tech companies to invest $15 billion in power plants to meet the surging electricity demands of AI and data centers. We also discuss its potential impact on America's power grid, consumer costs, and the national effort to remain competitive in the global AI race.
Chapters
00:00 $15 Billion Power Plant Deal
01:48 PJM Grid Operator & Energy Needs
04:44 Alternative Power Solutions
06:45 The Trump Administration's Plan
10:05 Renewables vs. Consistent Power
13:33 Addressing Critiques and Future
Links
Get the top 40+ AI Models for $20 at AI Box: https://aibox.ai
AI Chat YouTube Channel: https://www.youtube.com/@JaedenSchafer
Join my AI Hustle Community: https://www.skool.com/aihustle
See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

The Next 100 Days Podcast
#508 - Faiza Patan - From Student to AI Engineer

Jan 16, 2026 · 51:53


Faiza has gone from student to AI Engineer, developing valuable solutions for MicroYES and Finely Fettled clients. Her skills include AWS, Linux, and DevOps. She hails from Southern India and will complete her MSc in International Management at York St John University in early 2026. She is currently developing lead generation AI solutions for Finely Fettled and MicroYES clients.
Summary of Podcast
Key Takeaways
  • Faiza's career progressed from student to AI Engineer via a structured path: internship → placement → full-time hire.
  • Her role involves building AI agents (e.g., "Phone to Agent") and Answer Engine Optimisation (AEO) to help clients get found in LLM answers, a critical shift from traditional SEO.
  • The hiring process used Handshake, a university student-focused job platform, and video interviews, where key advice for students is to speak up, slow down, smile, and make eye contact.
  • AI is shifting the workforce from manual research to higher-value roles like AI architecture, with low-code/no-code tools enabling non-technical entry.
Faiza's Career Progression
  • Background: From Kadapa, Southern India, with a Bachelor of Commerce.
  • Early skill-building: Completed a 6-month course in AWS, Linux, and DevOps in Bangalore while working in inside sales.
  • UK education: Chose York St John University for its placement year option, which Manchester Metropolitan lacks.
  • Hiring process: Found via Handshake, a university job platform; a video interview, in which students answer AI-generated questions on camera, was a key step.
  • Career path: an initial internship at Finely Fettled and its brand MicroYES, an extended 9-month placement contract, and then a full-time hire as an AI Engineer/Architect and Marketing Manager.
AI in Business & Marketing
  • MeclabsAI platform: Faiza's work on this AI solutions platform includes AI agent delivery systems (personalised agents, not generic chatbots) and AI workflows (self-service tools, like a database query workflow on the https://finelyfettled.co.uk website).
  • "Phone to Agent": A new service for small businesses. An AI agent answers calls using the client's specific policies and pricing, designed for natural conversation (e.g., "mm-hmm" confirmations, background noise). Rationale: provides cost-effective, consistent phone support for busy professionals and small businesses.
  • Answer Engine Optimisation (AEO): Anticipates ChatGPT providing more answers than Google by early 2028, making AEO a critical marketing strategy. The goal is to structure website content to be found and cited in LLM answers, with an AI agent guiding clients through the process.
  • The value of diversity: Kevin noted Faiza's value comes from her diverse perspective (age, gender, culture), which provides fresh insights.
Advice for Students
  • Set a clear goal: Define a career path and stay focused.
  • Use university resources: Actively leverage career services and platforms like...

The New Stack Podcast
Solving the Problems that Accompany API Sprawl with AI

Jan 15, 2026 · 19:19


API sprawl creates hidden security risks and missed revenue opportunities when organizations lose visibility into the APIs they build. According to IBM's Neeraj Nargund, APIs power the core business processes enterprises want to scale, making automated discovery, observability, and governance essential—especially when thousands of APIs exist across teams and environments. Strong governance helps identify endpoints, remediate shadow APIs, and manage risk at scale. At the same time, enterprises increasingly want to monetize the data APIs generate, packaging insights into products and pricing and segmenting usage, a need amplified by the rise of AI. To address these challenges, Nargund highlights "smart APIs," which are infused with AI to provide context awareness, event-driven behavior, and AI-assisted governance throughout the API lifecycle. These APIs help interpret and act on data, integrate with AI agents, and support real-time, streaming use cases. IBM's latest API Connect release embeds AI across API management and is designed for hybrid and multi-cloud environments, offering centralized governance, observability, and control through a single hybrid control plane.
Learn more from The New Stack about smart APIs:
  • Redefining API Management for the AI-Driven Enterprise
  • How To Accelerate Growth With AI-Powered Smart APIs
  • Wrangle Account Sprawl With an AI Gateway
Join our community of newsletter subscribers to stay on top of the news and at the top of your game.
Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

The Linus Tech Podcast
AI Engineer's Brain Challenge

Jan 15, 2026 · 11:26


In this episode, we address the engineering challenges and opportunities presented by Merge Labs' brain-computer interface technology for AI engineers. We discuss the development of systems that can interpret neural signals.
Resources Mentioned
01:49 AIbox.ai
02:27 AIbox.ai
11:14 AIbox.ai
See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

The New Stack Podcast
CloudBees CEO: Why Migration Is a Mirage Costing You Millions

Jan 13, 2026 · 34:08


A CloudBees survey reveals that enterprise migration projects often fail to deliver promised modernization benefits. In 2024, 57% of enterprises spent over $1 million on migrations, with average overruns costing $315,000 per project. On The New Stack Makers podcast, CloudBees CEO Anuj Kapur describes this pattern as "the migration mirage," where organizations chase modernization through costly migrations that push value further into the future. Findings from the CloudBees 2025 DevOps Migration Index show leaders routinely underestimate the longevity and resilience of existing systems. Kapur notes that applications often outlast CIOs, yet new leadership repeatedly mandates wholesale replacement. The report argues modernization has been mistakenly equated with migration, which diverts resources from customer value to replatforming efforts. Beyond financial strain, migration erodes developer morale by forcing engineers to rework functioning systems instead of building new solutions. CloudBees advocates meeting developers where they are, setting flexible guardrails rather than enforcing rigid platforms. Kapur believes this approach, combined with emerging code assistance tools, could spark a new renaissance in software development by 2026.
Learn more from The New Stack about enterprise modernization:
  • Why AI Alone Fails at Large-Scale Code Modernization
  • How AI Can Speed up Modernization of Your Legacy IT Systems
Join our community of newsletter subscribers to stay on top of the news and at the top of your game.
Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

GOTO - Today, Tomorrow and the Future
The AI Engineer's Guide to Surviving the EU AI Act • Larysa Visengeriyeva & Barbara Lampl

Jan 13, 2026 · 32:03


This interview was recorded for the GOTO Book Club: http://gotopia.tech/bookclub
Check out more here: https://gotopia.tech/episodes/409
Dr. Larysa Visengeriyeva - Author of "The AI Engineer's Guide to Surviving the EU AI Act" & Independent Consultant for EU AI Act Engineering
Barbara Lampl - Behavioral Mathematician at empathic business by Barbara Lampl
RESOURCES
Larysa
https://x.com/visenger
https://bsky.app/profile/visenger.bsky.social
https://github.com/visenger
https://www.linkedin.com/in/larysavisenger
Barbara
https://x.com/BarbaraLampl
https://www.linkedin.com/in/barbaralampl
https://barbara-lampl.tumblr.com
Links
https://ml-ops.org
https://github.com/visenger/awesome-mlops
https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng
https://machinelearningcanvas.com
https://louisdorard.gumroad.com/l/mlcanvas
https://ml-ops.org/content/crisp-ml
DESCRIPTION
Barbara Lampl interviews Larysa Visengeriyeva, software engineer and "godmother of MLOps", about her new book on AI engineering and compliance. What starts as a discussion about the EU AI Act quickly reveals a deeper truth: the real challenge isn't regulatory compliance - it's fundamental engineering practices. Larysa argues that quality AI systems require robust MLOps, comprehensive documentation, and proper data governance, whether regulation mandates it or not. Drawing from frameworks like CRISP-ML and the Machine Learning Canvas, the book provides practical checklists and methodologies for taking AI projects from prototype to production. Written partially in Ukraine during wartime, this "battle-tested" guide addresses the gap between technical and non-technical stakeholders, offering a common language for building sustainable AI systems.
RECOMMENDED BOOKS
Larysa Visengeriyeva • The AI Engineer's Guide to Surviving the EU AI Act • https://amzn.to/42SKOuU
Lakshmanan, Robinson & Munn • Machine Learning Design Patterns • https://amzn.to/4ox4Eos
Phil Winder • Reinforcement Learning • https://amzn.to/3t1S1VZ
Diana Montalion • Learning Systems Thinking • https://amzn.to/3ZpycdJ
Bernd Rücker • Practical Process Automation • https://amzn.to/3cs3BSH
Lauren Maffeo • Designing Data Governance from the Ground Up • https://amzn.to/3QhIlnV
Katharine Jarmul • Practical Data Privacy • https://amzn.to/46XPrns
Zhamak Dehghani • Data Mesh • https://amzn.to/3tTCwAC
Kate Stanley & Mickael Maison • Kafka Connect • https://amzn.to/40Jq5Jz
Bluesky | Twitter | Instagram | LinkedIn | Facebook
CHANNEL MEMBERSHIP BONUS
Join this channel to get early access to videos & other perks: https://www.youtube.com/channel/UCs_tLP3AiwYKwdUHpltJPuA/join
Looking for a unique learning experience? Attend the next GOTO conference near you! Get your ticket: gotopia.tech
SUBSCRIBE TO OUR YOUTUBE CHANNEL - new videos posted daily!

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
Artificial Analysis: Independent LLM Evals as a Service — with George Cameron and Micah-Hill Smith

Jan 8, 2026 · 78:24


Happy New Year! You may have noticed that in 2025 we had moved toward YouTube as our primary podcasting platform. As we'll explain in the next State of Latent Space post, we'll be doubling down on Substack again and improving the experience for the over 100,000 of you who look out for our emails and website updates!

We first mentioned Artificial Analysis in 2024, when it was still a side project in a Sydney basement. They were then one of the few AI Grant companies to raise a full seed round from Nat Friedman and Daniel Gross, and have now become the independent gold standard for AI benchmarking—trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities.

We have chatted with both Clementine Fourrier of HuggingFace's OpenLLM Leaderboard and (the freshly valued at $1.7B) Anastasios Angelopoulos of LMArena on their approaches to LLM evals and trendspotting, but Artificial Analysis have staked out an enduring and important place in the toolkit of the modern AI Engineer by doing the best job of independently running the most comprehensive set of evals across the widest range of open and closed models, and charting their progress for broad industry analyst use.

George Cameron and Micah-Hill Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is "open" really?

We discuss:
* The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after Swyx's retweet
* Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers
* The mystery shopper policy: they register accounts not on their own domain and run intelligence + performance benchmarks incognito to prevent labs from serving different models on private endpoints
* How they make money: an enterprise benchmarking insights subscription (standardized reports on model deployment, serverless vs. managed vs. leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)
* The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs
* Omissions Index (hallucination rate): scores models from -100 to +100 (penalizing incorrect answers, rewarding "I don't know"), and Claude models lead with the lowest hallucination rates despite not always being the smartest
* GDP Val AA: their version of OpenAI's GDP-bench (44 white-collar tasks with spreadsheets, PDFs, PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system), graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias)
* The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2 OLMo 2 leads, followed by Nous Hermes and NVIDIA Nemotron)
* The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents)
* Why sparsity might go way lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omissions Index accuracy correlates with total parameters (not active), suggesting massive sparse models are the future
* Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at using more tokens only when needed (5.1 Codex has tighter token distributions)
* V4 of the Intelligence Index coming soon: adding GDP Val AA, Critical Point, hallucination rate, and dropping some saturated benchmarks (human-eval-style coding is now trivial for small models)

Links to Artificial Analysis
* Website: https://artificialanalysis.ai
* George Cameron on X: https://x.com/georgecameron
* Micah-Hill Smith on X: https://x.com/micahhsmith

Full Episode on YouTube

Timestamps
* 00:00 Introduction: Full Circle Moment and Artificial Analysis Origins
* 01:19 Business Model: Independence and Revenue Streams
* 04:33 Origin Story: From Legal AI to Benchmarking Need
* 16:22 AI Grant and Moving to San Francisco
* 19:21 Intelligence Index Evolution: From V1 to V3
* 11:47 Benchmarking Challenges: Variance, Contamination, and Methodology
* 13:52 Mystery Shopper Policy and Maintaining Independence
* 28:01 New Benchmarks: Omissions Index for Hallucination Detection
* 33:36 Critical Point: Hard Physics Problems and Research-Level Reasoning
* 23:01 GDP Val AA: Agentic Benchmark for Real Work Tasks
* 50:19 Stirrup Agent Harness: Open Source Agentic Framework
* 52:43 Openness Index: Measuring Model Transparency Beyond Licenses
* 58:25 The Smiling Curve: Cost Falling While Spend Rising
* 1:02:32 Hardware Efficiency: Blackwell Gains and Sparsity Limits
* 1:06:23 Reasoning Models and Token Efficiency: The Spectrum Emerges
* 1:11:00 Multimodal Benchmarking: Image, Video, and Speech Arenas
* 1:15:05 Looking Ahead: Intelligence Index V4 and Future Directions
* 1:16:50 Closing: The Insatiable Demand for Intelligence

Transcript

Micah [00:00:06]: This is kind of a full circle moment for us in a way, because the first time artificial analysis got mentioned on a podcast was you and Alessio on Latent Space. Amazing.

swyx [00:00:17]: Which was January 2024. I don't even remember doing that, but yeah, it was very influential to me.
Yeah, I'm looking at AI News for Jan 17, or Jan 16, 2024. I said, this gem of a models and host comparison site was just launched. And then I put in a few screenshots, and I said, it's an independent third party. It clearly outlines the quality versus throughput trade-off, and it breaks out by model and hosting provider. I did give you s**t for missing Fireworks, and how do you have a model benchmarking thing without Fireworks? But you had Together, you had Perplexity, and I think we just started chatting there. Welcome, George and Micah, to Latent Space. I've been following your progress. Congrats on... It's been an amazing year. You guys have really come together to be the presumptive new Gartner of AI, right? Which is something that...

George [00:01:09]: Yeah, but you can't pay us for better results.

swyx [00:01:12]: Yes, exactly.

George [00:01:13]: Very important.

Micah [00:01:14]: Start off with a spicy take.

swyx [00:01:18]: Okay, how do I pay you?

Micah [00:01:20]: Let's get right into that.

swyx [00:01:21]: How do you make money?

Micah [00:01:24]: Well, very happy to talk about that. So it's been a big journey the last couple of years. Artificial Analysis is going to be two years old in January 2026, which is pretty soon now. We run the website for free, obviously, and give away a ton of data to help developers and companies navigate AI and make decisions about models, providers, technologies across the AI stack for building stuff. We're very committed to doing that and intend to keep doing that. We have, along the way, built a business that is working out pretty sustainably. We've got just over 20 people now and two main customer groups. So we want to be who enterprises look to for data and insights on AI, so we want to help them with their decisions about models and technologies for building stuff. And then on the other side, we do private benchmarking for companies throughout the AI stack who build AI stuff. So no one pays to be on the website. We've been very clear about that from the very start because there's no use doing what we do unless it's independent AI benchmarking. Yeah. But turns out a bunch of our stuff can be pretty useful to companies building AI stuff.

swyx [00:02:38]: And is it like, I am a Fortune 500, I need advisors on objective analysis, and I call you guys and you pull up a custom report for me, you come into my office and give me a workshop? What kind of engagement is that?

George [00:02:53]: So we have a benchmarking and insight subscription, which looks like standardized reports that cover key topics or key challenges enterprises face when looking to understand AI and choose between all the technologies. And so, for instance, one of the reports is a model deployment report: how to think about choosing between serverless inference, managed deployment solutions, or leasing chips and running inference yourself, as an example of the kind of decision that big enterprises face, and it's hard to reason through, like this AI stuff is really new to everybody. And so we try and help companies navigate that with our reports and insight subscription. We also do custom private benchmarking. And so that's very different from the public benchmarking that we publicize, and there's no commercial model around that. For private benchmarking, we'll at times create benchmarks, run benchmarks to specs that enterprises want. And we'll also do that sometimes for AI companies who have built things, and we help them understand what they've built with private benchmarking.
Yeah. So that's a piece mainly that we've developed through trying to support everybody publicly with our public benchmarks. Yeah.

swyx [00:04:09]: Let's talk about the tech stack behind that. But okay, I'm going to rewind all the way to when you guys started this project. You were all the way in Sydney? Yeah. Well, Sydney, Australia for me.

Micah [00:04:19]: George was in SF, but he's Australian, but he moved here already. Yeah.

swyx [00:04:22]: And I remember I had the Zoom call with you. What was the impetus for starting Artificial Analysis in the first place? You know, you started with public benchmarks. And so let's start there. We'll go to the private benchmark. Yeah.

George [00:04:33]: Why don't we even go back a little bit to, like, why we, you know, thought that it was needed? Yeah.

Micah [00:04:40]: The story kind of begins like in 2022, 2023, like both George and I have been into AI stuff for quite a while. In 2023 specifically, I was trying to build a legal AI research assistant. It actually worked pretty well for its era, I would say. Yeah. Yeah. So I was finding that the more you go into building something using LLMs, the more each bit of what you're doing ends up being a benchmarking problem. So I had, like, this multistage algorithm thing, trying to figure out what the minimum viable model for each bit was, trying to optimize every bit of it as you build that out, right? Like you're trying to think about accuracy, a bunch of other metrics and performance and cost. And mostly just no one was doing anything to independently evaluate all the models. And certainly not to look at the trade-offs for speed and cost. So we basically set out just to build a thing that developers could look at to see the trade-offs between all of those things measured independently across all the models and providers. Honestly, it was probably meant to be a side project when we first started doing it.

swyx [00:05:49]: Like we didn't like get together and say like, Hey, like we're going to stop working on all this stuff. I'm like, this is going to be our main thing. When I first called you, I think you hadn't decided on starting a company yet.

Micah [00:05:58]: That's actually true. I don't even think we'd paused, like, like George had an acquittance job. I didn't quit working on my legal AI thing. Like it was genuinely a side project.

George [00:06:05]: We built it because we needed it as people building in the space and thought, Oh, other people might find it useful too. So we'll buy a domain and link it to the Vercel deployment that we had and tweet about it. And, but very quickly it started getting attention. Thank you, Swyx, for, I think, doing an initial retweet and spotlighting this project that we released. And then very quickly though, it was useful to others, but very quickly it became more useful as the number of models released accelerated. We had Mixtral 8x7B and it was a key. That's a fun one. Yeah. Like an open source model that really changed the landscape and opened up people's eyes to other serverless inference providers and thinking about speed, thinking about cost. And so that was a key. And so it became more useful quite quickly. Yeah.

swyx [00:07:02]: What I love talking to people like you who sit across the ecosystem is, well, I have theories about what people want, but you have data and that's obviously more relevant. But I want to stay on the origin story a little bit more.
When you started out, I would say, I think the status quo at the time was every paper would come out and they would report their numbers versus competitor numbers. And that's basically it. And I remember I did the legwork. I think everyone has some knowledge. I think there's some version of an Excel sheet or a Google sheet where you just, like, copy and paste the numbers from every paper and just post it up there. And then sometimes they don't line up because they're independently run. And so your numbers are going to look better than... Your reproductions of other people's numbers are going to look worse because you don't hold their models correctly or whatever the excuse is. I think then Stanford HELM, Percy Liang's project, would also have some of these numbers. And I don't know if there's any other source that you can cite. The way that if I were to start Artificial Analysis at the same time you guys started, I would have used EleutherAI's eval framework harness. Yup.

Micah [00:08:06]: Yup. That was some cool stuff. At the end of the day, running these evals, it's like if it's a simple Q&A eval, all you're doing is asking a list of questions and checking if the answers are right, which shouldn't be that crazy. But it turns out there are an enormous number of things that you've got to control for. And I mean, back when we started the website. Yeah. Yeah. Like one of the reasons why we realized that we had to run the evals ourselves and couldn't just take results from the labs was just that they would all prompt the models differently. And when you're competing over a few points, then you can pretty easily get- You can put the answer into the model. Yeah. That in the extreme. And like you get crazy cases like back when Google did Gemini 1.0 Ultra and needed a number that would say it was better than GPT-4 and, like, constructed, I think, never-published chain of thought examples, 32 of them, in every topic in MMLU to run it, to get the score. Like there are so many things that you- They never shipped Ultra, right? That's the one that never made it up. Not widely. Yeah. Yeah. Yeah. I mean, I'm sure it existed, but yeah. So we were pretty sure that we needed to run them ourselves and just run them in the same way across all the models. Yeah. And we were also certain from the start that you couldn't look at those in isolation. You needed to look at them alongside the cost and performance stuff. Yeah.

swyx [00:09:24]: Okay. A couple of technical questions. I mean, so obviously I also thought about this and I didn't do it because of cost. Yep. Did you not worry about costs? Were you funded already? Clearly not, but you know. No. Well, we definitely weren't at the start.

Micah [00:09:36]: So like, I mean, we're paying for it personally at the start. There's a lot of money. Well, the numbers weren't nearly as bad a couple of years ago. So we certainly incurred some costs, but we were probably in the order of like hundreds of dollars of spend across all the benchmarking that we were doing. Yeah. So nothing. Yeah. It was like kind of fine. Yeah. Yeah. These days that's gone up an enormous amount for a bunch of reasons that we can talk about. But yeah, it wasn't that bad because you can also remember that, like, the number of models we were dealing with was hardly any and the complexity of the stuff that we wanted to do to evaluate them was a lot less. Like we were just asking some Q&A type questions, and then one specific thing was, for a lot of evals initially, we were just, like, sampling an answer.
You know, like, what's the answer for this? Like, we didn't want to go into the answer directly without letting the models think. We weren't even doing chain of thought stuff initially. And that was the most useful way to get some results initially. Yeah.

swyx [00:10:33]: And so for people who haven't done this work, literally parsing the responses is a whole thing, right? Like because sometimes the models, the models can answer any way they see fit, and sometimes they actually do have the right answer, but they just returned the wrong format, and they will get a zero for that unless you work it into your parser. And that involves more work. And so, I mean, but there's an open question whether you should give it points for not following your instructions on the format.
So, the issue is that sometimes they may give you a special endpoint, which is… Ah, 100%.Micah [00:13:21]: Yeah, yeah, yeah. Exactly. So, we laser-focus, in everything we do, on having the best independent metrics and making sure that no one can manipulate them in any way. There are quite a lot of processes we've developed over the last couple of years to make that true. For the one you bring up right here: if we're working with a lab and they're giving us a private endpoint to evaluate a model, it is totally possible that what's sitting behind that black box is not the same as what they serve on a public endpoint. We're very aware of that. We have what we call a mystery shopper policy, and we're totally transparent with all the labs we work with about this: we will register accounts not on our own domain and run both intelligence evals and performance benchmarks… Yeah, that's the job. …without them being able to identify us. And no one's ever had a problem with that. Because a thing that turns out to actually be quite a good factor in the industry is that they all want to believe that none of their competitors could manipulate what we're doing either.swyx [00:14:23]: That's true. I never thought about that. I was in the database industry prior, and there's a lot of shenanigans around benchmarking, right? So I'm just kind of going through the mental laundry list. Did I miss anything else in this category of shenanigans? Oh, potential shenanigans.Micah [00:14:36]: I mean, okay, the biggest one that I'll bring up is more of a conceptual one, actually, than direct shenanigans. It's that the things that get measured become the things that get targeted by the labs in what they're trying to build, right? Exactly. So that doesn't mean anything that we should really call shenanigans; I'm not talking about training on the test set. But if you know that you're going to be graded on a particular thing, if you're a researcher, there are a whole bunch of things that you can do to try to get better at that thing, that preferably are going to be helpful for a wide range of how actual users want to use the thing that you're building, but will not necessarily do that. So, for instance, the models are exceptional now at answering competition maths problems. There is some relevance of that type of reasoning, that type of work, to how we might use modern coding agents and stuff, but it's clearly not one for one. So the thing that we have to be aware of is that once an eval becomes the thing that everyone's looking at, scores can get better on it without that reflecting the overall generalized intelligence of these models getting better. That has been true for the last couple of years. It'll be true for the next couple of years. There's no silver bullet to defeat that, other than building new stuff to stay relevant and measure the capabilities that matter most to real users. Yeah.swyx [00:15:58]: And we'll cover some of the new stuff that you guys are building as well, which is cool. Like, you used to just run other people's evals, but now you're coming up with your own. And I think, obviously, that is a necessary path once you're at the frontier; you've exhausted all the existing evals. I think the next point in history that I have for you is AI Grant, which you guys decided to join, and you moved here. What was it like? I think you were in, like, batch two? Batch four. Batch four.
Okay.Micah [00:16:26]: I mean, it was great. Nat and Daniel are obviously great, and it's a really cool group of companies that we were in AI Grant alongside. It was really great to get Nat and Daniel on board. Obviously, they've done a whole lot of great work in the space with a lot of leading companies, and they were extremely aligned with the mission of what we were trying to do. Like, we're not quite typical of a lot of the other AI startups that they've invested in.swyx [00:16:53]: And they were very much here for the mission of what we want to do. Did they give any advice that really affected you in some way, or was one of the events very impactful? That's an interesting question.Micah [00:17:03]: I mean, I remember fondly a bunch of the speakers who came and did fireside chats at AI Grant.swyx [00:17:09]: Which is also, like, a crazy list. Yeah.George [00:17:11]: Oh, totally. Yeah, yeah, yeah. There was something about speaking to Nat and Daniel about the challenges of working through a startup: working through the questions that don't have clear answers, how to work through those methodically, and just working through the hard decisions. And they've been great mentors to us as we've built Artificial Analysis. Another benefit for us was that other companies in the batch, and other companies in AI Grant, are pushing the capabilities of what AI can do at this time. And so being in contact with them, making sure that Artificial Analysis is useful to them, has been fantastic for supporting us in working out how we should build out Artificial Analysis to continue being useful to those building on AI.swyx [00:17:59]: I think to some extent I'm of mixed opinion on that one, because to some extent your target audience is not people in AI Grant, who are obviously at the frontier. Yeah. Do you disagree?Micah [00:18:09]: To some extent. To some extent. But then, a lot of what the AI Grant companies are doing is taking capabilities coming out of the labs and trying to push the limits of what they can do, across the entire stack, for building great applications, which actually makes some of them pretty archetypical power users of Artificial Analysis: some of the people with the strongest opinions about what we're doing well, what we're not doing well, and what they want to see next from us. Yeah. Because when you're building any kind of AI application now, chances are you're using a whole bunch of different models. You're maybe switching reasonably frequently between different models for different parts of your application, to optimize what you're able to do with them at an accuracy level and to get better speed and cost characteristics. So for many of them, no, they're not commercial customers of ours; we don't charge for all our data on the website. Yeah. But they are absolutely some of our power users.swyx [00:19:07]: So let's talk about the evals as well. You started out from the general MMLU and GPQA stuff. What's next? How do you build up to the overall index? What was in V1, and how did you evolve it? Okay.Micah [00:19:22]: So first, just as background: we're talking about the Artificial Analysis intelligence index, which is our synthesis metric, pulled together currently from 10 different eval datasets, to give what we're pretty confident is the best single number to look at for how smart the models are.
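Mechanically, a synthesis metric of this kind is a weighted combination of per-eval scores. A toy sketch, with placeholder eval names and weights rather than the index's actual composition, might look like this:

```python
# Placeholder categories and weights, not the intelligence index's real rubric.
WEIGHTS = {
    "qa_knowledge": 0.3,   # MMLU/GPQA-style Q&A
    "agentic_tasks": 0.4,  # agentic/coding evals
    "long_context": 0.3,   # long context reasoning
}

def synthesis_index(scores: dict[str, float]) -> float:
    """scores maps eval name to a 0-100 result; missing evals are skipped
    and the remaining weights are renormalized."""
    present = {name: w for name, w in WEIGHTS.items() if name in scores}
    total_weight = sum(present.values())
    return sum(scores[name] * w for name, w in present.items()) / total_weight

print(synthesis_index({"qa_knowledge": 80.0, "agentic_tasks": 55.0, "long_context": 60.0}))  # 64.0
```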
Obviously, it doesn't tell the whole story. That's why we publish the whole website of charts, to dive into every part of it and look at the trade-offs. But it's the best single number. So right now, it's got a bunch of Q&A type datasets that have been very important to the industry, like a couple that you just mentioned. It's also got a couple of agentic datasets, it's got our own long context reasoning dataset, and some other use-case-focused stuff. As time goes on, the things that we're most interested in, that are going to be important to the capabilities that are becoming more important for AI and what developers care about, are first around agentic capabilities. So, surprise, surprise: we're all loving our coding agents, and how the models perform there, and on similar things for different types of work, is really important to us. Linking to use cases, to economically valuable use cases, is extremely important to us. And then we've got these things that the models still struggle with, like working really well over long contexts, that are not going to go away as specific capabilities and use cases that we need to keep evaluating.swyx [00:20:46]: But I guess one thing I was driving at was, like, V1 versus V2, and how that changed over time.Micah [00:20:53]: Like, how we've changed the index to get to where we are.swyx [00:20:55]: And I think that reflects the change in the industry. Right. So that's a nice way to tell that story.Micah [00:21:00]: Well, V1 would be completely saturated right now by almost every model coming out, because doing things like writing the Python functions in HumanEval is now pretty trivial. It's easy to forget, actually, how much progress has been made in the last two years. We obviously play the game constantly of today's version versus last week's version and the week before, and all of the small changes in the horse race between the current frontier, and who has the best smaller-than-10B model right now this week. Right. And that's very important to a lot of developers and people, especially in this particular city of San Francisco. But when you zoom out to a couple of years ago, literally most of what we were doing to evaluate the models then would all be 100% solved by even pretty small models today. And that's been one of the key things, by the way, that's driven down the cost of intelligence at every tier of intelligence, which we can talk about more in a bit. So V1, V2, V3: we made things harder, we covered a wider range of use cases, and we tried to get closer to things developers care about, as opposed to just the Q&A type stuff that MMLU and GPQA represented. Yeah.swyx [00:22:12]: I don't know if you have anything to add there. Or we could just go right into showing people the benchmark, looking around and asking questions about it. Yeah.Micah [00:22:21]: Let's do it. Okay. This would be a pretty good way to chat about a few of the new things we've launched recently. Yeah.George [00:22:26]: And I think a little bit about the direction that we want to take it, and where we want to push benchmarks. Currently, the intelligence index and evals focus a lot on raw intelligence, but we want to diversify how we think about intelligence, and we can talk about that. New evals that we've built and partnered on focus on topics like hallucination, and we've got a lot of topics that I think are not covered by the current eval set that should be.
And so we want to bring that forth. But before we get into that...swyx [00:23:01]: And so for listeners, just as a timestamp: right now, number one is Gemini 3 Pro High, then followed by Claude Opus at 70, GPT-5.1 High (you don't have 5.2 yet), and Kimi K2 Thinking. Wow. Still hanging in there. So those are the top four. That will date this podcast quickly. Yeah. Yeah. I mean, I love it. I love it. No, no. 100%. Look back this time next year and go, how cute. Yep.George [00:23:25]: Totally. A quick view of that is, okay, there's a lot. I love it. I love this chart. Yeah.Micah [00:23:30]: This is such a favorite, right? Yeah. In almost every talk that George or I give at conferences and stuff, we always put this one up first, to situate where we are at this moment in history. This, I think, is the visual version of what I was saying before about zooming out and remembering how much progress there's been. If we go back to just over a year ago, before o1, before Claude Sonnet 3.5, we didn't have reasoning models or coding agents as a thing, and the game was very, very different. If we go back even a little bit before then, we're in the era where, when you look at this chart, OpenAI was untouchable for well over a year. And, I mean, you would remember that time period well, of there being very open questions about whether or not AI was going to be competitive, like, full stop: whether or not OpenAI would just run away with it, whether we would have a few frontier labs and no one else would really be able to do anything other than consume their APIs. I am quite happy overall that the world that we have ended up in is one where... Multi-model. Absolutely. And strictly more competitive every quarter over the last few years. Yeah. This year has been insane. Yeah.George [00:24:42]: You can see it. This chart with everything added is hard to read currently. There are so many dots on it, but I think it reflects a little bit what we felt, like, how crazy it's been.swyx [00:24:54]: Why 14 as the default? Is that a manual choice? Because you've got ServiceNow in there, which is a less traditional name. Yeah.George [00:25:01]: It's models that we're highlighting by default in our charts, in our intelligence index. Okay.swyx [00:25:07]: You just have a manually curated list of stuff.George [00:25:10]: Yeah, that's right. But something that I actually don't think every Artificial Analysis user knows is that you can customize our charts and choose which models are highlighted. Yeah. And so if we take off a few names, it gets a little easier to read.swyx [00:25:25]: Yeah, yeah. A little easier to read. Totally. Yeah. But I love that you can see the o1 jump. Look at that. September 2024. And the DeepSeek jump. Yeah.George [00:25:34]: Which got close to OpenAI's leadership. They were so close. I think, yeah, we remember that moment. Around this time last year, actually.Micah [00:25:44]: Yeah, yeah, yeah. I agree. Yeah, well, in a couple of weeks. It was Boxing Day in New Zealand when DeepSeek V3 came out. And we'd been tracking DeepSeek and a bunch of the other global players that were less known over the second half of 2024, and had run evals on the earlier ones and stuff. I very distinctly remember Boxing Day in New Zealand, because I was with family for Christmas and stuff, running the evals and getting back result by result on DeepSeek V3. So this was the first of their V3 architecture, the 671B MoE.Micah [00:26:19]: And we were very, very impressed.
That was the moment where we were sure that DeepSeek was no longer just one of many players, but had jumped up to be a thing. The world really noticed when they followed that up with the RL working on top of V3, and R1 succeeding a few weeks later. But the groundwork for that absolutely was laid with a just extremely strong base model, completely open weights, which we had as the best open weights model. So, yeah, that's the jump that you really see in the chart. But I think they got a lot of attention from us on Boxing Day last year.George [00:26:48]: Boxing Day is the day after Christmas, for those not familiar.swyx [00:26:54]: I'm from Singapore. A lot of us remember Boxing Day for a different reason, for the tsunami that happened. Oh, of course. Yeah, but that was a long time ago. So yeah. So this is the rough pitch of AAQI. Is it A-A-Q-I or A-A-I-I? I-I. Okay. Good memory, though.Micah [00:27:11]: I don't know, I'm not used to it. Once upon a time, we did call it Quality Index, and we would talk about quality, performance, and price, but we changed it to intelligence.George [00:27:20]: There have been a few naming changes. We added hardware benchmarking to the site, and so benchmarks at a kind of system level. And so then we changed our throughput metric to what we now call output speed, and thenswyx [00:27:32]: throughput makes sense at a system level, so we took that name. Take me through more charts. What should people know? Obviously, the way you look at the site is probably different than how a beginner might look at it.Micah [00:27:42]: Yeah, that's fair. There's a lot of fun stuff to dive into. Maybe we can skip past all the... like, we have lots and lots of evals and stuff. The interesting ones to talk about today are a few of our recent things that probably not many people will be familiar with yet. So the first one of those is our Omniscience index. This one is a little bit different to most of the intelligence evals that we've run. We built it specifically to look at the embedded knowledge in the models, and to test hallucination by looking at, when the model doesn't know the answer, so is not able to get it correct, what its probability is of saying "I don't know" versus giving an incorrect answer. So the metric that we use for Omniscience goes from negative 100 to positive 100, because we're simply taking off a point if you give an incorrect answer to a question. We're pretty convinced that this is an example of where it makes most sense to do that, because it's strictly more helpful to say "I don't know" instead of giving a wrong answer to a factual knowledge question. And one of our goals is to shift the incentive that evals create for models, and the labs creating them, to get higher scores. Almost every eval across all of AI up until this point has been graded by simple percentage correct as the main metric, the main thing that gets hyped, and so you should take a shot at everything; there's no incentive to say "I don't know." So we did that for this one here.swyx [00:29:22]: I think there's a general field of calibration as well, like the confidence in your answer versus the rightness of the answer. Yeah, we completely agree. Yeah. Yeah.George [00:29:31]: On that: one reason that we didn't put that into this index is that we think the way to do that is not to ask the models how confident they are.swyx [00:29:43]: I don't know. Maybe it might be, though.
You put in, like, a JSON field, say "confidence," and maybe it spits out something. Yeah. You know, we have done a few evals podcasts over the years, and when we did one with Clémentine of Hugging Face, who maintains the Open LLM Leaderboard, this was one of her top requests: some kind of hallucination slash lack-of-confidence calibration thing. And so, hey, this is one of them.Micah [00:30:05]: And I mean, like anything that we do, it's not a perfect metric, or the whole story of everything that you think about as hallucination. But yeah, it's pretty useful and has some interesting results. Like, one of the things that we saw in the hallucination rate is that Anthropic's Claude models are at the very left-hand side here, with the lowest hallucination rates out of the models that we've evaluated Omniscience on. That is an interesting fact. I think it probably correlates with a lot of the previously not-really-measured vibes stuff that people like about some of the Claude models. Is the dataset public, or is there a held-out set? There's a held-out set for this one. So we have published a public test set, but we've only published 10% of it. The reason is that for this one specifically, it would be very, very easy to have data contamination, because it is just factual knowledge questions. We'll update it over time to also prevent that, but yeah, we've kept most of it held out so that we can keep it reliable for a long time. It leads us to a bunch of really cool things, including breaking down quite granularly by topic. And so we've got some of that disclosed on the website publicly right now, and there's lots more coming in terms of our ability to break out very specific topics. Yeah.swyx [00:31:23]: I would be interested. Let's dwell a little bit on this hallucination one. I noticed that Haiku hallucinates less than Sonnet, which hallucinates less than Opus. And yeah, would that be the other way around in a normal capability environment? I don't know. What do you make of that?George [00:31:37]: One interesting aspect is that we've found that there's not really a strong correlation between intelligence and hallucination. That's to say, how smart the models are in a general sense isn't correlated with their ability to, when they don't know something, say that they don't know. It's interesting that Gemini 3 Pro Preview was a big leap over here from Gemini 2.5 Flash and 2.5 Pro. And if I add Pro quickly here...swyx [00:32:07]: I bet Pro's really good. Uh, actually no, I meant the GPT Pros.George [00:32:12]: Oh yeah.swyx [00:32:13]: Because GPT Pro is rumored (we don't know for a fact) to be like eight runs and then an LLM judge on top. Yeah.George [00:32:20]: So we saw a big jump in... this is accuracy, so this is just the percent that they get correct, and Gemini 3 Pro knew a lot more than the other models. So, big jump in accuracy, but relatively no change between the Google Gemini models, between releases, in... And the hallucination rate. Exactly. And so it's likely due to just a kind of different post-training recipe for the Claude models. Yeah.Micah [00:32:45]: That's what's driven this. Yeah. You can partially blame us, and how we define intelligence, for having until now not counted hallucination as a negative in the way that we think about intelligence.swyx [00:32:56]: And so that's what we're changing.
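Concretely, the scoring rule described above (plus one for a correct answer, minus one for an incorrect answer, nothing for an abstention, scaled to a range of -100 to +100) and the hallucination rate (of the questions a model does not know, how often it answers anyway) can be sketched as follows. The labels and function names are illustrative, not Artificial Analysis's implementation:

```python
def omniscience_style_index(gradings: list[str]) -> float:
    """gradings holds one of 'correct', 'incorrect', 'abstain' per question."""
    points = {"correct": 1, "incorrect": -1, "abstain": 0}
    return 100 * sum(points[g] for g in gradings) / len(gradings)

def hallucination_rate(gradings: list[str]) -> float:
    """Of the questions the model did not get right, the fraction where it
    confidently gave a wrong answer instead of saying 'I don't know'."""
    not_known = [g for g in gradings if g != "correct"]
    return sum(g == "incorrect" for g in not_known) / len(not_known) if not_known else 0.0

# Under this rule, guessing on everything strictly hurts a model that is wrong
# more often than it is right on the questions it doesn't know.
print(omniscience_style_index(["correct", "abstain", "incorrect", "correct"]))  # 25.0
print(hallucination_rate(["correct", "abstain", "incorrect", "correct"]))       # 0.5
```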
Uh, I know many smart people who are confidently incorrect.George [00:33:02]: Look at that. That is very human. Very true. And there's a time and a place for that. I think our view is that hallucination rate makes sense in this context, where it's around knowledge, but in many cases people want the models to hallucinate, to have a go. Often that's the case in coding, or when you're trying to generate newer ideas. One eval that we added to Artificial Analysis is Critical Point, and it's really hard physics problems. Okay.swyx [00:33:32]: And is it sort of like a HumanEval type or something different, like a FrontierMath type?George [00:33:37]: It's not dissimilar to FrontierMath. These are kind of research questions that academics in the physics world would be able to answer, but models really struggle to answer. So the top score here is only about 9%.swyx [00:33:51]: And the people that created this, like Minway, and actually Ofir, who was kind of behind SWE-bench... what organization is this? Oh, it's Princeton.George [00:34:01]: A range of academics from different academic institutions, really smart people. They talked about how they turn the models up to as high a temperature as they can when they're trying to explore new ideas in physics with a thought partner, just because they want the models to hallucinate. Sometimes it's something new. Yeah, exactly.swyx [00:34:21]: So not right in every situation, but I think it makes sense, you know, to test hallucination in scenarios where it makes sense. Also, the obvious question is: this is one of many, in that every lab has a system card that shows some kind of hallucination number, and you've chosen not to endorse those and have made your own. And I think that's a choice. Totally. In some sense, the rest of Artificial Analysis is public benchmarks that other people can independently rerun; you provide it as a service here. You have to fight the "well, who are we to do this?" And your answer is that you have a lot of customers, you know, but, like, I guess, how do you convince the individual?Micah [00:35:08]: I mean, I think for hallucinations specifically, there are a bunch of different things that you might reasonably care about, and that you'd measure quite differently. Like, we've called this the Omniscience hallucination rate, not trying to declare that it's Humanity's Last Hallucination. You could have some interesting naming conventions and all this stuff. The bigger-picture answer to that is something that I actually wanted to mention just as George was explaining Critical Point as well: as we go forward, we are building evals internally, and we're partnering with academia and with AI companies to build great evals. We have pretty strong views, in various ways for different parts of the AI stack, on where there are things that are not being measured well, or things that developers care about that should be measured more and better, and we intend to be doing that. We're not necessarily obsessed with the idea that everything we do has to be done entirely within our own team. Critical Point is a cool example of where we were a launch partner for it, working with academia, and we've got some partnerships coming up with a couple of leading companies.
Those ones, obviously, we have to be careful with on some of the independent stuff, but with the right disclosure, we're completely comfortable with that. A lot of the labs have released great datasets in the past that we've used to great success independently. And so, between all of those techniques, we're going to be releasing more stuff in the future. Cool.swyx [00:36:26]: Let's cover the last couple. And then I want to talk about your trends analysis stuff, you know? Totally.Micah [00:36:31]: So actually, I have one little factoid on Omniscience. If you go back up to accuracy on Omniscience: an interesting thing about this accuracy metric is that it tracks, more closely than anything else that we measure, the total parameter count of models. Makes a lot of sense intuitively, right? Because this is a knowledge eval. This is the pure knowledge metric; we're not looking at the index and the hallucination rate stuff that we think is much more about how the models are trained. This is just: what facts did they recall? And yeah, it tracks parameter count extremely closely. Okay.swyx [00:37:05]: What's the rumored size of Gemini 3 Pro? And to be clear, not confirmed by any official source, just rumors. But rumors do fly around. Rumors. I hear all sorts of numbers. I don't know what to trust.Micah [00:37:17]: So if you draw the line on Omniscience accuracy versus total parameters, we've got all the open weights models, and you can squint and see that the leading frontier models right now are likely quite a lot bigger than the ones that we're seeing here, and than the one trillion parameters that the open weights models cap out at. There's an interesting extra data point that Elon Musk revealed recently about xAI: three trillion parameters for Grok 3 and 4, six trillion for Grok 5, but that's not out yet. Take those together, have a look, and you might reasonably form a view that there's a pretty good chance that Gemini 3 Pro is bigger than that, that it could be in the 5 to 10 trillion parameter range. To be clear, I have absolutely no idea, but just based on this chart, that's where you would land if you have a look at it. Yeah.swyx [00:38:07]: And to some extent, I actually kind of discourage people from guessing too much, because what does it really matter? As long as they can serve it at a sustainable cost, that's about it. Like, yeah, totally.George [00:38:17]: They've also got different incentives in play compared to open weights models, who are thinking about supporting others in self-deployment. For the labs who are doing inference at scale, it's, I think, less about total parameters in many cases when thinking about inference costs, and more around the number of active parameters. And so there's a bit of an incentive towards larger, sparser models. Agreed.Micah [00:38:38]: Understood. Yeah. Great. I mean, obviously, if you're a developer or company using these things, it's exactly as you say: it doesn't matter. You should be looking at all the different ways that we measure intelligence. You should be looking at the cost to run the index, and the different ways of thinking about token efficiency and cost efficiency based on the list prices, because that's all that matters.swyx [00:38:56]: It's not as good for the content creator rumor mill, where I can say, oh, GPT-4 is this small circle; look, GPT-5 is this big circle. That used to be a thing for a while.
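The back-of-envelope being described (drawing a line through Omniscience accuracy against log total parameters for open weights models, then reading off where a closed model's accuracy would sit) might look like the sketch below. The data points are placeholders for illustration, not Artificial Analysis's measurements:

```python
import numpy as np

# Illustrative placeholder points, not real data: (total parameters,
# Omniscience-style accuracy %) for a handful of open weights models.
params = np.array([32e9, 70e9, 235e9, 671e9, 1000e9])
accuracy = np.array([18.0, 24.0, 31.0, 38.0, 42.0])

# Fit accuracy against log10(total parameters), then invert the fit to ask
# what parameter count a given (closed) model's accuracy would imply.
slope, intercept = np.polyfit(np.log10(params), accuracy, 1)

def implied_params(acc: float) -> float:
    return 10 ** ((acc - intercept) / slope)

# With these made-up points, a model scoring 55% would sit in the trillions.
print(f"{implied_params(55.0):.1e}")
```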
Yeah.Micah [00:39:07]: But that is, on its own, actually a very interesting one, right? Which is that, chances are, the last couple of years haven't seen a dramatic scaling up in the total size of these models. And so there's a lot of room to go up properly in total size of the models, especially with the upcoming hardware generations. Yes.swyx [00:39:29]: So, you know, taking off my shitposting hat for a minute. Yes. Yes. At the same time, I do feel like, especially coming back from Europe, people do feel like Ilya is probably right that the paradigm doesn't have many more orders of magnitude to scale out, and therefore we need to start exploring at least a different path. GDPval, I think, is only like a month or so old. I was also very positive on it when it first came out. I actually talked to Tejal, who was the lead researcher on that. Oh, cool. And you have your own version.George [00:39:59]: It's a fantastic dataset. Yeah.swyx [00:40:01]: And maybe we'll recap for people who are not across it. It's like 44 occupations, based on some kind of GDP cutoff, meant to represent broad white-collar work that is not just coding. Yeah.Micah [00:40:12]: Each of the tasks has a whole bunch of detailed instructions, and some input files for a lot of them. Within the 44, it's divided into, like, 220, maybe 225, subtasks, which are the level that we run through the agentic harness. And yeah, they're really interesting. I will say that it doesn't necessarily capture all the stuff that people do at work. No eval is perfect; there are always going to be more things to look at. Largely because, in order to make the tasks well enough defined that you can run them, they need to only have a handful of input files and very specific instructions for that task. And so I think the easiest way to think about them is that they're like quite hard take-home exam tasks that you might do in an interview process.swyx [00:40:56]: Yeah, for listeners, it is no longer just a long prompt. It is, like, well, here's a zip file with a spreadsheet or a PowerPoint deck or a PDF; go nuts and answer this question.George [00:41:06]: OpenAI released a great dataset, and they released a good paper which looks at performance across the different web chatbots on the dataset. It's a great paper; I encourage people to read it. What we've done is taken that dataset and turned it into an eval that can be run on any model. So we created a reference agentic harness that can run the models on the dataset, and then we developed an evaluator approach to compare outputs. It's kind of AI-enabled, so it uses Gemini 3 Pro Preview to compare results, which we tested pretty comprehensively to ensure that it's aligned to human preferences. One data point there is that even with it as the evaluator, Gemini 3 Pro, interestingly, doesn't actually do that well. So that's kind of a good example of what we've done in GDPval AA.swyx [00:42:01]: Yeah, the thing that you have to watch out for with an LLM judge is self-preference, that models usually prefer their own output, and in this case, it was not the case.
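A criteria-based pairwise grader of the kind being described can be sketched as below. The prompt wording is an assumption, and judge() stands in for a call to whatever grading model you use:

```python
GRADING_PROMPT = """You are grading two attempts at the same task.
Task criteria:
{criteria}

Attempt A:
{a}

Attempt B:
{b}

Which attempt better satisfies the criteria? Reply with exactly 'A' or 'B'."""

def pairwise_winner(judge, criteria: str, out_a: str, out_b: str) -> str:
    verdict = judge(GRADING_PROMPT.format(criteria=criteria, a=out_a, b=out_b))
    # Swapping the A/B order on a second call and checking agreement is a
    # cheap control for the position bias mentioned in the conversation.
    return "A" if verdict.strip().upper().startswith("A") else "B"
```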
Totally.Micah [00:42:08]: I think the way we're thinking about the places where it makes sense to use an LLM-as-judge approach now is quite different to some of the early LLM-as-judge stuff a couple of years ago, because some of that (and MT-Bench was a great project that was a good example of this a while ago) was about judging conversations and a lot of style-type stuff. Here, the task that the grading model is doing is quite different to the task of taking the test. When you're taking the test, you've got all of the agentic tools you're working with, the code interpreter and web search, the file system, to go through many, many turns to try to create the documents. Then on the other side, when we're grading it, we're running it through a pipeline to extract visual and text versions of the files to be able to provide that to Gemini, and we're providing the criteria for the task and getting it to pick which of two potential outputs more effectively meets the criteria of the task. It turns out that it's just very, very good at getting that right; it matched human preference a lot of the time. I think that's because it's got the raw intelligence, but that's combined with the correct representation of the outputs, the fact that the outputs were created with an agentic task that is quite different to the way the grading model works, and the fact that we're comparing against criteria, not just zero-shot asking the model to pick which one is better.swyx [00:43:26]: Got it. Why is this an Elo and not a percentage, like GDPval?George [00:43:31]: So the outputs look like documents, and there are video outputs or audio outputs from some of the tasks. It has to make a video? Yeah, for some of the tasks.swyx [00:43:43]: What task is that?George [00:43:45]: I mean, it's in the dataset. Like, be a YouTuber? It's a marketing video.Micah [00:43:49]: Oh, wow. What? Like, the model has to go find clips on the internet and try to put it together. The models are not that good at doing that one, for now, to be clear. It's pretty hard to do that with a code editor; I mean, the computer use stuff doesn't work quite well enough, and so on, but yeah.George [00:44:02]: And so there's no kind of ground truth, necessarily, to compare against, to work out percentage correct. It's hard to come up with correct or incorrect there. So it's on a relative basis, and we use an Elo approach to compare outputs from each of the models across the tasks.swyx [00:44:23]: You know what you should do? You should pay a contractor, a human, to do the same tasks, and then give them an Elo, so you have a human on there. I think what's helpful about GDPval, the OpenAI one, is that 50% is meant to be a normal human, and maybe a domain expert is higher than that, but 50% was the bar: if you've crossed 50, you are superhuman. Yeah.Micah [00:44:47]: So we haven't grounded this score in that exactly. I agree that it can be helpful, but we wanted to generalize this to a very large number of models. It's one of the reasons that presenting it as an Elo is quite helpful: it allows us to add models, and it'll stay relevant for quite a long time. I also think it can be tricky looking at these exact tasks compared to human performance, because the way that you would go about it as a human is quite different to how the models would go about it.
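Turning those pairwise verdicts into ratings can be done with a standard Elo update, sketched below with hypothetical model names. A sequential update like this is order-dependent; leaderboards often refit something like a Bradley-Terry model over all comparisons instead, but the idea is the same: relative ratings without any ground-truth percentage.

```python
from collections import defaultdict

def elo_ratings(comparisons, k=32.0, base=1000.0):
    """comparisons is an iterable of (model_a, model_b, winner) tuples,
    e.g. as emitted by a pairwise grading pipeline."""
    ratings = defaultdict(lambda: base)
    for a, b, winner in comparisons:
        # Expected score for a under the standard Elo logistic curve.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = 1.0 if winner == a else 0.0
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

print(elo_ratings([("model-x", "model-y", "model-x"), ("model-y", "model-z", "model-y")]))
```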
Yeah.swyx [00:45:15]: I also liked that you included Llama 4 Maverick in there. Is that, like, just one last...Micah [00:45:20]: Well, no, it is the best model released by Meta. And so it makes it into the homepage default set, still, for now.George [00:45:31]: Another inclusion that's quite interesting is that we also ran it across the latest versions of the web chatbots. And so we have...swyx [00:45:39]: Oh, that's right.George [00:45:40]: Oh, sorry.swyx [00:45:41]: I, yeah, I completely missed that. Okay.George [00:45:43]: No, not at all. So that's the one with the checkered pattern. So that is their harness, not yours, is what you're saying. Exactly. And what's really interesting is that if you compare, for instance, Claude Opus 4.5 using the Claude web chatbot, it performs worse than the model in our agentic harness. And so, in every case, the model performs better in our agentic harness than its web chatbot counterpart, the harness that they created.swyx [00:46:13]: Oh, my backwards explanation for that would be that, well, it's meant for consumer use cases, and here you're pushing it toward something else.Micah [00:46:19]: The constraints are different, and the amount of freedom that you can give the model is different. Also, they have cost goals; we let the models work as long as they want, basically. Yeah. Do you copy-paste manually into the chatbot? Yeah, that was how we got the chatbot reference. We're not going to be keeping those updated at quite the same scale as hundreds of models.swyx [00:46:38]: Well, I don't know, talk to Browserbase; they'll automate it for you. You know, I have thought about, like, well, we should turn these chatbot versions into an API, because they are legitimately different agents in themselves. Yes. Right. Yeah.Micah [00:46:53]: And that's grown a huge amount over the last year, right? The tools that are available have actually diverged a fair bit, in my opinion, across the major chatbot apps, and the number of data sources that you can connect them to has gone up a lot, meaning that your experience and the way you're using the model is more different than ever.swyx [00:47:10]: What tools and what data connections come to mind when you say that? What's interesting, what's notable work that people have done?Micah [00:47:15]: Oh, okay. So my favorite example of this is that until very recently, I would argue that it was basically impossible to get an LLM to draft an email for me in any useful way. Because most times that you're sending an email, you're not just writing something for the sake of writing it. Chances are the context required is a whole bunch of historical emails. Maybe it's notes that you've made, maybe it's meeting notes, maybe it's pulling something from wherever you store stuff at work. So for me, that's Google Drive, OneDrive, or our Supabase databases if we need to do some analysis on some data or something. Preferably, the model can be plugged into all of those things and can go do some useful work based on them. The thing that I find most impressive currently, that I am somewhat surprised works really well in late 2025, is that I can have models use the Supabase MCP to query (read-only, of course) and run a whole bunch of SQL queries to do pretty significant data analysis, and make charts and stuff, and read my Gmail and my Notion. Okay, you actually use that. That's good. Is that a Claude thing?
To various degrees, both ChatGPT and Claude right now. I would say that this stuff, in fairness, barely works right now.George [00:48:33]: Because people are actually going to try this after they hear it. If you get an email from Micah, odds are it wasn't written by a chatbot.Micah [00:48:38]: So, yeah, I think it is true that I have never actually sent anyone an email drafted by a chatbot. Yet.swyx [00:48:46]: And so you can feel it, right? And yeah, this time next year, we'll come back and see where it's going. Totally. Supabase: shout out to another famous Kiwi. I don't know if you've had any conversations with him about anything in particular on AI building and AI infra.George [00:49:03]: We have had Twitter DMs with him, because we're quite big Supabase users and power users. And we probably do some things more manually than we should in Supabase, and lean on the support line a little bit, with them being super friendly. One extra point regarding GDPval AA is that, on the basis of the overperformance of the models compared to the chatbots, we realized that, oh, the reference harness that we built actually works quite well on generalist agentic tasks. This proves it, in a sense. And so the agent harness is very minimalist. I think it follows some of the ideas that are in Claude Code, and all that we give it is context management capabilities, a web search and web browsing tool, and a code execution environment. Anything else?Micah [00:50:02]: I mean, we can equip it with more tools, but by default, yeah, that's it. For GDPval, we give it a tool to view an image specifically, because the models, you know, can just use a terminal to pull stuff in text form into context, but to pull visual stuff into context, we had to give them a custom tool. But yeah, exactly.George [00:50:21]: So it turned out that we created a good generalist agentic harness, and we released it on GitHub yesterday. It's called Stirrup. So if people want to check it out, it's a great base for building a generalist agent for more specific tasks.Micah [00:50:39]: I'd say the best way to use it is git clone, and then have your favorite coding agent make changes to it to do whatever you want, because it's not that many lines of code, and the coding agents can work with it super well.swyx [00:50:51]: Well, that's nice for the community to explore and share and hack on. I think in other, similar environments, the Terminal-Bench guys have done sort of the same with Harbor. It's a bundle of: we need a minimal harness, which for them is Terminus, and we also need the RL environments, or a Docker deployment thing, to run independently. I don't know if you've looked at Harbor at all; is that like a standard that people want to adopt?George [00:51:19]: Yeah, we've looked at it from an evals perspective, and we love Terminal-Bench and host benchmarks of Terminal-Bench on Artificial Analysis. We've looked at it from a coding agent perspective, but could see it being a great basis for any kind of agent. I think where we're getting to is that these models have gotten smart enough.
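A minimal sketch of that kind of harness loop is below. This is not Stirrup's actual code (that is on GitHub); chat() is a hypothetical stand-in for any chat-completions client that returns {"content": ..., "tool": ..., "args": ...}, and the two tools are stubs showing the shape of web search plus code execution with crude context management:

```python
import subprocess

def web_search(query: str) -> str:
    return "stubbed results for: " + query  # plug in a real search API here

def run_code(source: str) -> str:
    # Execute model-written Python and hand back whatever it printed.
    proc = subprocess.run(["python", "-c", source], capture_output=True, text=True, timeout=60)
    return proc.stdout + proc.stderr

TOOLS = {"web_search": web_search, "run_code": run_code}

def agent_loop(chat, task: str, max_turns: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = chat(messages)
        if reply.get("tool") is None:
            return reply["content"]  # the model decided it is done
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "assistant", "content": reply["content"]})
        # Crude context management: truncate bulky tool output before it
        # re-enters the model's context.
        messages.append({"role": "tool", "content": result[:20_000]})
    return "max turns reached"
```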
They've gotten better tools, and they can perform better when just given a minimalist set of tools and let run: let the model control the agentic workflow, rather than using another framework that's a bit more built out and tries to dictate the flow. Awesome.swyx [00:51:56]: Let's cover the openness index, and then let's go into the report stuff. So that's the last of the proprietary numbers, I guess. I don't know how you sort of classify all these. Yeah.Micah [00:52:07]: Let's call it the last of the three new things that we're talking about from the last few weeks. Because we do a mix of stuff: stuff where we're using open source, stuff where we open source what we do, and proprietary stuff that we don't always open source. The long context reasoning dataset last year, we did open source. And then, of all the work on performance benchmarks across the site, some of it we're looking to open source, but some of it we're constantly iterating on, and so on. So there's a huge mix, I would say, of stuff that is open source and not, across the site. So that's AA-LCR, for people. Yeah, yeah.swyx [00:52:41]: But let's talk about openness.Micah [00:52:42]: Let's talk about the openness index. This here is, call it, a new way to think about how open models are. We have, for a long time, tracked whether the models are open weights and what the licenses on them are. And that's pretty useful; it tells you what you're allowed to do with the weights of a model. But there is this whole other dimension to how open models are that is pretty important and that we haven't tracked until now, and that's how much is disclosed about how a model was made. So, transparency about data (pre-training data and post-training data, and whether you're allowed to use that data) and transparency about methodology and training code. Basically, those are the components. We bring them together to score an openness index for models, so that you can, in one place, get the full picture of how open models are.swyx [00:53:32]: I feel like I've seen a couple of other people try to do this, but they're not maintained. I do think this matters. I don't know what the numbers mean, apart from: is there a max number? Is this out of 20?George [00:53:44]: It's out of 18 currently. We've got an openness index page, but essentially these are points: you get points for being more open across these different categories, and the maximum you can achieve is 18. So AI2, with their extremely open Olmo 3 32B Think model, is the leader, in a sense.swyx [00:54:04]: What about Hugging Face?George [00:54:05]: Oh, with their smaller model. It's coming soon. I think we need to get the intelligence benchmarks run to get it on the site.swyx [00:54:12]: You can't have an openness index and not include Hugging Face. We love Hugging Face. We'll have that up very soon. I mean, you know, RefinedWeb and all that stuff. It's amazing. Or is it called FineWeb? FineWeb. FineWeb.Micah [00:54:23]: Yeah, yeah, no, totally. Yep. One of the reasons this is cool, right, is that if you're trying to understand the holistic picture of the models, and what you can do with all the stuff the company's contributing, this gives you that picture. And so we are going to keep it up to date alongside all the models that we do the intelligence index on, on the site.
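A toy version of that points-based scoring is below. The categories and point values are illustrative stand-ins, not the actual 18-point rubric:

```python
# Hypothetical disclosure categories, each worth a few points.
CRITERIA = {
    "open_weights": 3,
    "permissive_license": 3,
    "pretraining_data_disclosed": 3,
    "posttraining_data_disclosed": 3,
    "training_code_released": 3,
    "methodology_reported": 3,
}

def openness_score(disclosures: dict[str, bool]) -> int:
    """disclosures maps criterion name to whether the model meets it; a fully
    open release would score 18 under this toy rubric."""
    return sum(pts for name, pts in CRITERIA.items() if disclosures.get(name))

print(openness_score({"open_weights": True, "permissive_license": True}))  # 6
```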
And it's just an extra view to understand them.swyx [00:54:43]: Can you scroll down to this? The trade-offs chart. Yeah, yeah. That one. Yeah. This really matters, right? Obviously, because you can b

The New Stack Podcast
Human Cognition Can't Keep Up with Modern Networks. What's Next?

The New Stack Podcast

Play Episode Listen Later Jan 7, 2026 23:16


IBM's recent acquisitions of Red Hat, HashiCorp, and its planned purchase of Confluent reflect a deliberate strategy to build the infrastructure required for enterprise AI. According to IBM's Sanil Nambiar, AI depends on consistent hybrid cloud runtimes (Red Hat), programmable and automated infrastructure (HashiCorp), and real-time, trustworthy data (Confluent). Without these foundations, AI cannot function effectively. Nambiar argues that modern, software-defined networks have become too complex for humans to manage alone, overwhelmed by fragmented data, escalating tool sophistication, and a widening skills gap that makes veteran "tribal knowledge" hard to transfer. Trust, he says, is the biggest barrier to AI adoption in networking, since errors can cause costly outages. To address this, IBM launched IBM Network Intelligence, a "network-native" AI solution that combines time-series foundation models with reasoning large language models. This architecture enables AI agents to detect subtle warning patterns, collapse incident response times, and deliver accurate, trustworthy insights for real-world network operations. Learn more from The New Stack about AI infrastructure and IBM's approach: AI in Network Observability: The Dawn of Network Intelligence; How Agentic AI Is Redefining Campus and Branch Network Needs. Join our community of newsletter subscribers to stay on top of the news and at the top of your game.

The top AI news from the past week, every ThursdAI
ThursdAI - Jan 1 2026 - Will Brown Interview + Nvidia buys Groq, Meta buys Manus, Qwen Image 2412 & Alex New Year greetings

The top AI news from the past week, every ThursdAI

Play Episode Listen Later Jan 1, 2026 29:42


Hey all, happy new year! This is Alex, writing to you at the very fresh start of the year. It's 2026 already, can you believe it? There was no live stream today; I figured the cohosts deserve a break, and honestly it was a very slow week. Even the Chinese labs, who don't really celebrate X-mas and New Year's, didn't come out with a banger, AFAIK. ThursdAI - AI moves fast, we're here to make sure you never miss a thing! Subscribe :) Though I thought it was an incredible opportunity to finally post the Will Brown interview I recorded in November during the AI Engineer conference. Will is a researcher at Prime Intellect (big fans over at WandB btw!) and is very well known on X as a hot-takes ML person, often going viral with tons of memes! Will is the creator and maintainer of the Verifiers library (Github), and his talk at AI Engineer was all about RL Environments (what they are, you can hear in the interview, I asked him!)

TL;DR last week of 2025 in AI

Besides this, my job here is to keep you up to date, and honestly this was very easy this week, as almost nothing has happened, but here we go:

Meta buys Manus
The year ended with 2 huge acquisitions / acquihires. First we got the news from Alex Wang that Meta has bought Manus.ai, an agentic AI startup we covered back in March, for an undisclosed amount (folks claim $2-3B). The most interesting thing here is that Manus is a Chinese company, and this deal requires very specific severance from Chinese operations.

Jensen goes on a New Year's spending spree, Nvidia buys Groq (not Grok) for $20B
Groq, which we covered often here, and who are great friends, is going to NVIDIA, in a… very interesting acqui-hire: a "non-binding license" plus most of Groq's top employees apparently going to NVIDIA. Jonathan Ross, the CEO of Groq, was the co-creator of the TPU chips at Google before founding Groq, so this seems like a very strategic acquihire for NVIDIA! Congrats to our friends from Groq on this amazing news for the new year!

Tencent open-sources HY-MT1.5 translation models with 1.8B edge-deployable and 7B cloud variants supporting 33 languages (X, HF, HF, GitHub)
It seems that everyone is trying to dethrone Whisper, and this latest attempt from Tencent is an interesting one: 1.8B and 7B translation models with very interesting stats.

Alibaba's Qwen-Image-2512 drops on New Year's Eve as the strongest open-source text-to-image model, topping AI Arena with photorealistic humans and sharper textures (X, HF, Arxiv)
Our friends at Tongyi decided to give us a New Year's present in the form of an updated Qwen-Image, with much improved realism.

That's it folks, this was a quick one. Hopefully you all had an amazing new year celebration and are gearing up for an eventful and crazy 2026. I wish you all happiness, excitement, and energy to keep up with everything in the new year, and we'll make sure that we're here to keep you up to date as always! P.S - I got a little news of my own yesterday, not related to AI. She said yes

The New Stack Podcast
From Group Science Project to Enterprise Service: Rethinking OpenTelemetry

The New Stack Podcast

Play Episode Listen Later Dec 30, 2025 17:20


Ari Zilka, founder of MyDecisive.ai and former Hortonworks CPO, argues that most observability vendors now offer essentially identical, reactive dashboards that highlight problems only after systems are already broken. After speaking with all 23 observability vendors at KubeCon + CloudNativeCon North America 2025, Zilka said these tools fail to meaningfully reduce mean time to resolution (MTTR), a long-standing demand he heard repeatedly from thousands of CIOs during his time at New Relic. Zilka believes observability must shift from reactive monitoring to proactive operations, where systems automatically respond to telemetry in real time. MyDecisive.ai is his attempt to solve this, acting as a "bump in the wire" that intercepts telemetry and uses AI-driven logic to trigger actions like rolling back faulty releases. He also criticized the rising cost and complexity of OpenTelemetry adoption, noting that many companies now require large, specialized teams just to maintain OTel stacks. MyDecisive aims to turn OpenTelemetry into an enterprise-ready service that reduces human intervention and operational overhead. Learn more from The New Stack about OpenTelemetry: Observability Is Stuck in the Past. Your Users Aren't.; Setting Up OpenTelemetry on the Frontend Because I Hate Myself; How to Make OpenTelemetry Better in the Browser. Join our community of newsletter subscribers to stay on top of the news and at the top of your game.

Algoritmi
Il PAGELLONE AI del 2025

Algoritmi

Play Episode Listen Later Dec 29, 2025 22:21


In this episode we go month by month through the year's key releases: from DeepSeek-R1 to Gemini 3. So much happened!

The Enrollify Podcast
Pulse Check: Building the Modern Campus - A Higher-Ed Project Management Playbook — Pt. 4

The Enrollify Podcast

Play Episode Listen Later Dec 25, 2025 34:36


Guest Name: Ben Burke, Senior Data Scientist, Slalom
Guest Social: https://www.linkedin.com/in/ben-burke-data/
Guest Bio: Ben is a Sr. Data Scientist and AI Engineer consultant developing Generative AI solutions for Fortune 1000 companies. He's known for his practical, human-centered approach to AI adoption, and for teaching professionals how to partner with AI to improve clarity, collaboration, and decision-making. His business, Between The Data, helps teams using AI 'build the right things'. You can find him on LinkedIn where he posts about AI, team formation, project management, and his family.
Connect With Our Host: Mallory Willsea, https://www.linkedin.com/in/mallorywillsea/ and https://twitter.com/mallorywillsea
About The Enrollify Podcast Network: The Higher Ed Pulse is a part of the Enrollify Podcast Network. If you like this podcast, chances are you'll like other Enrollify shows too! Enrollify is made possible by Element451, The AI Workforce Platform for Higher Ed. Learn more at element451.com.

The New Stack Podcast
Do All Your AI Workloads Actually Require Expensive GPUs?

The New Stack Podcast

Play Episode Listen Later Dec 18, 2025 29:49


GPUs dominate today's AI landscape, but Google argues they are not necessary for every workload. As AI adoption has grown, customers have increasingly demanded compute options that deliver high performance with lower cost and power consumption. Drawing on its long history of custom silicon, Google introduced Axion CPUs in 2024 to meet needs for massive scale, flexibility, and general-purpose computing alongside AI workloads. The Axion-based C4A instance is generally available, while the newer N4A virtual machines promise up to 2x price performance. In this episode, recorded at KubeCon + CloudNativeCon North America in Atlanta, Andrei Gueletii, a technical solutions consultant for Google Cloud, joined Gari Singh, a product manager for Google Kubernetes Engine (GKE), and Pranay Bakre, a principal solutions engineer at Arm. Built on Arm Neoverse V2 cores, Axion processors emphasize energy efficiency and customization, including flexible machine shapes that let users tailor memory and CPU resources. These features are particularly valuable for platform engineering teams, which must optimize centralized infrastructure for cost, FinOps goals, and price performance as they scale. Importantly, many AI tasks, such as inference for smaller models or batch-oriented jobs, do not require GPUs. CPUs can be more efficient when GPU memory is underutilized or latency demands are low. By decoupling workloads and choosing the right compute for each task, organizations can significantly reduce AI compute costs. Learn more from The New Stack about the Axion-based C4A: Beyond Speed: Why Your Next App Must Be Multi-Architecture; Arm: See a Demo About Migrating a x86-Based App to ARM64. Join our community of newsletter subscribers to stay on top of the news and at the top of your game.

The New Stack Podcast
Breaking Data Team Silos Is the Key to Getting AI to Production

The New Stack Podcast

Play Episode Listen Later Dec 17, 2025 30:47


Enterprises are racing to deploy AI services, but the teams responsible for running them in production are seeing familiar problems reemerge—most notably, silos between data scientists and operations teams, reminiscent of the old DevOps divide. In a discussion recorded at AWS re:Invent 2025, IBM's Thanos Matzanas and Martin Fuentes argue that the challenge isn't new technology but repeating organizational patterns. As data teams move from internal projects to revenue-critical, customer-facing applications, they face new pressures around reliability, observability, and accountability. The speakers stress that many existing observability and governance practices still apply. Standard metrics, KPIs, SLOs, access controls, and audit logs remain essential foundations, even as AI introduces non-determinism and a heavier reliance on human feedback to assess quality. Tools like OpenTelemetry provide common ground, but culture matters more than tooling. Both emphasize starting with business value and breaking down silos early by involving data teams in production discussions. Rather than replacing observability professionals, AI should augment human expertise, especially in critical systems where trust, safety, and compliance are paramount. Learn more from The New Stack about enabling AI beyond silos: Are Your AI Co-Pilots Trapping Data in Isolated Silos?; Break the AI Gridlock at the Intersection of Velocity and Trust; Taming AI Observability: Control Is the Key to Success. Join our community of newsletter subscribers to stay on top of the news and at the top of your game.

The New Stack Podcast
Why AI Parallelization Will Be One of the Biggest Challenges of 2026

The New Stack Podcast

Play Episode Listen Later Dec 16, 2025 24:05


Rob Whiteley, CEO of Coder, argues that the biggest winners in today's AI boom resemble the "picks and shovels" sellers of the California Gold Rush: companies that provide tools enabling others to build with AI. Speaking on The New Stack Makers at AWS re:Invent, Whiteley described the current AI moment as the fastest-moving shift he's seen in 25 years of tech. Developers are rapidly adopting AI tools, while platform teams face pressure to approve them, as saying "no" is no longer viable. Whiteley warns of a widening gap between organizations that extract real value from AI and those that don't, driven by skills shortages and insufficient investment in training. He sees parallels with the cloud-native transition and predicts the rise of "AI-native" companies. As agentic AI grows, developers increasingly act as managers overseeing many parallel AI agents, creating new challenges around governance, security, and state management. To address this, Coder introduced Mux, an open source coding agent multiplexer designed to help developers manage and evaluate large volumes of AI-generated code efficiently.

Learn more from The New Stack about AI parallelization:
The Production Generative AI Stack: Architecture and Components
Enable Parallel Frontend/Backend Development to Unlock Velocity

Join our community of newsletter subscribers to stay on top of the news and at the top of your game. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
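The "developer as manager of parallel agents" idea can be sketched in a few lines. This is a toy illustration with a stubbed agent, not Coder's Mux: it simply fans tasks out concurrently and gathers the output into one review queue.

```python
# A toy sketch of agent fan-out: launch several "agents" in parallel, then
# collect results for a single human review pass. run_agent() is a
# hypothetical stand-in for a long-running coding agent.
import asyncio

async def run_agent(task: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for minutes of real agent work
    return f"proposed patch for: {task}"

async def main() -> None:
    tasks = ["fix flaky test", "bump dependencies", "add retry logic"]
    # All agents run concurrently; the developer's job shifts to triage.
    results = await asyncio.gather(*(run_agent(t) for t in tasks))
    for task, result in zip(tasks, results):
        print(f"[review queue] {task} -> {result}")

asyncio.run(main())
```

The governance and state-management challenges Whiteley names show up precisely at the review-queue step, where many concurrent outputs converge on one person.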

The New Stack Podcast
Kubernetes GPU Management Just Got a Major Upgrade

The New Stack Podcast

Play Episode Listen Later Dec 11, 2025 35:26


Nvidia Distinguished Engineer Kevin Klues noted that low-level systems work is invisible when done well and highly visible when it fails — a dynamic that frames current Kubernetes innovations for AI. At KubeCon + CloudNativeCon North America 2025, Klues and AWS product manager Jesse Butler discussed two emerging capabilities: dynamic resource allocation (DRA) and a new workload abstraction designed for sophisticated AI scheduling.

DRA, now generally available in Kubernetes 1.34, fixes long-standing limitations in GPU requests. Instead of simply asking for a number of GPUs, users can specify types and configurations. Modeled after persistent volumes, DRA allows any specialized hardware to be exposed through standardized interfaces, enabling vendors to deliver custom device drivers cleanly. Butler called it one of the most elegant designs in Kubernetes.

Yet complex AI workloads require more coordination. A forthcoming workload abstraction, debuting in Kubernetes 1.35, will let users define pod groups with strict scheduling and topology rules — ensuring multi-node jobs start fully or not at all. Klues emphasized that this abstraction will shape Kubernetes' AI trajectory for the next decade and encouraged community involvement.

Learn more from The New Stack about dynamic resource allocation:
Kubernetes Primer: Dynamic Resource Allocation (DRA) for GPU Workloads
Kubernetes v1.34 Introduces Benefits but Also New Blind Spots

Join our community of newsletter subscribers to stay on top of the news and at the top of your game. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
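As a rough illustration of what DRA changes, here is the shape of a ResourceClaim built as a Python dict. The field names follow upstream DRA examples from memory and may differ between API versions, and the device class name is hypothetical; treat this as a sketch, not a reference manifest.

```python
# A sketch of the DRA idea: instead of "give me 2 GPUs", a ResourceClaim names
# a device class. Field names are from memory and may vary by API version;
# gpu.example.com is a hypothetical device class.
import json

resource_claim = {
    "apiVersion": "resource.k8s.io/v1",
    "kind": "ResourceClaim",
    "metadata": {"name": "single-gpu", "namespace": "default"},
    "spec": {
        "devices": {
            "requests": [
                {
                    "name": "gpu",
                    # Ask for one device of a specific class rather than an
                    # opaque GPU count, which is the limitation DRA removes.
                    "exactly": {"deviceClassName": "gpu.example.com"},
                }
            ]
        }
    },
}

# Pods then reference the claim by name instead of requesting raw GPU counts.
print(json.dumps(resource_claim, indent=2))
```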

Practical AI
The AI engineer skills gap

Practical AI

Play Episode Listen Later Dec 10, 2025 45:33 Transcription Available


Chris and Daniel talk with returning guest Ramin Mohammadi about how those seeking AI engineer and data science jobs are now expected to come in as mid-level engineers, not entry level. They explore this growing gap, along with what should (or could) be done in academia to focus on real-world skills versus theoretical knowledge.

Featuring:
Ramin Mohammadi – LinkedIn
Chris Benson – Website, LinkedIn, Bluesky, GitHub, X
Daniel Whitenack – Website, GitHub, X

Sponsors:
Shopify – The commerce platform trusted by millions. From idea to checkout, Shopify gives you everything you need to launch and scale your business, no matter your level of experience. Build beautiful storefronts, market with built-in AI tools, and tap into the platform powering 10% of all U.S. eCommerce. Start your one-dollar trial at shopify.com/practicalai

Upcoming Events: Register for upcoming webinars here!

The New Stack Podcast
The Rise of the Cognitive Architect

The New Stack Podcast

Play Episode Listen Later Dec 10, 2025 22:53


At KubeCon North America 2025, GitLab's Emilio Salvador outlined how developers are shifting from individual coders to leaders of hybrid human–AI teams. He envisions developers evolving into "cognitive architects," responsible for breaking down large, complex problems and distributing work across both AI agents and humans. Complementing this is the emerging role of the "AI guardian," reflecting growing skepticism around AI-generated code. Even as AI produces more code, humans remain accountable for reviewing quality, security, and compliance.

Salvador also described GitLab's "AI paradox": developers may code faster with AI, but overall productivity stalls because testing, security, and compliance processes haven't kept pace. To fix this, he argues organizations must apply AI across the entire development lifecycle, not just in coding. GitLab's Duo Agent Platform aims to support that end-to-end transformation.

Looking ahead, Salvador predicts the rise of a proactive "meta agent" that functions like a full team member. Still, he warns that enterprise adoption remains slow and advises organizations to start small, build skills, and scale gradually.

Learn more from The New Stack about the evolving role of "cognitive architects":
The Engineer in the AI Age: The Orchestrator and Architect
The New Role of Enterprise Architecture in the AI Era
The Architect's Guide to Understanding Agentic AI

Join our community of newsletter subscribers to stay on top of the news and at the top of your game. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

The New Stack Podcast
Why the CNCF's New Executive Director is Obsessed With Inference

The New Stack Podcast

Play Episode Listen Later Dec 9, 2025 25:09


Jonathan Bryce, the new CNCF executive director, argues that inference—not model training—will define the next decade of computing. Speaking at KubeCon North America 2025, he emphasized that while the industry obsesses over massive LLM training runs, the real opportunity lies in efficiently serving these models at scale. Cloud-native infrastructure, he says, is uniquely suited to this shift because inference requires real-time deployment, security, scaling, and observability—strengths of the CNCF ecosystem. Bryce believes Kubernetes is already central to modern inference stacks, with projects like Ray, KServe, and emerging GPU-oriented tooling enabling teams to deploy and operationalize models. To bring consistency to this fast-moving space, the CNCF launched a Kubernetes AI Conformance Program, ensuring environments support GPU workloads and Dynamic Resource Allocation. With AI agents poised to multiply inference demand by executing parallel, multi-step tasks, efficiency becomes essential. Bryce predicts that smaller, task-specific models and cloud-native routing optimizations will drive major performance gains. Ultimately, he sees CNCF technologies forming the foundation for what he calls "the biggest workload mankind will ever have."

Learn more from The New Stack about inference:
Confronting AI's Next Big Challenge: Inference Compute
Deep Infra Is Building an AI Inference Cloud for Developers

Join our community of newsletter subscribers to stay on top of the news and at the top of your game. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
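Bryce's prediction about smaller, task-specific models implies a simple serving pattern: route each request to the cheapest model that can handle it. Here is a toy sketch of that routing idea; the heuristic and model names are made up, and production routers use far richer signals (token counts, task type, past quality scores).

```python
# A toy sketch of inference routing: default to a small, task-specific model
# and reserve the large one for requests that look hard. The heuristic and
# model names are hypothetical.
def pick_model(prompt: str) -> str:
    hard_markers = ("explain why", "step by step", "multi-step", "plan a")
    if len(prompt) > 500 or any(m in prompt.lower() for m in hard_markers):
        return "large-general-model"
    return "small-task-model"

for p in ["Classify this ticket: login fails on mobile",
          "Plan a step by step migration of our billing service"]:
    print(pick_model(p), "<-", p)
```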

The New Stack Podcast
Kubernetes Gets an AI Conformance Program — and VMware Is Already On Board

The New Stack Podcast

Play Episode Listen Later Dec 8, 2025 30:40


The Cloud Native Computing Foundation has introduced the Certified Kubernetes AI Conformance Program to bring consistency to an increasingly fragmented AI ecosystem. Announced at KubeCon + CloudNativeCon North America 2025, the program establishes open, community-driven standards to ensure AI applications run reliably and portably across different Kubernetes platforms. VMware by Broadcom's vSphere Kubernetes Service (VKS) is among the first platforms to achieve certification.

In an interview with The New Stack, Broadcom leaders Dilpreet Bindra and Himanshu Singh explained that the program applies lessons from Kubernetes' early evolution, aiming to reduce the "muddiness" in AI tooling and improve cross-platform interoperability. They emphasized portability as a core value: organizations should be able to move AI workloads between public and private clouds with minimal friction.

VKS integrates tightly with vSphere, using Kubernetes APIs directly to manage infrastructure components declaratively. This approach, along with new add-on management capabilities, reflects Kubernetes' growing maturity. According to Bindra and Singh, this stability now enables enterprises to trust Kubernetes as a foundation for production-grade AI.

Learn more from The New Stack about Broadcom's latest updates with Kubernetes:
Has VMware Finally Caught Up with Kubernetes?
VMware VCF 9.0 Finally Unifies Container and VM Management

Join our community of newsletter subscribers to stay on top of the news and at the top of your game. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

Supermanagers
AI Automates Email, Meetings & Internal Workflows with Mike Potter

Supermanagers

Play Episode Listen Later Dec 4, 2025 51:44


Aydin sits down with Mike Potter, CEO and co-founder of Rewind, to talk about how AI is changing both the risk and opportunity landscape for SaaS companies. They cover how AI agents are now deleting real customer data, why backup is more critical than ever, and how Rewind became an AI-native org with dedicated AI ownership, monthly Lunch & Learns, and real internal workflows.

Mike walks through the exact N8N workflows he uses to:
* Auto-triage his Gmail into multiple inboxes using AI
* Generate a daily AI brief based on tasks, calendar events, and past email context
* Analyze churn, win/loss, and internal product data using Claude and MCP

They close with Mike's "dream automation": a full AI-generated business review that looks across financials, CRM data, and benchmarks.

Timestamps:
0:00 — Welcome to the show
0:31 — Mike's intro & what Rewind backs up across SaaS ecosystems
1:40 — AI agents as a new failure mode and how Rewind "saves you from your AI"
4:05 — Turning Rewind into an AI-native company early on
4:53 — First attempt at AI-built integrations (why it failed then, why it might work now)
7:23 — Developers trading tedious integration maintenance for more interesting AI work
9:45 — Code vs architecture: the Shopify webhooks story and handling 1.1B+ events
14:03 — Hiring an AI Engineer: scope, responsibilities, and why background mattered
15:33 — How Rewind drove AI adoption: Lunch & Learns, "use it in your personal life," experimentation
20:53 — How AI Lunch & Learns actually run across multiple offices and remote folks
23:10 — Examples: CS tools, Alloy prototypes, AI video voiceovers, end-to-end workflows
25:13 — Churn workflows: combining uninstall reasons from multiple marketplaces into Claude
27:06 — Win/loss and internal analytics using Claude Projects + MCP server into an internal DB
29:14 — Choosing between Claude, ChatGPT, and Gemini depending on the task (and re-testing every few months)
31:23 — Mike's Gmail system: multiple inboxes + N8N + AI classification
36:07 — Inside the email-classifier prompt and AI-powered spam that beats Gmail filters
41:34 — The "Daily AI Brief": pulling tasks, meetings, and prior email threads into a single morning email
45:02 — Letting AI write and debug N8N workflows (and how assistants in tools are getting better)
48:58 — Wishlist: automated AI business review across finance, Salesforce, and SaaS benchmarks
51:23 — Closing thoughts: so many useful tools are possible, but GTM is the hard part

Tools & Technologies Mentioned:
Rewind – Backup and restore for mission-critical SaaS applications.
Claude – LLM used for analysis, projects, agents, and internal tools.
ChatGPT / OpenAI (GPT-4.1, GPT-4.1 mini) – LLMs used for code, prompts, and workflow JSON.
N8N – Automation platform used to build email and daily-brief workflows.
Gmail – Email client where AI-powered labels drive multiple inboxes.
Google Calendar – Calendar data powering the daily AI agenda.
Google Tasks – Task list feeding into the morning brief email.
MCP (Model Context Protocol) – Connects Claude to Rewind's internal databases.
Alloy – Tool for building interactive product UI prototypes.
Salesforce – CRM used for pipeline, churn, and win/loss analysis.
Gumloop – Workflow tool with an embedded AI assistant.
Zapier – Automation platform referenced for plain-English workflow creation.
Fellow – AI meeting assistant for summaries, action items, and insights.

Subscribe at thisnewway.com to get the step-by-step playbooks, tools, and workflows.
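As a rough Python rendering of the triage step (the episode builds it in N8N), here is a minimal classifier using the openai package, which exposes the GPT-4.1 mini model mentioned above. It assumes an OPENAI_API_KEY is set; the label set and prompt wording are hypothetical, not Mike's actual classifier.

```python
# A minimal sketch of AI email triage: ask a small model for exactly one label,
# then fall back safely on anything unexpected. Labels and prompt are made up.
from openai import OpenAI

LABELS = ["respond-today", "fyi", "newsletter", "spam"]
client = OpenAI()

def classify_email(subject: str, body: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system",
             "content": f"Classify the email into exactly one label: {', '.join(LABELS)}. "
                        "Reply with the label only."},
            {"role": "user", "content": f"Subject: {subject}\n\n{body}"},
        ],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in LABELS else "fyi"  # never trust free-form output blindly

print(classify_email("Renewal notice", "Your SaaS backup plan renews Friday."))
```

In an N8N-style setup, the returned label would drive a Gmail node that applies the matching inbox label, which is the "multiple inboxes" effect described in the timestamps.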

Data Hackers
What do you need to know about the AI Engineer career? Data Hackers Podcast #118

Data Hackers

Play Episode Listen Later Dec 3, 2025 56:10


The AI Engineer career has become one of the most sought-after in the tech market. But what is actually expected of this professional in practice?

In this episode of Data Hackers, we discuss in depth the path to becoming an AI Engineer, looking at the key technical skills, the differences from other data and engineering roles, academic training versus hands-on experience, the day-to-day routine inside companies, and the impact of generative AI, RAG, and AI agents on the role.

To enrich the debate, we use data from the State of Data Brazil survey as a baseline to understand the current Brazilian market, identify trends in skill demand, the most sought-after professional profiles, and the main challenges faced by anyone looking to enter or grow in this career.

If you want to move into AI, prepare for real opportunities, or figure out whether this is your next career step in data, this episode is for you. Don't forget to fill out the State of Data Brazil survey: https://www.stateofdata.com.br/

Our Data Hackers panel:
Paulo Vasconcellos – Co-founder of Data Hackers and Principal Data Scientist at Hotmart.
Gabriel Lages – Co-founder of Data Hackers and Director of Data & AI at Hotmart.

The New Stack Podcast
Helm 4: What's New in the Open Source Kubernetes Package Manager?

The New Stack Podcast

Play Episode Listen Later Dec 3, 2025 24:45


Helm — originally a hackathon project called Kate's Place — turned 10 in 2025, marking the milestone with the release of Helm 4, its first major update in six years. Created by Matt Butcher and colleagues as a playful take on "K8s," the early project won a small prize but quickly grew into a serious effort when Deis leadership recognized the need for a Kubernetes package manager. Renamed Helm, it rapidly expanded with community contributors and became one of the first CNCF graduating projects.

Helm 4 reflects years of accumulated design debt and evolving use cases. After the rapid iterations of Helm 1, 2, and 3, the latest version modernizes logging, improves dependency management, and introduces WebAssembly-based plugins for cross-platform portability—addressing the growing diversity of operating systems and architectures. Beyond headline features, maintainers emphasize that mature projects increasingly deliver "boring" but essential improvements, such as better logging, which simplify workflows and integrate more cleanly with other tools. Helm's re-architected internals also lay the foundation for new chart and package capabilities in upcoming 4.x releases.

Learn more from The New Stack about Helm:
The Super Helm Chart: To Deploy or Not To Deploy?
Kubernetes Gets a New Resource Orchestrator in the Form of Kro

Join our community of newsletter subscribers to stay on top of the news and at the top of your game. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

The New Stack Podcast
All About Cedar, an Open Source Solution for Fine-Tuning Kubernetes Authorization

The New Stack Podcast

Play Episode Listen Later Dec 2, 2025 16:13


Kubernetes has relied on role-based access control (RBAC) since 2017, but its simplicity limits what developers can express, said Micah Hausler, principal engineer at AWS, on The New Stack Makers. RBAC only allows actions; it can't enforce conditions, denials, or attribute-based rules. Seeking a more expressive authorization model for Kubernetes, Hausler explored Cedar, an authorization engine and policy language created at AWS in 2022 and later open-sourced. Although not designed specifically for Kubernetes, Cedar proved capable of modeling its authorization needs in a concise, readable way. Hausler highlighted Cedar's clarity—nontechnical users can often understand policies at a glance—as well as its schema validation, autocomplete support, and formal verification, which ensures policies are correct and produce only allow or deny outcomes.

Now onboarding to the CNCF sandbox, Cedar is used by companies like Cloudflare and MongoDB and offers language-agnostic tooling, including a Go implementation donated by StrongDM. The project is actively seeking contributors, especially to expand bindings for languages like TypeScript, JavaScript, and Python.

Learn more from The New Stack about Cedar:
Ceph: 20 Years of Cutting-Edge Storage at the Edge
The Cedar Programming Language: Authorization Simplified

Join our community of newsletter subscribers to stay on top of the news and at the top of your game. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
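To make the RBAC contrast concrete, here is a sketch of two Cedar policies held as Python strings (evaluating them requires a Cedar engine, such as the Go implementation mentioned above or a Python binding). The entity types and the namespace attribute are hypothetical, not an official Kubernetes mapping, but the forbid-with-condition shape is exactly what RBAC cannot express.

```python
# Two Cedar policies as strings, for illustration only; a Cedar engine is
# needed to evaluate them. Entity types and attributes are hypothetical.
ALLOW_LIST_PODS = """
permit (
    principal == User::"alice",
    action == Action::"list",
    resource == Resource::"pods"
);
"""

# The piece RBAC lacks: an explicit, conditional deny. In Cedar, a forbid
# always overrides any permit (deny-overrides semantics).
DENY_KUBE_SYSTEM = """
forbid (principal, action, resource)
when { resource.namespace == "kube-system" };
"""

print(ALLOW_LIST_PODS)
print(DENY_KUBE_SYSTEM)
```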

The New Stack Podcast
2026 Will Be the Year of Agentic Workloads in Production on Amazon EKS

The New Stack Podcast

Play Episode Listen Later Nov 28, 2025 23:16


AWS's approach to Elastic Kubernetes Service has evolved significantly since its 2018 launch. According to Mike Stefanik, Senior Manager of Product Management for EKS and ECR, today's users increasingly represent the late majority—teams that want Kubernetes without managing every component themselves. In a conversation on The New Stack Makers, Stefanik described how AI workloads are reshaping Kubernetes operations and why AWS open-sourced an MCP server for EKS. Early feedback showed that meaningful, task-oriented tool names—not simple API mirrors—made MCP servers more effective for LLMs, prompting AWS to design tools focused on troubleshooting, runbooks, and full application workflows. AWS also introduced a hosted knowledge base built from years of support cases to power more capable agents.

While "agentic AI" gets plenty of buzz, most customers still rely on human-in-the-loop workflows. Stefanik expects that to shift, predicting 2026 as the year agentic workloads move into production. For experimentation, he recommends the open-source Strands SDK. Internally, he has already seen major productivity gains from BI agents that automate complex data analysis tasks.

Learn more from The New Stack about Amazon Web Services' approach to Elastic Kubernetes Service:
How Amazon EKS Auto Mode Simplifies Kubernetes Cluster Management (Part 1)
A Deep Dive Into Amazon EKS Auto (Part 2)

Join our community of newsletter subscribers to stay on top of the news and at the top of your game. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
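The tool-naming lesson is easy to illustrate. Below is a minimal sketch assuming the official MCP Python SDK (pip install mcp); the tool name, parameters, and reply are hypothetical, not AWS's actual EKS MCP server.

```python
# A sketch of task-oriented MCP tool design: expose "troubleshoot_workload"
# rather than a raw API mirror like "describe_pod". The tool body is a
# hypothetical stand-in.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("eks-helper")

@mcp.tool()
def troubleshoot_workload(cluster: str, deployment: str) -> str:
    """Diagnose why a deployment is unhealthy and suggest next steps."""
    # A real tool would gather events, logs, and runbook guidance; the
    # meaningful name and docstring are what help an LLM pick it correctly.
    return f"Checked {deployment} on {cluster}: no restarts in the last hour."

if __name__ == "__main__":
    mcp.run()
```

The design choice mirrors the episode's feedback: an LLM choosing among tools reads names and descriptions the way a human reads a menu, so task verbs beat API nouns.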

The New Stack Podcast
Amazon CTO Werner Vogels' Predictions for 2026

The New Stack Podcast

Play Episode Listen Later Nov 25, 2025 54:43


AWS re:Invent has long featured CTO Werner Vogels' closing keynote, but this year he signaled it may be his last, emphasizing it's time for "younger voices" at Amazon. After 21 years with the company, Vogels reflected on arriving as an academic and being stunned by Amazon's technical scale—an energy that still drives him today. He released his annual predictions ahead of re:Invent, with this year's five themes focused heavily on AI and broader societal impacts.

Vogels highlights technology's growing role in addressing loneliness, noting how devices like Alexa can offer comfort to those who feel isolated. He foresees a "Renaissance developer," where engineers must pair deep expertise with broad business and creative awareness. He warns quantum-safe encryption is becoming urgent as data harvested today may be decrypted within five years. Military innovations, he notes, continue to influence civilian tech, for better and worse. Finally, he argues personalized learning can preserve children's curiosity and better support teachers, which he views as essential for future education.

Learn more from The New Stack about the evolving role of technology systems from past to future:
Werner Vogels' 6 Lessons for Keeping Systems Simple
50 Years Later: Remembering How the Future Looked in 1974

Join our community of newsletter subscribers to stay on top of the news and at the top of your game. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

The RevOps Review
"AI Won't Fix Broken Processes": GTM Strategy, RevOps, and the Rise of the AI Engineer with Kristina McMillan

The RevOps Review

Play Episode Listen Later Nov 14, 2025 23:47


In this episode, Kristina McMillan, Executive in Residence at Scale Venture Partners, shares what she's seeing across Scale's portfolio when it comes to AI adoption in revenue teams. From the rise of the go-to-market engineer to the three levels of AI maturity, Kristina breaks down what's working, what's hype, and why RevOps needs to lead with strategy, not just tools. We also get into AI's real impact on metrics like ARR per employee, the role of internal AI hackathons, and how top teams are choosing between building and buying. If you're feeling overwhelmed by the pace of change, this episode will give you clarity and a tactical playbook.

Develop Yourself
#287 - From Smoothie King to AI Engineer

Develop Yourself

Play Episode Listen Later Nov 13, 2025 19:19 Transcription Available


Ryan is a current student at Parsity who built an app for his employer, Smoothie King, to suggest drinks in a chat interface using a powerful and lesser-known AI technology: RAG. RAG stands for retrieval augmented generation: basically, providing information (like smoothie recipes) to an AI model so it can return a highly specific response. Ryan breaks down how he finds the time to build side projects like this and how he built this app.

Want to build your own AI-powered app? Check out this project: parsity.io/ai-with-rag

Connect with Ryan here: https://www.linkedin.com/in/rhardin378/
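For readers new to the pattern, here is a toy retrieve-then-generate sketch. Retrieval is naive word overlap over made-up recipes and the generation step is stubbed; a real app like Ryan's would use embeddings for retrieval and an actual model call for the answer.

```python
# A toy RAG sketch: retrieve the most relevant "documents" (recipes) for a
# question, then hand them to the model as context. Recipes are made up.
RECIPES = {
    "Berry Blast": "strawberries blueberries banana apple juice",
    "Green Machine": "spinach kale mango pineapple coconut water",
    "Peanut Power": "peanut butter banana oats whole milk",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    words = set(question.lower().split())
    # Score each recipe by word overlap with the question; real systems use
    # vector similarity over embeddings instead.
    scored = sorted(
        RECIPES.items(),
        key=lambda item: len(words & set(item[1].split())),
        reverse=True,
    )
    return [f"{name}: {ingredients}" for name, ingredients in scored[:k]]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    # Stub: a real app would send this assembled prompt to an LLM.
    return f"Context given to the model:\n{context}\n-> suggest the closest match."

print(answer("something with banana and peanut butter"))
```

The point of RAG is visible even in the stub: the model never has to memorize the menu, because the relevant recipes are fetched and injected at question time.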

TechTopia
Techtopia 386: What is vibe coding?

TechTopia

Play Episode Listen Later Nov 10, 2025 49:17


AI-assisted software development has moved from experiment to reality. But what works, and what is just hype?

Kasper Junge and Christian Bech Nørhave take you into the engine room, where AI is already part of the development team's everyday work. They share their experiences with AI in practice. It's not about hype, but about what actually works:

What AI can (and can't) actually do in software development
A shared language and shared processes: make AI a colleague, not a gadget
Speed requires direction: clear goals, code quality, and accountability
Use AI as a force multiplier, without losing control

Featuring:
Christian Bech Nørhave
20+ years of experience in digitalization consulting
200+ talks on AI
Building a Nordic MSP in partnership with Devoteam

Kasper Junge
AI Engineer at Dinero
Host of the Verbos Podcast
Nordic AI Influencer DAIR Award Winner

Link:
vibe-coding.dk

Open Tech Talks : Technology worth Talking| Blogging |Lifestyle

Building Career Resilience in the Age of Generative AI

Every week, we explore how AI and technology are changing the way we work and learn. This episode dives into the question I get asked the most: how is generative AI changing every career? Let's unpack why it matters, how it's shifting roles and skills, and what you can do to lead this change instead of chasing it.

In this solo episode of Open Tech Talks, host Kashif Manzoor, AI Engineer and Strategist, and author of AI Tech Circle, dives deep into one of the biggest career questions of our time: how is generative AI reshaping every profession? Whether you're a developer, analyst, marketer, finance expert, or operations lead, the rise of Gen AI is transforming how work gets done. Kashif combines real-world enterprise experience, current research from McKinsey and Goldman Sachs, and his personal journey building the Gen AI Maturity Framework and Portal to uncover how you can stay relevant, resilient, and ready for AI-driven change. He shares first-hand stories from his own AI adoption journey, how enterprise teams are shifting from cloud architecture to AI architecture and from isolated use cases to full-scale agentic AI strategies, and the lessons learned while guiding organizations through transformation. This episode is both a roadmap and a reflection: how to experiment weekly, build your portfolio, upskill smartly, reposition your role, and teach and share as you grow.

Episode # 173

What You'll Learn
Why generative AI matters now and how it differs from traditional AI
How tasks, roles, and careers are evolving across industries
Real-world examples from finance, marketing, and software engineering
The five practical steps to future-proof your career with Gen AI
Insights from McKinsey, ResearchGate, and Goldman Sachs on AI productivity impact
How to move from "knowing AI tools" to using AI strategically in daily work
A behind-the-scenes look at the creation of the Gen AI Maturity Framework
Why the future of work is not about jobs lost but roles transformed

External References
McKinsey Global Institute – Generative AI and the Future of Work
Deloitte – Generative AI and the Future of Work
Goldman Sachs – How Will AI Affect the Global Workforce
Robert Half – How GenAI Is Changing Creative Careers
Mäkelä & Stephany (2024) – Complement or Substitute?

DGTL Voices with Ed Marx
From AC/DC to AI... Engineer to CEO (ft. Eduardo Conrado)

DGTL Voices with Ed Marx

Play Episode Listen Later Oct 14, 2025 23:57


On this episode of DGTL Voices, Ed interviews Eduardo Conrado, the incoming CEO of Ascension, discussing his journey from engineering to healthcare leadership. They explore the role of data-driven insights and strategies for career growth. Eduardo shares his experiences and insights on how CIOs and technology leaders can effectively connect with operations to drive transformation in the healthcare sector.

UBC News World
What Is the Pathway to Become an AI Engineer? 5 Skills Developers Need Most

UBC News World

Play Episode Listen Later Oct 10, 2025 4:07


Is there a defined pathway to becoming an AI engineer? While school curricula are still taking shape, the must-have skills have been more or less identified; we tackle the major ones in this segment.

Find out more at https://interviewcamp.ai/

interviewcamp.ai
City: New York
Address: 430 Park Ave
Website: https://interviewcamp.ai

The New Stack Podcast
How the EU's Cyber Act Burdens Lone Open Source Developers

The New Stack Podcast

Play Episode Listen Later Sep 11, 2025 19:30


The European Union's Cyber Resilience Act (CRA) begins to take effect in October 2026, with the remaining requirements following in December 2027, and introduces significant cybersecurity compliance requirements for software vendors, including those who rely heavily on open source components. At the Open Source Summit Europe, Christopher "CRob" Robinson of the Open Source Security Foundation highlighted concerns about how these regulations could impact open source maintainers. Many open source projects begin as personal solutions to shared problems and grow in popularity, often ending up embedded in critical systems across industries like automotive and energy. Despite this widespread use—Robinson noted up to 97% of commercial software contains open source—these projects are frequently maintained by individuals or small teams with limited resources.

Developers often have no visibility into how their code is used, yet they're increasingly burdened by legal and compliance demands from downstream users, such as requests for Software Bills of Materials (SBOMs) and conformity assessments. The CRA raises the stakes, with potential penalties in the billions for noncompliance, putting immense pressure on the open source ecosystem.

Learn more from The New Stack about open source security:
Open Source Propels the Fall of Security by Obscurity
There Is Just One Way To Do Open Source Security: Together

Join our community of newsletter subscribers to stay on top of the news and at the top of your game.
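For readers unfamiliar with the artifact at the center of those downstream requests, here is a minimal SBOM sketched in Python. The shape loosely follows the CycloneDX JSON format from memory; the component names and versions are made up.

```python
# A sketch of a minimal software bill of materials (SBOM) in a
# CycloneDX-like JSON shape. Component names and versions are hypothetical.
import json

sbom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "version": 1,
    "components": [
        {"type": "library", "name": "libexample", "version": "2.4.1"},
        {"type": "library", "name": "tinyparser", "version": "0.9.0"},
    ],
}

# Downstream vendors ask maintainers for files like this so they can show
# regulators exactly which open source components ship in their products.
print(json.dumps(sbom, indent=2))
```

Producing and keeping such inventories current is exactly the kind of compliance work the CRA pushes onto projects that may be maintained by a single volunteer.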

Keys to the Commonwealth
E82 - Joseph Thacker, Leveraging AI's Impact in a Changing World

Keys to the Commonwealth

Play Episode Listen Later Sep 8, 2025 64:09


As a security researcher who specializes in application security and AI, Joseph Thacker shares his knowledge on the growing influence of AI in various aspects of our culture. He's the principal AI Engineer at AppOmni and has helped multiple Fortune 500 companies find vulnerabilities that could have cost them millions. He is incredibly knowledgeable and offers great insight into this growing industry.
_______________________________
Find Joseph Thacker on
LinkedIn: https://www.linkedin.com/in/josephthacker?original_referer=https%3A%2F%2Fwww.google.com%2F
His new website and course for parents: https://aisafetyforparents.com/
X: @rez0__
Instagram: @thackandforth
Website: https://josephthacker.com/
_______________________________
Show hosted by Landry Fields
https://www.x.com/landryfieldz
https://www.linkedin.com/in/landryfields/
https://www.instagram.com/landryfields_
https://www.youtube.com/@landryfields_
www.novainsurancegroup.com
859-687-2004

PodRocket - A web development podcast from LogRocket
Navigating the AI bubble, the 10x AI engineer, and the Cloudflare vs. Perplexity data grab

PodRocket - A web development podcast from LogRocket

Play Episode Listen Later Aug 28, 2025 44:26


Is the AI industry an unsustainable bubble built on burning billions in cash? We break down the AI hype cycle, the tough job market for developers, and whether a crash is on the horizon. In this panel discussion with Josh Goldberg, Paige Niedringhaus, Paul Mikulskis, and Noel Minchow, we tackle the biggest questions in tech today.

* We debate if AI is just another Web3-style hype cycle
* Why the "10x AI engineer" is a myth that ignores the reality of software development
* The ethical controversy around AI crawlers and data scraping, highlighted by Cloudflare's recent actions

Plus, we cover the latest industry news, including Vercel's powerful new AI SDK V5 and what GitHub's leadership shakeup means for the future of developers.

Resources
Anthropic Is Bleeding Out: https://www.wheresyoured.at/anthropic-is-bleeding-out
The Hater's Guide To The AI Bubble: https://www.wheresyoured.at/the-haters-gui
No, AI is not Making Engineers 10x as Productive: https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome
Cloudflare Is Blocking AI Crawlers by Default: https://www.wired.com/story/cloudflare-blocks-ai-crawlers-default
Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives: https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives
GitHub just got less independent at Microsoft after CEO resignation: https://www.theverge.com/news/757461/microsoft-github-thomas-dohmke-resignation-coreai-team-transition

Chapters
0:00 Is the AI Industry Burning Cash Unsustainably?
01:06 Anthropic and the "AI Bubble Euphoria"
04:42 How the AI Hype Cycle is Different from Web3 & VR
08:24 The Problem with "Slapping AI" on Every App
11:54 The "10x AI Engineer" is a Myth and Why
17:55 Real-World AI Success Stories
21:26 Cloudflare vs. AI Crawlers: The Ethics of Data Scraping
30:05 Vercel's New AI SDK V5: What's Changed?
33:45 GitHub's CEO Steps Down: What It Means for Developers
38:54 Hot Takes: The Future of AI Startups, the Job Market, and More

We want to hear from you! How did you find us? Did you see us on Twitter? In a newsletter? Or maybe we were recommended by a friend? Fill out our listener survey (https://t.co/oKVAEXipxu)! Let us know by sending an email to our producer, Em, at emily.kochanek@logrocket.com (mailto:emily.kochanek@logrocket.com), or tweet at us at PodRocketPod (https://twitter.com/PodRocketpod).

Follow us. Get free stickers. Follow us on Apple Podcasts, fill out this form (https://podrocket.logrocket.com/get-podrocket-stickers), and we'll send you free PodRocket stickers!

What does LogRocket do? LogRocket provides AI-first session replay and analytics that surfaces the UX and technical issues impacting user experiences. Start understanding where your users are struggling by trying it for free at LogRocket.com. Try LogRocket for free today. (https://logrocket.com/signup/?pdr)

Razib Khan's Unsupervised Learning
Nikolai Yakovenko: the $200 million AI engineer

Razib Khan's Unsupervised Learning

Play Episode Listen Later Aug 2, 2025 80:48


On this episode of Unsupervised Learning, in the wake of Elon Musk's xAI Grok chatbot turning anti-Semitic following a recent update, Razib catches up with Nikolai Yakovenko about the state of AI in the summer of 2025. Nearly three years after their first conversations on the topic, they catch up on everything since ChatGPT's release and the anticipation of massive macroeconomic transformations driven by the automation of knowledge work. Yakovenko is a former professional poker player and research scientist at Google, Twitter (now X) and Nvidia (now the first $4 trillion company). With more than a decade at the leading edge of computer science, Yakovenko has been at the forefront of the large-language-model revolution that was a necessary precursor to the rise of companies like OpenAI, Anthropic and Perplexity, as well as hundreds of smaller startups. Currently, he is the CEO of DeepNewz, an AI-driven news startup that leverages the latest models to retrieve the ground truth on news stories. Disclosure: Razib actively uses and recommends the service and is an advisor to the company.

Razib and Yakovenko first tackle why Mark Zuckerberg's Meta is offering individual pay packages north of $200 million, poaching some of OpenAI's top individual contributors. Yakovenko observes that Meta seems to be giving up on its open-source Llama project, its competitor to the models that underpin OpenAI and ChatGPT (he also comments that engineers at xAI seem disappointed in the latest version of Grok). Overall, though the pay packages of AI engineers and researchers are high, a big shakeout is underway as the massive companies with the money and the engineering researchers pull away from their competitors. Additionally, in terms of cutting-edge models, the US and China are the only two international players (Yakovenko notes parenthetically that Chinese engineers are also the primary labor base of American AI firms).

They also discuss how, almost three years after the current hype cycles around artificial intelligence began to crest, we are still no closer to the "artificial general intelligence" and "intelligence super-explosion" that Ray Kurzweil has been predicting for generations. AI is partially behind the rise of companies like Waymo that are on the verge of transforming the economy, but overall, even though AI is still casting around for its killer app, big tech has fully bought in and believes that the next decade will determine who wins the future.