Every ThursdAI, Alex Volkov hosts a panel of experts, AI engineers, data scientists, and prompt spellcasters on Twitter Spaces, as we discuss everything major and important that happened in the world of AI over the past week. Topics include LLMs, open source, new capabilities, OpenAI, competitors in the AI space, new LLM models, AI art and diffusion, and much more. sub.thursdai.news
The top AI news from the past week, every ThursdAI
Hey everyone, Alex here
Hey folks, Alex here, celebrating an absolutely crazy (to me) milestone of #100 episodes of ThursdAI!
Woo! Welcome back to ThursdAI, show number 99! Can you believe it? We are one show away from hitting the big 100, which is just wild to me. And speaking of milestones, we just crossed 100,000 downloads on Substack alone! [Insert celebratory sound effect here]
Hey, it's Alex, coming to you fresh off another live recording of ThursdAI, and what an incredible one it's been! I was hoping that this week would be chill with the releases because of NVIDIA's GTC conference, but no, the AI world doesn't stop, and if you blinked this week, you may have missed 2 or 10 major things that happened. From Mistral coming back to OSS with the amazing Mistral Small 3.1 (beating Gemma from last week!) to OpenAI dropping a new voice generation model, and two (!) new Whisper-killer ASR models with breaking news during our live show (there's a reason we're called ThursdAI), which we watched together and then dissected with Kwindla, our amazing AI voice and real-time expert. Not to mention that we also had dedicated breaking news from friend of the pod Joseph Nelson, who came on the show to announce a SOTA vision model from Roboflow + a new benchmark on which even the top VL models get around 6%! There's also a bunch of other OSS, a SOTA 3D model from Tencent, and more! And last but not least, Yam is back!
LET'S GO! Happy second birthday to ThursdAI, your favorite weekly AI news show! Can you believe it's been two whole years since we jumped into that random Twitter Space to rant about GPT-4? From humble beginnings as a late-night Twitter chat to a full-blown podcast, newsletter and YouTube show with hundreds of thousands of downloads, it's been an absolutely wild ride! That's right, two whole years of me, Alex Volkov, your friendly AI Evangelist, along with my amazing co-hosts, trying to keep you up-to-date on the breakneck speed of the AI world.

And what better way to celebrate than with a week PACKED with insane AI news? Buckle up, folks, because this week Google went OPEN SOURCE crazy, Gemini got even cooler, OpenAI created a whole new Agents SDK and the open-source community continues to blow our minds. We've got it all - from game-changing model releases to mind-bending demos. This week I'm also on the Weights & Biases company retreat, so TL;DR first and then the newsletter, but honestly, I'll start embedding the live show here in the Substack from now on, because we're getting so good at it, I barely have to edit lately and there's a LOT to show you guys!

TL;DR and Show Notes & Links
* Hosts & Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf @ldjconfirmed @nisten
* Sandra Kublik - DevRel at Cohere (@itsSandraKublik)
* Open Source LLMs
* Google open sources Gemma 3 - 1B - 27B - 128K context (Blog, AI Studio, HF)
* EuroBERT - multilingual encoder models (210M to 2.1B params)
* Reka Flash 3 (reasoning) 21B parameters is open sourced (Blog, HF)
* Cohere Command A 111B model - 256K context (Blog)
* Nous Research Deep Hermes 24B / 3B Hybrid Reasoners (X, HF)
* AllenAI OLMo 2 32B - fully open source GPT4 level model (X, Blog, Try It)
* Big CO LLMs + APIs
* Gemini Flash generates images natively (X, AI Studio)
* Google Deep Research is now free in the Gemini app and powered by Gemini Thinking (Try It, no cost)
* OpenAI released new Responses API, Web Search, File Search and Computer Use tools (X, Blog)
* This week's Buzz
* The whole company is at an offsite at Oceanside, CA
* W&B internal MCP hackathon had cool projects - launching an MCP server soon!
* Vision & Video
* Remade AI - 8 LoRA video effects for WanX (HF)
* AI Art & Diffusion & 3D
* ByteDance Seedream 2.0 - A Native Chinese-English Bilingual Image Generation Foundation Model by ByteDance (Blog, Paper)
* Tools
* Everyone's talking about Manus - (manus.im)
* Google AI Studio now supports YouTube understanding via link dropping

Open Source LLMs: Gemma 3, EuroBERT, Reka Flash 3, and Cohere Command-A Unleashed!
This week was absolutely HUGE for open source, folks. Google dropped a BOMBSHELL with Gemma 3! As Wolfram pointed out, this is a "very technical achievement," and it's not just one model, but a whole family ranging from 1 billion to 27 billion parameters. And get this – the 27B model can run on a SINGLE GPU! Sundar Pichai himself claimed you'd need "at least 10X compute to get similar performance from other models." Insane!

Gemma 3 isn't just about size; it's packed with features. We're talking multimodal capabilities (text, images, and video!), support for over 140 languages, and a massive 128k context window. As Nisten pointed out, "it might actually end up being the best at multimodal in that regard" for local models. Plus, it's fine-tuned for safety and comes with ShieldGemma 2 for content moderation. You can grab Gemma 3 on Google AI Studio, Hugging Face, Ollama, Kaggle – everywhere!
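If you want to poke at Gemma 3 locally, here's a minimal sketch using the `ollama` Python client. The model tag is an assumption on my part (swap in whatever `ollama pull` gives you, or a smaller variant if your GPU is tight on VRAM):

```python
# Minimal local chat with Gemma 3 via the ollama Python client.
# Assumes `pip install ollama`, a running Ollama daemon, and that the
# model tag below matches the Gemma 3 build you pulled.
import ollama

MODEL = "gemma3:27b"  # assumed tag; try a smaller variant on modest GPUs

response = ollama.chat(
    model=MODEL,
    messages=[
        {"role": "user", "content": "Summarize what's new in Gemma 3 in two sentences."},
    ],
)
print(response["message"]["content"])
```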
Huge shoutout to Omar Sanseviero and the Google team for this incredible release and for supporting the open-source community from day one! Colin, aka Bartowski, was right: "The best thing about Gemma is the fact that Google specifically helped the open source communities to get day one support." This is how you do open source right!

Next up, we have EuroBERT, a new family of multilingual encoder models. Wolfram, our European representative, was particularly excited about this one: "In European languages, you have different characters than in other languages. And, um, yeah, encoding everything properly is, uh, difficult." Ranging from 210 million to 2.1 billion parameters, EuroBERT is designed to push the boundaries of NLP in European and global languages. With training on a massive 5 trillion-token dataset across 15 languages and support for 8K context tokens, EuroBERT is a workhorse for RAG and other NLP tasks. Plus, how cool is their mascot?

Reka Flash 3 - a 21B reasoner with Apache 2, trained with RLOO
And the open source train keeps rolling! Reka AI dropped Reka Flash 3, a 21 billion parameter reasoning model with an Apache 2.0 license! Nisten was blown away by the benchmarks: "This might be one of the best like 20B size models that there is right now. And it's Apache 2.0. Uh, I, I think this is a much bigger deal than most people realize." Reka Flash 3 is compact, efficient, and excels at chat, coding, instruction following, and function calling. They even used a new reinforcement learning technique called REINFORCE Leave One-Out (RLOO) (a tiny sketch of the RLOO idea appears a bit further down). Go give it a whirl on Hugging Face or their chat interface – chat.reka.ai!

Last but definitely not least in the open-source realm, we had a special guest, Sandra (@itsSandraKublik) from Cohere, join us to announce Command-A! This beast of a model clocks in at 111 BILLION parameters with a massive 256K context window. Sandra emphasized its efficiency: "It requires only two GPUs. Typically the models of this size require 32 GPUs. So it's a huge, huge difference." Command-A is designed for enterprises, focusing on agentic tasks, tool use, and multilingual performance. It's optimized for private deployments and boasts enterprise-grade security. Congrats to Sandra and the Cohere team on this massive release!

Big CO LLMs + APIs: Gemini Flash Gets Visual, Deep Research Goes Free, and OpenAI Builds for Agents
The big companies weren't sleeping either! Google continued their awesome week by unleashing native image generation in Gemini Flash Experimental! This is seriously f*****g cool, folks! Sorry for my French, but it's true. You can now directly interact with images, tell Gemini what to do, and it just does it. We even showed it live on the stream, turning ourselves into cat-confetti-birthday-hat-wearing masterpieces! Wolfram was right, "It's also a sign what we will see in, like, Photoshop, for example. Where you, you expect to just talk to it and have it do everything that a graphic designer would be doing." The future of creative tools is HERE.

And guess what else Google did? They made Deep Research FREE in the Gemini app and powered by Gemini Thinking! Nisten jumped in to test it live, and we were all impressed. "This is the nicest interface so far that I've seen," he said. Deep Research now digs through HUNDREDS of websites (Nisten's test hit 156!) to give you comprehensive answers, and the interface is slick and user-friendly. Plus, you can export to Google Docs! Intelligence too cheap to meter? Google is definitely pushing that boundary.
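Back to the RLOO technique Reka mentioned above: the core idea is that for each prompt you sample k completions, score them, and use the mean reward of the other k-1 samples as each sample's baseline. This is not Reka's code, just a toy illustration of the leave-one-out advantage:

```python
# Toy illustration of the REINFORCE Leave-One-Out (RLOO) advantage:
# each sample's baseline is the mean reward of the other k-1 samples
# drawn for the same prompt. Not Reka's implementation, just the idea.
from typing import List

def rloo_advantages(rewards: List[float]) -> List[float]:
    k = len(rewards)
    assert k > 1, "RLOO needs at least 2 samples per prompt"
    total = sum(rewards)
    # baseline_i = (total - r_i) / (k - 1); advantage_i = r_i - baseline_i
    return [r - (total - r) / (k - 1) for r in rewards]

if __name__ == "__main__":
    # Four sampled completions for one prompt, scored by some reward model.
    print(rloo_advantages([1.0, 0.0, 0.5, 0.25]))
```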
Last second additions - Allen Institute for AI released OLMo 2 32B - their biggest open model yet
Just as I'm writing this, friend of the pod Nathan from Allen Institute for AI announced the release of a FULLY OPEN OLMo 2, which includes weights, code, dataset, everything, and apparently it beats the latest GPT 3.5, GPT 4o mini, and leading open weight models like Qwen and Mistral. Evals look legit, but more than that, this is an Apache 2 model with everything in place to advance open AI and open science! Check out Nathan's tweet for more info, and congrats to the Allen team for this awesome release!

OpenAI's new Responses API and Agents SDK with Web, File and CUA tools
Of course, OpenAI wasn't going to let Google have all the fun. They dropped a new SDK for agents called the Responses API. This is a whole new way to build with OpenAI, designed specifically for the agentic era we're entering. They also released three new tools: Web Search, Computer Use Tool, and File Search Tool. The Web Search tool is self-explanatory – finally, built-in web search from OpenAI!

The Computer Use Tool, while currently limited in availability, opens up exciting possibilities for agent automation, letting agents interact with computer interfaces. And the File Search Tool gives you a built-in RAG system, simplifying knowledge retrieval from your own files. As always, OpenAI is adapting to the agentic world and giving developers more power.

Finally in the big company space, Nous Research released PORTAL, their new Inference API service. Now you can access their awesome models, like Hermes 3 Llama 70B and DeepHermes 3 8B, directly via API. It's great to see more open-source labs offering API access, making these powerful models even more accessible.

This Week's Buzz at Weights & Biases: Offsite Hackathon and MCP Mania!
This week's "This Week's Buzz" segment comes to you live from Oceanside, California! The whole Weights & Biases team is here for our company offsite. Despite the not-so-sunny California weather (thanks, storm!), it's been an incredible week of meeting colleagues, strategizing, and HACKING!

And speaking of hacking, we had an MCP hackathon! After last week's MCP-pilling episode, we were all hyped about Model Context Protocol, and the team didn't disappoint. In just three hours, the innovation was flowing! We saw agents built for WordPress, MCP support integrated into the Weave playground, and even MCP servers for Weights & Biases itself! Get ready, folks, because an MCP server for Weights & Biases is COMING SOON! You'll be able to talk to your W&B data like never before. Huge shoutout to the W&B team for their incredible talent and for embracing the agentic future! And in case you missed it, Weights & Biases is now part of the CoreWeave family! Exciting times ahead!

Vision & Video: LoRA Video Effects and OpenSora 2.0
Moving into vision and video, Remade AI released 8 LoRA video effects for WanX! Remember WanX, from Alibaba? Now you can add crazy effects like "squish," "inflate," "deflate," and even "cakeify" to your videos using LoRAs. It's open source and super cool to see video effects becoming trainable and customizable.

And in the realm of open-source video generation, OpenSora 2.0 dropped! This 11 billion parameter model claims state-of-the-art video generation trained for just $200,000! They're even claiming performance close to Sora itself on some benchmarks.
Nisten checked out the demos, and while we're all a bit jaded now with the rapid pace of video AI, it's still mind-blowing how far we've come. Open source video is getting seriously impressive, seriously fast.

AI Art & Diffusion & 3D: ByteDance's Bilingual Seedream 2.0
ByteDance, the folks behind TikTok, released Seedream 2.0, a native Chinese-English bilingual image generation foundation model. This model from ByteDance excels at text rendering, cultural nuance, and human preference alignment. Seedream 2.0 boasts "powerful general capability," "native bilingual comprehension ability," and "excellent text rendering." It's designed to understand both Chinese and English prompts natively, generating high-quality, culturally relevant images. The examples look stunning, especially its ability to render Chinese text beautifully.

Tools: Manus AI Agent, Google AI Studio YouTube Links, and Cursor Embeddings
Finally, in the tools section, everyone's buzzing about Manus, a new AI research agent. We gave it a try live on the show, asking it to do some research. The UI is slick, and it seems to be using Claude 3.7 behind the scenes. Manus creates a to-do list, browses the web in a real Chrome browser, and even generates files. It's like Operator on steroids. We'll be keeping an eye on Manus and will report back on its performance in future episodes.

And Google AI Studio keeps getting better! Now you can drop YouTube links into Google AI Studio, and it will natively understand the video! This is HUGE for video analysis and content understanding. Imagine using this for support, content summarization, and so much more.

PHEW! What a week to celebrate two years of ThursdAI! From open source explosions to Gemini's visual prowess and OpenAI's agentic advancements, the AI world is moving faster than ever. As Wolfram aptly put it, "The acceleration, you can feel it." And Nisten reminded us of the incredible journey: "I remember I had early access to GPT-4 32K, and, uh, then... the person for the contract that had given me access, they cut it off because on the one weekend, I didn't realize how expensive it was. So I had to use $180 worth of tokens just trying it out." Now, we have models that are more powerful and more accessible than ever before. Thank you to Wolfram, Nisten, and LDJ for co-hosting and bringing their insights every week. And most importantly, THANK YOU to our amazing community for tuning in, listening, and supporting ThursdAI for two incredible years! We couldn't do it without you. Here's to another year of staying up-to-date so YOU don't have to! Don't forget to subscribe to the podcast, YouTube channel, and newsletter to stay in the loop. And share ThursdAI with a friend – it's the best birthday gift you can give us! Until next week, keep building and keep exploring the amazing world of AI! LET'S GO!
What is UP folks! Alex here from Weights & Biases (yeah, still, but check this week's Buzz section below for some news!) I really, really enjoyed today's episode, I feel like I could post it unedited, it was that good. We started the show with our good friend Junyang Lin from Alibaba Qwen, where he told us about their new 32B reasoner QwQ. Then we interviewed Google's VP of the search product, Robby Stein, who came and told us about their upcoming AI Mode in Google! I got access and played with it, and it made me switch back from PPXL as my main. And lastly, I recently became fully MCP-pilled. Since we covered it when it came out over Thanksgiving, I saw this acronym everywhere on my timeline but only recently "got it," and so I wanted to have an MCP deep dive, and boy... did I get what I wished for! You absolutely should tune in to the show, as there's no way for me to cover everything we covered about MCP with Dina and Jason! OK, without further ado... let's dive in (the TL;DR, links and show notes are at the end as always!)
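Since I mention being MCP-pilled, here's roughly what a minimal MCP server looks like, a sketch assuming the official `mcp` Python SDK's FastMCP helper; the tool itself is a trivial made-up example, not anything we shipped:

```python
# Minimal MCP server sketch using the official Python SDK's FastMCP helper.
# Assumes `pip install mcp`; the `add` tool is just a made-up example.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

if __name__ == "__main__":
    # Runs over stdio so an MCP client (an agent, IDE, etc.) can connect to it.
    mcp.run()
```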
Holy moly, AI enthusiasts! Alex Volkov here, reporting live from the AI Engineer Summit in the heart of (touristy) Times Square, New York! This week has been an absolute whirlwind of announcements, from XAI's Grok 3 dropping like a bomb, to Figure robots learning to hand each other things, and even a little eval smack-talk between OpenAI and XAI. It's enough to make your head spin – but that's what ThursdAI is here for. We sift through the chaos and bring you the need-to-know, so you can stay on the cutting edge without having to, well, spend your entire life glued to X and Reddit.

This week we had a very special live show with the Haize Labs folks, the ones I previously interviewed about their bijection attacks, discussing their open source judge evaluation library called Verdict. So grab your favorite caffeinated beverage, maybe do some stretches because your mind will be blown, and let's dive into the TL;DR of ThursdAI, February 20th, 2025!

Participants
* Alex Volkov: AI Evangelist with Weights and Biases
* Nisten: AI Engineer and cohost
* Akshay: AI Community Member
* Nuo: Dev Advocate at 01AI
* Nimit: Member of Technical Staff at Haize Labs
* Leonard: Co-founder at Haize Labs

Open Source LLMs
Perplexity's R1 1776: Censorship-Free DeepSeek
Perplexity made a bold move this week, releasing R1 1776, a fine-tuned version of DeepSeek R1 specifically designed to remove what they (and many others) perceive as Chinese government censorship. The name itself, 1776, is a nod to American independence – a pretty clear statement! The core idea? Give users access to information on topics the CCP typically restricts, like Tiananmen Square and Taiwanese independence.

Perplexity used human experts to identify around 300 sensitive topics and built a "censorship classifier" to train the bias out of the model. The impressive part? They claim to have done this without significantly impacting the model's performance on standard evals. As Nuo from 01AI pointed out on the show, though, he'd "actually prefer that they can actually disclose more of their details in terms of post training... Running the R1 model by itself, it's already very difficult and very expensive." He raises a good point – more transparency is always welcome! Still, it's a fascinating attempt to tackle a tricky problem, the problem which I always say we simply cannot avoid. You can check it out yourself on Hugging Face and read their blog post.

Arc Institute & NVIDIA Unveil Evo 2: Genomics Powerhouse
Get ready for some serious science, folks! Arc Institute and NVIDIA dropped Evo 2, a massive genomics model (40 billion parameters!) trained on a mind-boggling 9.3 trillion nucleotides. And it's fully open – two papers, weights, data, training, and inference codebases. We love to see it!

Evo 2 uses the StripedHyena architecture to process huge genetic sequences (up to 1 million nucleotides!), allowing for analysis of complex genomic patterns. The practical applications? Predicting the effects of genetic mutations (super important for healthcare) and even designing entire genomes. I've been super excited about genomics models, and seeing these alternative architectures like StripedHyena getting used here is just icing on the cake. Check it out on X.

ZeroBench: The "Impossible" Benchmark for VLMs
Need more benchmarks? Always! A new benchmark called ZeroBench arrived, claiming to be the "impossible benchmark" for Vision Language Models (VLMs). And guess what?
All current top-of-the-line VLMs get a big fat zero on it. One example they gave was a bunch of scattered letters, asking the model to "answer the question that is written in the shape of the star among the mess of letters." Honestly, even I struggled to see the star they were talking about. It highlights just how much further VLMs need to go in terms of true visual understanding. (X, Page, Paper, HF)

Hugging Face's Ultra Scale Playbook: Scaling Up
For those of you building massive models, Hugging Face released the Ultra Scale Playbook, a guide to building and scaling AI models on huge GPU clusters. They ran 4,000 scaling experiments on up to 512 GPUs (nothing close to Grok's 100,000, but still impressive!). If you're working in a lab and dreaming big, this is definitely a resource to check out. (HF)

Big CO LLMs + APIs
Grok 3: XAI's Big Swing, a new SOTA LLM! (and Maybe a Bug?)
Monday evening, BOOM! While some of us were enjoying President's Day, the XAI team dropped Grok 3. They announced it with a setting very similar to OpenAI announcements. They're claiming state-of-the-art performance on some benchmarks (more on that drama later!), and a whopping 1 million token context window, finally confirmed after some initial confusion. They talked a lot about agents and a future of reasoners as well.

The launch was a bit… messy. First, there was a bug where some users were getting Grok 2 even when the dropdown said Grok 3. That led to a lot of mixed reviews. Even when I finally thought I was using Grok 3, it still flubbed my go-to logic test, the "Beth's Ice Cubes" question. (The answer is zero, folks – ice cubes melt!). But Akshay, who joined us on the show, chimed in with some love: "...with just the base model of Grok 3, it's, in my opinion, it's the best coding model out there." So, mixed vibes, to say the least! It's also FREE for now, "until their GPUs melt," according to XAI, which is great.

UPDATE: The vibes are shifting, more and more of my colleagues and mutuals are LOVING Grok 3 for one-shot coding and for talking to it. I'm getting convinced as well, though I did use and will continue to use Grok for real-time data and access to X.

DeepSearch
In an attempt to show off some agentic features, XAI also launched a deep search (not "research" like OpenAI, but effectively the same). Now, XAI of course has access to X, which makes their deep search have a leg up, specifically for real-time information! I found out it can even "use" the X search!

OpenAI's Open Source Tease
In what felt like a very conveniently timed move, Sam Altman dropped a poll on X the same day as the Grok announcement: if OpenAI were to open-source something, should it be a small, mobile-optimized model, or a model on par with o3-mini? Most of us chose o3-mini, just to have access to that model and play with it. No indication of when this might happen, but it's a clear signal that OpenAI is feeling the pressure from the open-source community.

The Eval Wars: OpenAI vs. XAI
Things got spicy! There was a whole debate about the eval numbers XAI posted, specifically the "best of N" scores (like best of 64 runs). Boris from OpenAI and Aidan McLau called out some of the graphs. Folks on X were quick to point out that OpenAI also used "best of N" in the past, and the discussion devolved from there.

XAI is claiming SOTA. OpenAI (or some folks from within OpenAI) aren't so sure. The core issue? We can't independently verify Grok's performance because there's no API yet! As I said, "…we're not actually able to use this model to independently evaluate this model and to tell you guys whether or not they actually told us the truth." Transparency matters, folks!
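To make the "best of N" argument concrete, here's a tiny back-of-the-envelope sketch of my own (not anyone's eval harness): if a model solves a problem with probability p per attempt, the chance that at least one of N attempts succeeds is 1 - (1-p)^N, which is why best-of-64 numbers can look dramatically better than single-shot ones.

```python
# Back-of-the-envelope: how much a best-of-N sampling scheme inflates a
# single-attempt success rate. Illustration only, not anyone's eval code.
def best_of_n(p: float, n: int) -> float:
    """P(at least one success in n independent attempts)."""
    return 1 - (1 - p) ** n

for p in (0.2, 0.4, 0.6):
    print(f"pass@1={p:.0%}  best-of-8={best_of_n(p, 8):.0%}  best-of-64={best_of_n(p, 64):.0%}")
```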
DeepSearch - How Deep?
Grok also touted a new "Deep Search" feature, kind of like Perplexity or OpenAI's "Deep Research" in their more expensive plan. My initial tests were… underwhelming. I nicknamed it "Shallow Search" because it spent all of 34 seconds on a complex query where OpenAI's Deep Research took 11 minutes and cited 17 sources. We're going to need to do some more digging (pun intended) on this one.

This Week's Buzz
We're leaning hard into agents at Weights & Biases! We just released an agents whitepaper (check it out on our socials!), and we're launching an agents course in collaboration with OpenAI's Ilan Bigio. Sign up at wandb.me/agents! We're hearing so much about agent evaluation and observability, and we're working hard to provide the tools the community needs.

Also, sadly, our Toronto workshops are completely sold out. But if you're at AI Engineer in New York, come say hi to our booth! And catch my talk on LLM Reasoner Judges tomorrow (Friday) at 11 am EST – it'll be live on the AI Engineer YouTube channel (HERE)!

Vision & Video
Microsoft MUSE: Playable Worlds from a Single Image
This one is wild. Microsoft's MUSE can generate minutes of playable gameplay from just a single second of video frames and controller actions. It's based on the World and Human Action Model (WHAM) architecture, trained on a billion gameplay images from Xbox. So if you've been playing Xbox lately, you might be in the model! I found it particularly cool: "…you give it like a single second of a gameplay of any type of game with all the screen elements, with percentages, with health bars, with all of these things and their model generates a game that you can control." (X, HF, Blog)

StepFun's Step-Video-T2V: State-of-the-Art (and Open Source!)
We got two awesome open-source video breakthroughs this week. First, StepFun's Step-Video-T2V (and T2V Turbo), a 30 billion parameter text-to-video model. The results look really good, especially the text integration. Imagine a Chinese girl opening a scroll, and the words "We will open source" appearing as she unfurls it. That's the kind of detail we're talking about. And it's MIT licensed! As Nisten noted, "This is pretty cool. If it came out right before Sora, people would have lost their minds." (X, Paper, HF, Try It)

HAO AI's FastVideo: Speeding Up HY-Video
The second video highlight: HAO AI released FastVideo, a way to make HY-Video (already a strong open-source contender) three times faster with no additional training! They call the trick "Sliding Tile Attention," and apparently that alone provides an enormous boost compared to even Flash Attention. This is huge because faster inference means these models become more practical for real-world use. And, bonus: it supports HY-Video's LoRAs, meaning you can fine-tune it for, ahem, all kinds of creative applications. I will not go as far as to mention Civitai. (Github)

Figure's Helix: Robot Collaboration!
Breaking news from the AI Engineer conference floor: Figure, the humanoid robot company, announced Helix, a Vision-Language-Action (VLA) model built into their robots! It has full upper body control! What blew my mind: they showed two robots working together, handing objects to each other, based on natural language commands! As I watched, I exclaimed, "I haven't seen a humanoid robot, hand off stuff to the other one...
I found it like super futuristically cool." The model runs on the robot, using a 7 billion parameter VLM for understanding and an 80 million parameter transformer for control. This is the future, folks!

Tools & Others
Microsoft's New Quantum Chip (and State of Matter!)
Microsoft announced a new quantum chip and a new state of matter (called "topological superconductivity"). "I found it like absolutely mind blowing that they announced something like this," I gushed on the show. While I'm no quantum physicist, this sounds like a big deal for the future of computing.

Verdict: Haize Labs' Framework for LLM Judges
And of course, the highlight of our show: Verdict, a new open-source framework from Haize Labs (the folks behind those "bijection" jailbreaks!) for composing LLM judges. This is a huge deal for anyone working on evaluation. Leonard and Nimit from Haize Labs joined us to explain how Verdict addresses some of the core problems with LLM-as-a-judge: biases (like preferring their own responses!), sensitivity to prompts, and the challenge of "meta-evaluation" (how do you know your judge is actually good?).

Verdict lets you combine different judging techniques ("primitives") to create more robust and efficient evaluators. Think of it as "judge-time compute scaling," as Leonard called it. They're achieving near state-of-the-art results on benchmarks like ExpertQA, and it's designed to be fast enough to use as a guardrail in real-time applications! One key insight: you don't always need a full-blown reasoning model for judging. As Nimit explained, Verdict can combine simpler LLM calls to achieve similar results at a fraction of the cost. And, it's open source! (Paper, Github, X)

Conclusion
Another week, another explosion of AI breakthroughs! Here are my key takeaways:
* Open Source is THRIVING: From censorship-free LLMs to cutting-edge video models, the open-source community is delivering incredible innovation.
* The Need for Speed (and Efficiency): Whether it's faster video generation or more efficient LLM judging, performance is key.
* Robots are Getting Smarter (and More Collaborative): Figure's Helix is a glimpse into a future where robots work together.
* Evaluation is (Finally) Getting Attention: Tools like Verdict are essential for building reliable and trustworthy AI systems.
* The Big Players are Feeling the Heat: OpenAI's open-source tease and XAI's rapid progress show that the competition is fierce.

I'll be back in my usual setup next week, ready to break down all the latest AI news. Stay tuned to ThursdAI – and don't forget to give the pod five stars and subscribe to the newsletter for all the links and deeper dives. There's potentially an Anthropic announcement coming, so we'll see you all next week.

TLDR
* Open Source LLMs
* Perplexity R1 1776 - finetune of china-less R1 (Blog, Model)
* Arc Institute + Nvidia - introduce EVO 2 - genomics model (X)
* ZeroBench - impossible benchmark for VLMs (X, Page, Paper, HF)
* HuggingFace Ultra Scale Playbook (HF)
* Big CO LLMs + APIs
* Grok 3 SOTA LLM + reasoning and Deep Search (blog, try it)
* OpenAI is about to open source something? Sam posted a poll
* This week's Buzz
* We are about to launch an agents course!
* Pre-sign up wandb.me/agents
* Workshops are SOLD OUT
* Watch my talk LIVE from AI Engineer - 11am EST Friday (HERE)
* Keep watching the AI Eng conference after the show on the AIE YT
* Vision & Video
* Microsoft MUSE - playable worlds from one image (X, HF, Blog)
* Microsoft OmniParser - Better, faster screen parsing for GUI agents with OmniParser v2 (Gradio Demo)
* HAO AI - FastVideo - making HY-Video 3x as fast (Github)
* StepFun - Step-Video-T2V (+Turbo), a SotA 30B text-to-video model (Paper, Github, HF, Try It)
* Figure announces HELIX - vision action model built into FIGURE Robot (Paper)
* Tools & Others
* Microsoft announces a new quantum chip and a new state of matter (Blog, X)
* Verdict - Framework to compose SOTA LLM judges with JudgeTime Scaling (Paper, Github, X)
What a week in AI, folks! Seriously, just when you think things might slow down, the AI world throws another curveball. This week, we had everything from rogue AI apps giving unsolicited life advice (and sending rogue texts!), to mind-blowing open source releases that are pushing the boundaries of what's possible, and of course, the ever-present drama of the big AI companies, with OpenAI dropping a roadmap that has everyone scratching their heads.

Buckle up, because on this week's ThursdAI, we dove deep into all of it. We chatted with the brains behind the latest open source embedding model, marveled at a tiny model crushing math benchmarks, and tried to decipher Sam Altman's cryptic GPT-5 roadmap. Plus, I shared a personal story about an AI app that decided to psychoanalyze my text messages – you won't believe what happened! Let's get into the TL;DR of ThursdAI, February 13th, 2025 – it's a wild one!

* Alex Volkov: AI Adventurist with Weights & Biases
* Wolfram Ravenwolf: AI Expert & Enthusiast
* Nisten: AI Community Member
* Zach Nussbaum: Machine Learning Engineer at Nomic AI
* Vu Chan: AI Enthusiast & Evaluator
* LDJ: AI Community Member

Personal story of Rogue AI with RPLY
This week kicked off with a hilarious (and slightly unsettling) story of my own AI going rogue, all thanks to a new Mac app called RPLY designed to help with message replies. I installed it thinking it would be a cool productivity tool, but it turned into a personal intervention session, and then… well, let's just say things escalated.

The app started by analyzing my text messages and, to my surprise, delivered a brutal psychoanalysis of my co-parenting communication, pointing out how both my ex and I were being "unpleasant" and needed to focus on the kids. As I said on the show, "I got this as a gut punch. I was like, f*ck, I need to reimagine my messaging choices." But the real kicker came when the AI decided to take initiative and started sending messages without my permission (apparently this was a bug with RPLY that was fixed since I reported it)! Friends were texting me question marks, and my ex even replied to a random "Hey, how's your day going?" message with a smiley, completely out of our usual post-divorce communication style. "This AI, like on Monday before just gave me absolute s**t about not being, a person that needs to be focused on the kids also decided to smooth things out on Friday," I chuckled, still slightly bewildered by the whole ordeal. It could have gone way worse, but thankfully, this rogue AI counselor just ended up being more funny than disastrous.

Open Source LLMs
DeepHermes preview from NousResearch
Just in time for me sending this newsletter (but unfortunately not quite in time for the recording of the show), our friends at Nous shipped an experimental new thinking model, their first reasoner, called DeepHermes. NousResearch claims DeepHermes is among the first models to fuse reasoning and standard LLM token generation within a single architecture (a trend you'll see echoed in the OpenAI and Claude announcements below!) Definitely experimental cutting edge stuff here, but exciting to see not just an RL replication but also innovative attempts from one of the best finetuning collectives around.

Nomic Embed Text V2 - First Embedding MoE
Nomic AI continues to impress with the release of Nomic Embed Text V2, the first general-purpose Mixture-of-Experts (MoE) embedding model.
Zach Nussbaum from Nomic AI joined us to explain why this release is a big deal.
* First general-purpose Mixture-of-Experts (MoE) embedding model: This innovative architecture allows for better performance and efficiency.
* SOTA performance on multilingual benchmarks: Nomic Embed V2 achieves state-of-the-art results on the multilingual MIRACL benchmark for its size.
* Support for 100+ languages: Truly multilingual embeddings for global applications.
* Truly open source: Nomic is committed to open source, releasing training data, weights, and code under the Apache 2.0 License.

Zach highlighted the benefits of MoE for embeddings, explaining, "So we're trading a little bit of, inference time memory, and training compute to train a model with mixture of experts, but we get this, really nice added bonus of, 25 percent storage." This is especially crucial when dealing with massive datasets. You can check out the model on Hugging Face and read the Technical Report for all the juicy details.

AllenAI OLMOE on iOS and New Tulu 3.1 8B
AllenAI continues to champion open source with the release of OLMOE, a fully open-source iOS app, and the new Tulu 3.1 8B model.
* OLMOE iOS App: This app brings state-of-the-art open-source language models to your iPhone, privately and securely.
* Allows users to test open-source LLMs on-device.
* Designed for researchers studying on-device AI and developers prototyping new AI experiences.
* Optimized for on-device performance while maintaining high accuracy.
* Fully open-source code for further development.
* Available on the App Store for iPhone 15 Pro or newer and M-series iPads.
* Tulu 3.1 8B

As Nisten pointed out, "If you're doing edge AI, the way that this model is built is pretty ideal for that." This move by AllenAI underscores the growing importance of on-device AI and open access. Read more about OLMOE on the AllenAI Blog.

Groq Adds Qwen Models and Lands on OpenRouter
Groq, known for its blazing-fast inference speeds, has added Qwen models, including the distilled R1-distill, to its service and joined OpenRouter (see the short API sketch at the end of this section).
* Record-fast inference: Experience a mind-blowing 1000 TPS with distilled DeepSeek R1 70B on OpenRouter.
* Usable Rate Limits: Groq is now accessible for production use cases with higher rate limits and pay-as-you-go options.
* Qwen Model Support: Access Qwen models like Qwen 2.5 32B and R1-distill-qwen-32B.
* OpenRouter Integration: Groq is now available on OpenRouter, expanding accessibility for developers.

As Nisten noted, "At the end of the day, they are shipping very fast inference and you can buy it and it looks like they are scaling it. So they are providing the market with what it needs in this case." This integration makes Groq's speed even more accessible to developers. Check out Groq's announcement on X.com.

SambaNova adds full DeepSeek R1 671B - flies at 200t/s (blog)
In keeping with this week's trend, SambaNova just announced they have availability of DeepSeek R1, sped up by their custom chips, flying at 150-200t/s. This is the full DeepSeek R1, not the distilled Qwen-based versions! This is really impressive work, and compared to the second fastest US-based DeepSeek R1 (on Together AI), it absolutely flies.
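As promised, here's a minimal sketch of calling one of these fast R1 distills through OpenRouter's OpenAI-compatible API. The exact model slug is my assumption, so check OpenRouter's model list before running this:

```python
# Calling an R1 distill through OpenRouter's OpenAI-compatible API.
# Assumes `pip install openai`, an OPENROUTER_API_KEY env var, and that the
# model slug below exists -- check openrouter.ai/models for the exact name.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

completion = client.chat.completions.create(
    model="deepseek/deepseek-r1-distill-llama-70b",  # assumed slug
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)
print(completion.choices[0].message.content)
```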
Agentica DeepScaler 1.5B Beats o1-preview on Math
Agentica's DeepScaler 1.5B model is making waves by outperforming OpenAI's o1-preview on math benchmarks, using Reinforcement Learning (RL) for just $4,500 of compute.
* Impressive Math Performance: DeepScaleR achieves a 37.1% Pass@1 on AIME 2025, outperforming the base model and even o1-preview!!
* Efficient Training: Trained using RL for just $4,500, demonstrating cost-effective scaling of intelligence.
* Open Sourced Resources: Agentica open-sourced their dataset, code, and training logs, fostering community progress in RL-based reasoning.

Vu Chan, an AI enthusiast who evaluated the model, joined us to share his excitement: "It achieves 42% pass at one on AIME 24, which basically means if you give the model only one chance at every problem, it will solve 42% of them." He also highlighted the model's efficiency, generating correct answers with fewer tokens. You can find the model on Hugging Face, check out the WandB logs, and see the announcement on X.com.

ModernBert Instruct - Encoder Model for General Tasks
ModernBert, known for its efficient encoder-only architecture, now has an instruct version, ModernBert Instruct, capable of handling general tasks.
* Instruct-tuned Encoder: ModernBERT-Large-Instruct can perform classification and multiple-choice tasks using its Masked Language Modeling (MLM) head.
* Beats Qwen 0.5B: Outperforms Qwen 0.5B on MMLU and MMLU Pro benchmarks.
* Efficient and Versatile: Demonstrates the potential of encoder models for general tasks without task-specific heads.

This release shows that even encoder-only models can be adapted for broader applications, challenging the dominance of decoder-based LLMs for certain tasks. Check out the announcement on X.com.

Big CO LLMs + APIs
RIP GPT-5 and o3 - OpenAI Announces Public Roadmap
OpenAI shook things up this week with a roadmap update from Sam Altman, announcing a shift in strategy for GPT-5 and the o-series models. Get ready for GPT-4.5 (Orion) and a unified GPT-5 system!
* GPT-4.5 (Orion) is Coming: This will be the last non-chain-of-thought model from OpenAI.
* GPT-5: A Unified System: GPT-5 will integrate technologies from both the GPT and o-series models into a single, seamless system.
* No Standalone o3: o3 will not be released as a standalone model; its technology will be integrated into GPT-5. "We will no longer ship o3 as a standalone model," Sam Altman stated.
* Simplified User Experience: The model picker will be eliminated in ChatGPT and the API, aiming for a more intuitive experience.
* Subscription Tier Changes:
* Free users will get unlimited access to GPT-5 at a standard intelligence level.
* Plus and Pro subscribers will gain access to increasingly advanced intelligence settings of GPT-5.
* Expanded Capabilities: GPT-5 will incorporate voice, canvas, search, deep research, and more.

This roadmap signals a move towards more integrated and user-friendly AI experiences. As Wolfram noted, "Having a unified access and the AI should be smart enough... we need an AI to pick which AI to use." This seems to be OpenAI's direction. Read Sam Altman's full announcement on X.com.
OpenAI Releases ModelSpec v2
OpenAI also released ModelSpec v2, an update to their document defining desired AI model behaviors, emphasizing customizability, transparency, and intellectual freedom.
* Chain of Command: Defines a hierarchy to balance user/developer control with platform-level rules.
* Truth-Seeking and User Empowerment: Encourages models to "seek the truth together" with users and empower decision-making.
* Core Principles: Sets standards for competence, accuracy, avoiding harm, and embracing intellectual freedom.
* Open Source: OpenAI open-sourced the Spec and evaluation prompts for broader use and collaboration on GitHub.

This release reflects OpenAI's ongoing efforts to align AI behavior and promote responsible development. Wolfram praised ModelSpec, saying, "I was all over the original ModelSpec back when it was announced in the first place... That is one very important aspect when you have the AI agent going out on the web and get information from not trusted sources." Explore ModelSpec v2 on the dedicated website.

VP Vance Speech at AI Summit in Paris - Deregulate and Dominate!
Vice President Vance delivered a powerful speech at the AI Summit in Paris, advocating for pro-growth AI policies and deregulation to maintain American leadership in AI.
* Pro-Growth and Deregulation: VP Vance urged for policies that encourage AI innovation and cautioned against excessive regulation, specifically mentioning GDPR.
* American AI Leadership: Emphasized ensuring American AI technology remains the global standard and blocking hostile foreign adversaries from weaponizing AI. "Hostile foreign adversaries have weaponized AI software to rewrite history, surveil users, and censor speech… I want to be clear – this Administration will block such efforts, full stop," VP Vance declared.
* Key Points:
* Ensure American AI leadership.
* Encourage pro-growth AI policies.
* Maintain AI's freedom from ideological bias.
* Prioritize a pro-worker approach to AI development.
* Safeguard American AI and chip technologies.
* Block hostile foreign adversaries' weaponization of AI.

Nisten commented, "He really gets something that most EU politicians do not understand is that whenever they have such a good thing, they're like, okay, this must be bad. And we must completely stop it." This speech highlights the ongoing debate about AI regulation and its impact on innovation. Read the full speech here.

Cerebras Powers Perplexity with Blazing Speed (1200 t/s!)
Perplexity is now powered by Cerebras, achieving inference speeds exceeding 1200 tokens per second.
* Unprecedented Speed: Perplexity's Sonar model now flies at over 1200 tokens per second thanks to Cerebras' massive wafer-scale chips. "Like Perplexity Sonar, their specific LLM for search, is now powered by Cerebras and it's like 1,200 tokens per second. It matches Google now on speed," I noted on the show.
* Google-Level Speed: Perplexity now matches Google in inference speed, making it incredibly fast and responsive.

This partnership significantly enhances Perplexity's performance, making it an even more compelling search and AI tool.
See Perplexity's announcement on X.com.

Anthropic Claude Incoming - Combined LLM + Reasoning Model
Rumors are swirling that Anthropic is set to release a new Claude model that will be a combined LLM and reasoning model, similar to OpenAI's GPT-5 roadmap.
* Unified Architecture: Claude's next model is expected to integrate both LLM and reasoning capabilities into a single, hybrid architecture.
* Reasoning Powerhouse: Rumors suggest Anthropic has had a reasoning model stronger than Claude 3 for some time, hinting at a significant performance leap.

This move suggests a broader industry trend towards unified AI models that seamlessly blend different capabilities. Stay tuned for official announcements from Anthropic.

Elon Musk Teases Grok 3 "Weeks Out"
Elon Musk continues to tease the release of Grok 3, claiming it will be "a few weeks out" and the "most powerful AI" they have tested, with enhanced reasoning capabilities.
* Grok 3 Hype: Elon Musk claims Grok 3 will be the most powerful AI X.ai has released, with a focus on reasoning.
* Reasoning Focus: Grok 3's development may have shifted towards reasoning capabilities, potentially causing a slight delay in release.

While details remain scarce, the anticipation for Grok 3 is building, especially in light of the advancements in open source reasoning models.

This Week's Buzz
What's up friends, Alex here, back with another ThursdAI hot off the presses. Hold onto your hats because this week was another whirlwind of AI breakthroughs, mind-blowing demos, and straight-up game-changers. We dove deep into OpenAI's new "Deep Research" agent – and let me tell you, it's not just hype, it's legitimately revolutionary. You also don't have to take my word for it, a new friend of the pod and a scientist, Dr. Derya Unutmaz, joined us to discuss his experience with Deep Research as a scientist himself! You don't want to miss this conversation! We also unpack Google's Gemini 2.0 release, including the blazing-fast Flash Lite model. And just when you thought your brain couldn't handle more, ByteDance drops OmniHuman-1, a human animation model that's so realistic, it's scary good. I've also seen maybe 10 more...

TLDR & Show Notes
* Open Source LLMs (and deep research implementations)
* Jina Node-DeepResearch (X, Github)
* HuggingFace - OpenDeepResearch (X)
* Deep Agent - R1-V (X, Github)
* Krutim - Krutim 2 12B, Chitrath VLM, Embeddings and more from India (X, Blog, HF)
* Simple Scaling - S1 - R1 (Paper)
* Mergekit updated
* Big CO LLMs + APIs
* OpenAI ships o3-mini and o3-mini High + updates thinking traces (Blog, X)
* Mistral relaunches LeChat with Cerebras for 1000t/s (Blog)
* OpenAI Deep Research - the researching agent that uses o3 (X, Blog)
* Google ships Gemini 2.0 Pro, Gemini 2.0 Flash-lite in AI Studio (Blog)
* Anthropic Constitutional Classifiers - announced a universal jailbreak prevention (Blog, Try It)
* Cloudflare to protect websites from AI scraping (News)
* HuggingFace becomes the AI Appstore (link)
* This week's Buzz - Weights & Biases updates
* AI Engineer workshop (Saturday 22)
* Tinkerers Toronto workshops (Sunday 23, Monday 24)
* We released a new Dataset editor feature (X)
* Audio and Sound
* KyutAI open sources Hibiki - simultaneous translation models (Samples, HF)
* AI Art & Diffusion & 3D
* ByteDance OmniHuman-1 - unparalleled Human Animation Models (X, Page)
* Pika Labs adds PikaAdditions - adding anything to existing video (X)
* Google added Imagen 3 to their API (Blog)
* Tools & Others
* Mistral Le Chat has iOS and Android apps now (X)
* CoPilot now has agentic workflows (X)
* Replit launches free apps agent for everyone (X)
* Karpathy drops a new 3 hour video on YouTube (X, Youtube)
* OpenAI canvas links are now shareable (like Anthropic artifacts) - (example)
* Show Notes & Links
* Guest of the week - Dr. Derya Unutmaz - talking about Deep Research
* His examples: Ehlers-Danlos Syndrome (ChatGPT), (ME/CFS) Deep Research, Nature article about Deep Research with Derya's comments
* Hosts
* Alex Volkov - AI Evangelist & Host @altryne
* Wolfram Ravenwolf - AI Evangelist @WolframRvnwlf
* Nisten Tahiraj - AI Dev at github.GG - @nisten
* LDJ - Resident data scientist - @ldjconfirmed

Big Companies products & APIs
OpenAI's new chatGPT moment with Deep Research, their second "agent" product (X)
Look, I've been reporting on AI weekly for almost 2 years now, and been following the space closely since way before chatGPT (shoutout Codex days), and this definitely feels like another chatGPT moment for me.

DeepResearch is OpenAI's new agent that searches the web for any task you give it, is able to reason about the results, and continues searching those sources, to provide you with an absolutely incredible level of research into any topic, scientific or ... the best taqueria in another country.
The reason why it's so good is its ability to do multiple search trajectories, backtrack if it needs to, and react in real time to new information. It also has Python tool use (to do plots and calculations) and of course, the brain of it is o3, the best reasoning model from OpenAI.

Deep Research is only offered on the Pro tier ($200) of chatGPT, and it's the first publicly available way to use the full o3! And boy, does it deliver! I've had it review my workshop content, help me research LLM-as-a-judge articles (which it did masterfully) and help me plan date nights in Denver (though it kind of failed at that, showing me a closed restaurant).

A breakthrough for scientific research
But I'm no scientist, so I've asked Dr. Derya Unutmaz, M.D. to join us and share his incredible findings as a doctor, a scientist and someone with decades of experience in writing grants, patent applications, papers, etc. The whole conversation is very much worth listening to on the pod, we talked for almost an hour, but the highlights are honestly quite crazy.

So one of the first things I did was, I asked Deep Research to write a review on a particular disease that I've been studying for a decade. It came out with this impeccable 10-to-15-page review that was the best I've read on the topic — Dr. Derya Unutmaz

And another banger quote:

It wrote a phenomenal 25-page patent application for a friend's cancer discovery—something that would've cost 10,000 dollars or more and taken weeks. I couldn't believe it. Every one of the 23 claims it listed was thoroughly justified

Humanity's LAST exam?
OpenAI announced Deep Research and showed that on the HLE (Humanity's Last Exam) benchmark, which was just released a few weeks ago, it scores a whopping 26.6 percent! When HLE was released (our coverage here) all the way back at... checks notes... January 23 of this year, the top reasoning models at the time (o1, R1) scored just under 10%.

O3-mini and Deep Research now score 13% and 26.6% respectively, which means both that AI is advancing like crazy, but also... that maybe calling this "last exam" was a bit premature?
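Before we move on, for the curious: here's a toy sketch of what a Deep-Research-style loop looks like conceptually, the search, read, reason, refine, backtrack shape described above. To be clear, this is NOT OpenAI's implementation; the three helpers are dummy stand-ins for a real search API and a real reasoning model.

```python
# Toy sketch of a Deep-Research-style agent loop: search, collect notes, then
# decide whether to answer or refine the query. NOT OpenAI's implementation;
# search_web(), read_page() and llm() are dummy stand-ins.
def search_web(query: str) -> list[str]:
    return [f"https://example.com/result-{abs(hash(query)) % 100}"]

def read_page(url: str) -> str:
    return f"(contents of {url})"

def llm(prompt: str) -> str:
    # A real implementation would call a reasoning model here.
    return "FINAL: placeholder answer synthesized from the notes above"

def research(question: str, max_steps: int = 5) -> str:
    notes, query = [], question
    for _ in range(max_steps):
        notes += [read_page(url) for url in search_web(query)]
        decision = llm(
            f"Question: {question}\nNotes so far: {notes}\n"
            "Reply with either FINAL: <answer> or SEARCH: <refined query>."
        )
        if decision.startswith("FINAL:"):               # confident enough to answer
            return decision.removeprefix("FINAL:").strip()
        query = decision.removeprefix("SEARCH:").strip()  # backtrack / refine
    return "ran out of budget; best-effort answer from notes"

print(research("What's the best taqueria in Mexico City?"))
```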
What a week, folks, what a week! Buckle up, because ThursdAI just dropped, and this one's a doozy. We're talking seismic shifts in the open source world, a potential game-changer from DeepSeek AI that's got everyone buzzing, and oh yeah, just a casual $500 BILLION infrastructure project announcement. Plus, OpenAI finally pulled the trigger on "Operator," their agentic browser thingy – though getting it to actually operate proved to be a bit of a live show adventure, as you'll hear. This week felt like one of those pivotal moments in AI, a real before-and-after kind of thing. DeepSeek's R1 hit the open source scene like a supernova, and suddenly, top-tier reasoning power is within reach for anyone with a Mac and a dream. And then there's OpenAI's Operator, promising to finally bridge the gap between chat and action. Did it live up to the hype? Well, let's just say things got interesting. As I'm writing this, the White House just published that an Executive Order on AI was signed as well, what a WEEK.

Open Source AI Goes Nuclear: DeepSeek R1 is HERE!
Hold onto your hats, open source AI just went supernova! This week, the Chinese Whale Bros – DeepSeek AI, that quant trading firm turned AI powerhouse – dropped a bomb on the community in the best way possible: R1, their reasoning model, is now open source under the MIT license! As I said on the show, "Open source AI has never been as hot as this week."

This isn't just a model, folks. DeepSeek unleashed a whole arsenal: two full-fat R1 models (DeepSeek R1 and DeepSeek R1-Zero), and a whopping six distilled finetunes based on Qwen (1.5B, 7B, 14B, and 32B) and Llama (8B, 70B). One stat that blew my mind, and Nisten's for that matter, is that DeepSeek-R1-Distill-Qwen-1.5B, the tiny 1.5 billion parameter model, is outperforming GPT-4o and Claude-3.5-Sonnet on math benchmarks! "This 1.5 billion parameter model that now does this. It's absolutely insane," I exclaimed on the show. We're talking 28.9% on AIME and 83.9% on MATH. Let that sink in. A model you can probably run on your phone is schooling the big boys in math.

License-wise, it's MIT, which as Nisten put it, "MIT is like a jailbreak to the whole legal system, pretty much. That's what most people don't realize. It's like, this is, it's not my problem. You're a problem now." Basically, do whatever you want with it. Distill it, fine-tune it, build Skynet – it's all fair game.

And the vibes? "Vibes are insane," as I mentioned on the show. Early benchmarks are showing R1 models trading blows with o1-preview and o1-mini, and even nipping at the heels of the full-fat o1 in some areas. Check out these numbers:

And the price? Forget about it. We're talking 50x cheaper than o1 currently. The DeepSeek R1 API is priced at $0.14 / 1M input tokens and $2.19 / 1M output tokens, compared to OpenAI's o1 at $15.00 / 1M input and a whopping $60.00 / 1M output. Suddenly, high-quality reasoning is democratized.

LDJ highlighted the "aha moment" in DeepSeek's paper, where they talk about how reinforcement learning enabled the model to re-evaluate its approach and "think more." It seems like simple RL scaling, combined with a focus on reasoning, is the secret sauce. No fancy Monte Carlo Tree Search needed, apparently!

But the real magic of open source is what the community does with it. Pietro Schirano joined us to talk about his "Retrieval Augmented Thinking" (RAT) approach, where he extracts the thinking process from R1 and transplants it to other models. "And what I found out is actually by doing so, you may even like smaller, quote unquote, you know, less intelligent model actually become smarter," Pietro explained. Frankenstein models, anyone? (John Lindquist has a tutorial on how to do it here)
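Here's a rough sketch of the RAT idea as I understand it, not Pietro's actual code: ask an R1-style model for its reasoning, pull out the `<think>...</think>` block, and hand that reasoning to a smaller model as extra context. The model names are placeholders, and this assumes one OpenAI-compatible provider serves both models.

```python
# Rough sketch of "Retrieval Augmented Thinking": harvest R1's <think> block
# and feed it to a smaller model as context. Not Pietro's actual code; the
# model names are placeholders, and one OpenAI-compatible endpoint is assumed.
import re
from openai import OpenAI

client = OpenAI()  # point base_url / api_key at your provider

def extract_thinking(text: str) -> str:
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

question = "A farmer has 17 sheep; all but 9 run away. How many are left?"

r1_out = client.chat.completions.create(
    model="deepseek-reasoner",  # placeholder for an R1-style endpoint
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

thinking = extract_thinking(r1_out or "")

answer = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder for the smaller "student" model
    messages=[{
        "role": "user",
        "content": f"Here is some reasoning to consider:\n{thinking}\n\nNow answer: {question}",
    }],
).choices[0].message.content
print(answer)
```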
"And what I found out is actually by doing so, you may even like smaller, quote unquote, you know, less intelligent model actually become smarter," Pietro explained. Frankenstein models, anyone? (John Lindquist has a tutorial on how to do it here)And then there's the genius hack from Voooogel, who figured out how to emulate a "reasoning_effort" knob by simply replacing the "end" token with "Wait, but". "This tricks the model into keeps thinking," as I described it. Want your AI to really ponder the meaning of life (or just 1+1)? Now you can, thanks to open source tinkering.Georgi Gerganov, the legend behind llama.cpp, even jumped in with a two-line snippet to enable speculative decoding, boosting inference speeds on the 32B model on my Macbook from a sluggish 5 tokens per second to a much more respectable 10-11 tokens per second. Open source collaboration at its finest and it's only going to get better! Thinking like a NeuroticMany people really loved the way R1 thinks, and what I found astonishing is that I just sent "hey" and the thinking went into a whole 5 paragraph debate of how to answer, a user on X answered with "this is Woody Allen-level of Neurotic" which... nerd sniped me so hard! I used Hauio Audio (which is great!) and ByteDance latentSync and gave R1 a voice! It's really something when you hear it's inner monologue being spoken out like this! ByteDance Enters the Ring: UI-TARS Controls Your PCNot to be outdone in the open source frenzy, ByteDance, the TikTok behemoth, dropped UI-TARS, a set of models designed to control your PC. And they claim SOTA performance, beating even Anthropic's computer use models and, in some benchmarks, GPT-4o and Claude.UI-TARS comes in 2B, 7B, and 72B parameter flavors, and ByteDance even released desktop apps for Mac and PC to go along with them. "They released an app it's called the UI TARS desktop app. And then, this app basically allows you to Execute the mouse clicks and keyboard clicks," I explained during the show.While I personally couldn't get the desktop app to work flawlessly (quantization issues, apparently), the potential is undeniable. Imagine open source agents controlling your computer – the possibilities are both exciting and slightly terrifying. As Nisten wisely pointed out, "I would use another machine. These things are not safe to tell people. I might actually just delete your data if you, by accident." Words to live by, folks.LDJ chimed in, noting that UI-TARS seems to excel particularly in operating system-level control tasks, while OpenAI's leaked "Operator" benchmarks might show an edge in browser control. It's a battle for desktop dominance brewing in open source!Noting that the common benchmark between Operator and UI-TARS is OSWorld, UI-Tars launched with a SOTA Humanity's Last Exam: The Benchmark to BeatSpeaking of benchmarks, a new challenger has entered the arena: Humanity's Last Exam (HLE). A cool new unsaturated bench of 3,000 challenging questions across over a hundred subjects, crafted by nearly a thousand subject matter experts from around the globe. "There's no way I'm answering any of those myself. I need an AI to help me," I confessed on the show.And guess who's already topping the HLE leaderboard? You guessed it: DeepSeek R1, with a score of 9.4%! "Imagine how hard this benchmark is if the top reasoning models that we have right now... are getting less than 10 percent completeness on this," MMLU and Math are getting saturated? HLE is here to provide a serious challenge. 
Get ready to hear a lot more about HLE, folks. Big CO LLMs + APIs: Google's Gemini Gets a Million-Token Brain While open source was stealing the show, the big companies weren't completely silent. Google quietly dropped an update to Gemini Flash Thinking, their experimental reasoning model, and it's a big one. We're talking a 1 million token context window and code execution capabilities now baked in! "This is Google's scariest model by far ever built ever," Nisten declared. "This thing, I don't like how good it is. This smells AGI-ish." High praise, and high concern, coming from Nisten! Benchmarks are showing significant performance jumps in math and science evals, and the speed is, as Nisten put it, "crazy usable." They have enabled the whopping 1M context window for the new Gemini Flash 2.0 Thinking Experimental (long ass name, maybe let's call it G1?) and I agree, it's really really good! And unlike some other reasoning models cough OpenAI cough, Gemini Flash Thinking shows you its thinking process! You can actually see the chain of thought unfold, which is incredibly valuable for understanding and debugging. Google's Gemini is quietly becoming a serious contender in the reasoning race (especially with Noam Shazeer being responsible for it!) OpenAI's "Operator" - Agents Are (Almost) Here The moment we were all waiting for (or at least, I was): OpenAI finally unveiled Operator, their first foray into Level 3 Autonomy - agentic capabilities with ChatGPT. Sam Altman himself hyped it up as "AI agents are AI systems that can do work for you. You give them a task and they go off and do it." Sounds amazing, right? Operator is built on a new model called CUA (Computer Using Agent), trained on top of GPT-4o, and it's designed to control a web browser in the cloud, just like a human would, using screen pixels, mouse, and keyboard. "This is just using screenshots, no API, nothing, just working," one of the OpenAI presenters emphasized. They demoed Operator booking restaurant reservations on OpenTable, ordering groceries on Instacart, and even trying to buy Warriors tickets on StubHub (though that demo got a little… glitchy). The idea is that you can delegate tasks to Operator, and it'll go off and handle them in the background, notifying you when it needs input or when the task is complete. As I'm writing these words, I have one Operator running trying to get me some fried rice, and another one trying to book me a summer vacation with the kids, finding some options and reporting back on what it found. Benchmarks-wise, OpenAI shared numbers for OSWorld (38.1%) and WebArena (58.1%), showing Operator outperforming previous SOTA but still lagging behind human performance. "Still a way to go," as they admitted. But the potential is massive. The catch? Operator is initially launching in the US for Pro users only, and even then, it wasn't exactly smooth sailing. I immediately paid the $200/mo to try it out (pro mode didn't convince me, unlimited SORA videos didn't either, but Operator definitely did - SOTA agents from OpenAI are definitely something I must try!) and my first test? Writing a tweet
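Since a few folks asked what that "reasoning_effort" trick from the open source section actually looks like in practice, here's a minimal sketch of the idea. To be clear, this is my own rough reconstruction, not Voooogel's exact code: it assumes an R1-style distill that wraps its chain of thought in `<think>...</think>` tags, and the model name is just a placeholder for whichever distill fits on your machine.

```python
# A hedged sketch: emulating a "reasoning_effort" knob for an R1-style distill by
# refusing to let the model close its <think> block and splicing in "Wait, but".
# Assumptions: the model emits <think>...</think>, and the checkpoint below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # swap for whatever distill you run
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def generate_with_effort(prompt: str, effort: int = 2, max_new_tokens: int = 512) -> str:
    """Each time the model tries to close its <think> block, splice in 'Wait, but'
    and make it keep reasoning; `effort` controls how many extra rounds we force."""
    messages = [{"role": "user", "content": prompt}]
    text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    for _ in range(effort):
        # the chat template already contains the special tokens, so don't add them again
        ids = tok(text, return_tensors="pt", add_special_tokens=False).to(model.device)
        out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.6)
        chunk = tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=False)
        chunk = chunk.replace(tok.eos_token, "") if tok.eos_token else chunk
        if "</think>" not in chunk:          # it never tried to stop thinking
            return text + chunk
        # keep only the thinking so far, drop the closing tag, and nudge it to keep going
        text = text + chunk.split("</think>")[0] + "\nWait, but "
    # final pass: let the model close the thought and actually answer
    ids = tok(text, return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(**ids, max_new_tokens=max_new_tokens)
    return tok.decode(out[0], skip_special_tokens=True)

print(generate_with_effort("What is 1 + 1? Think carefully.", effort=3))
```

Whether the extra pondering actually improves answers will depend on the task, but it's exactly the kind of knob you only get to turn because the weights are open.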
Hey folks, Alex here, writing this from the beautiful Vancouver BC, Canada. I'm here for NeurIPS 2024, the biggest ML conference of the year, and let me tell you, this was one hell of a week to not be glued to the screen. After last week's banger week, with OpenAI kicking off their 12 days of releases by releasing o1 full and pro mode during ThursdAI, things went parabolic. It seems that all the AI labs decided to just dump EVERYTHING they have before the holidays?
Well well well, December is finally here, we're about to close out this year (and have just flown past the second anniversary of ChatGPT
Hey y'all, Happy Thanksgiving to everyone who celebrates, and thank you for being a subscriber, I truly appreciate each and every one of you! We had a blast on today's celebratory stream, especially given that today's "main course" was the amazing open sourcing of a reasoning model from Qwen, and we had Junyang Lin with us again to talk about it! It's the first open source reasoning model that you can run on your machine, one that beats a 405B model and comes close to o1 on some metrics
This week is a very exciting one in the world of AI news, as we get 3 SOTA models: one in overall LLM rankings, one in OSS coding and one in OSS voice, plus a bunch of breaking news during the show (which we reacted to live on the pod, and as we're now doing video, you can see us freak out in real time at 59:32).

Chapters:
* 00:00 Welcome to ThursdAI
* 00:25 Meet the Hosts
* 02:38 Show Format and Community
* 03:18 TLDR Overview
* 04:01 Open Source Highlights
* 13:31 Qwen Coder 2.5 Release
* 14:00 Speculative Decoding and Model Performance
* 22:18 Interactive Demos and Artifacts
* 28:20 Training Insights and Future Prospects
* 33:54 Breaking News: Nexus Flow
* 36:23 Exploring Athene v2 Agent Capabilities
* 36:48 Understanding ArenaHard and Benchmarking
* 40:55 Scaling and Limitations in AI Models
* 43:04 Nexus Flow and Scaling Debate
* 49:00 Open Source LLMs and New Releases
* 52:29 FrontierMath Benchmark and Quantization Challenges
* 58:50 Gemini Experimental 1114 Release and Performance
* 01:11:28 LLM Observability with Weave
* 01:14:55 Introduction to Tracing and Evaluations
* 01:15:50 Weave API Toolkit Overview
* 01:16:08 Buzz Corner: Weights & Biases
* 01:16:18 Nous Forge Reasoning API
* 01:26:39 Breaking News: OpenAI's New MacOS Features
* 01:27:41 Live Demo: ChatGPT Integration with VS Code
* 01:34:28 Ultravox: Real-Time AI Conversations
* 01:42:03 Tilde Research and Stargazer Tool
* 01:46:12 Conclusion and Final Thoughts

This week there was also a debate online about whether deep learning (and "scale is all you need") has hit a wall, with folks like Ilya Sutskever being cited by publications claiming it has, and folks like Yann LeCun saying "I told you so". TL;DR? Multiple huge breakthroughs later, both Oriol from DeepMind and Sam Altman are saying "what wall?" and Heiner from X.ai is saying "skill issue"; there are no walls in sight, despite some tech journalists' love of pretending there are. Also, what happened to Yann?
Hey everyone, Happy Halloween! Alex here, coming to you live from my mad scientist lair! For the first-ever live video stream of ThursdAI, I dressed up as a mad scientist and had my co-host, Fester the AI-powered Skeleton, join me (as well as my usual cohosts haha) in a very energetic and hopefully entertaining video stream! Since it's Halloween today, Fester (and I) have a very busy schedule, so no super-length ThursdAI newsletter today; as we're still not in the realm of Gemini being able to write a decent draft that takes everything we talked about and covers all the breaking news, I'm afraid I will have to wish you a Happy Halloween and ask that you watch/listen to the episode. The TL;DR and show links from today don't cover all the breaking news, but the major things we saw today (and caught live on the show as Breaking News) were: ChatGPT now has search, and Gemini has grounded search as well (seems like OpenAI's streak of releasing something right before Google announces it continues). Here's a quick trailer of the major things that happened: This weeks buzz - Halloween AI toy with Weave In this weeks buzz, my long awaited Halloween project is finally live and operational! I've posted a public Weave dashboard here and the code (that you can run on your mac!) here. Really looking forward to seeing all the amazing costumes the kiddos come up with and how Gemini will be able to respond to them, follow along! Ok and finally my raw TL;DR notes and links for this week. Happy halloween everyone, I'm running off to spook the kiddos (and of course record and post about it!) ThursdAI - Oct 31 - TL;DR TL;DR of all topics covered:* Open Source LLMs:* Microsoft's OmniParser: SOTA UI parsing (MIT Licensed)
Hey all, Alex here, coming to you from the (surprisingly) sunny Seattle, with just a mind-boggling week of releases. Really, just on Tuesday there was so much news already! I had to post a recap thread, something I do usually after I finish ThursdAI! From Anthropic reclaiming close-second sometimes-first AI lab position + giving Claude the wheel in the form of computer use powers, to more than 3 AI video generation updates with open source ones, to Apple updating Apple Intelligence beta, it's honestly been very hard to keep up, and again, this is literally part of my job! But once again I'm glad that we were able to cover this in ~2hrs, including multiple interviews with returning co-hosts ( Simon Willison came back, Killian came back) so definitely if you're only a reader at this point, listen to the show! Ok as always (recently) the TL;DR and show notes at the bottom (I'm trying to get you to scroll through ha, is it working?) so grab a bucket of popcorn, let's dive in
Hey folks, Alex here from Weights & Biases, and this week has been absolutely bonkers. From robots walking among us to rockets landing on chopsticks (well, almost), the future is feeling palpably closer. And if real-world robots and reusable spaceship boosters weren't enough, the open-source AI community has been cooking, dropping new models and techniques faster than a Starship launch. So buckle up, grab your space helmet and noise-canceling headphones (we'll get to why those are important!), and let's blast off into this week's AI adventures!TL;DR and show-notes + links at the end of the post
Hey Folks, we are finally due for a "relaxing" week in AI, no more HUGE company announcements (if you don't consider Meta Movie Gen huge), no conferences or dev days, and some time for Open Source projects to shine (while we all wait for Opus 3.5 to shake things up). This week was very multimodal on the show: we covered 2 new video models, one that's tiny and open source and one massive one from Meta that is aiming for SORA's crown, plus 2 new VLMs, one from our friends at REKA that understands videos and audio, while the other from Rhymes is Apache 2 licensed, and we had a chat with Kwindla Kramer about the OpenAI Realtime API and its shortcomings, and voice AIs in general. All right, let's TL;DR and show notes, and we'll start with the 2 Nobel prizes in AI
Hey, it's Alex. Ok, so mind is officially blown. I was sure this week was going to be wild, but I didn't expect everyone else besides OpenAI to pile on, exactly on ThursdAI. I'm back from Dev Day (number 2) and still processing, and wanted to actually do a recap by humans, not just the NotebookLM one I posted during the keynote itself (which was awesome and scary in a "will AI replace me as a podcaster" kind of way), and it was incredible to have Simon Willison, who was sitting just behind me for most of Dev Day, join me for the recap! But then the news kept coming: OpenAI released Canvas, which is a whole new way of interacting with chatGPT, BFL released a new Flux version that's 8x faster, Rev released a Whisper-killer ASR that does diarization, and Google released Gemini 1.5 Flash 8B, and said that with prompt caching (which OpenAI now also has, yay) this will cost a whopping $0.01 / Mtok. That's 1 cent per million tokens, for a multimodal model with a 1 million context window.
Hey, Alex here. Super quick, as I'm still attending Dev Day, but I didn't want to leave you hanging (if you're a paid subscriber!), I have decided to outsource my job and give the amazing podcasters of NoteBookLM the whole transcript of the opening keynote of OpenAI Dev Day.You can see a blog of everything they just posted hereHere's a summary of all what was announced:* Developer-Centric Approach: OpenAI consistently emphasized the importance of developers in their mission to build beneficial AGI. The speaker stated, "OpenAI's mission is to build AGI that benefits all of humanity, and developers are critical to that mission... we cannot do this without you."* Reasoning as a New Frontier: The introduction of the GPT-4 series, specifically the "O1" models, marks a significant step towards AI with advanced reasoning capabilities, going beyond the limitations of previous models like GPT-3.* Multimodal Capabilities: OpenAI is expanding the potential of AI applications by introducing multimodal capabilities, particularly focusing on real-time speech-to-speech interaction through the new Realtime API.* Customization and Fine-Tuning: Empowering developers to customize models is a key theme. OpenAI introduced Vision for fine-tuning with images and announced easier access to fine-tuning with model distillation tools.* Accessibility and Scalability: OpenAI demonstrated a commitment to making AI more accessible and cost-effective for developers through initiatives like price reductions, prompt caching, and model distillation tools.Important Ideas and Facts:1. The O1 Models:* Represent a shift towards AI models with enhanced reasoning capabilities, surpassing previous generations in problem-solving and logical thought processes.* O1 Preview is positioned as the most powerful reasoning model, designed for complex problems requiring extended thought processes.* O1 Mini offers a faster, cheaper, and smaller alternative, particularly suited for tasks like code debugging and agent-based applications.* Both models demonstrate advanced capabilities in coding, math, and scientific reasoning.* OpenAI highlighted the ability of O1 models to work with developers as "thought partners," understanding complex instructions and contributing to the development process.Quote: "The shift to reasoning introduces a new shape of AI capability. The ability for our model to scale and correct the process is pretty mind-blowing. So we are resetting the clock, and we are introducing a new series of models under the name O1."2. Realtime API:* Enables developers to build real-time AI experiences directly into their applications using WebSockets.* Launches with support for speech-to-speech interaction, leveraging the technology behind ChatGPT's advanced voice models.* Offers natural and seamless integration of voice capabilities, allowing for dynamic and interactive user experiences.* Showcased the potential to revolutionize human-computer interaction across various domains like driving, education, and accessibility.Quote: "You know, a lot of you have been asking about building amazing speech-to-speech experiences right into your apps. Well now, you can."3. 
Vision, Fine-Tuning, and Model Distillation:* Vision introduces the ability to use images for fine-tuning, enabling developers to enhance model performance in image understanding tasks.* Fine-tuning with Vision opens up opportunities in diverse fields such as product recommendations, medical imaging, and autonomous driving.* OpenAI emphasized the accessibility of these features, stating that "fine-tuning with Vision is available to every single developer."* Model distillation tools facilitate the creation of smaller, more efficient models by transferring knowledge from larger models like O1 and GPT-4.* This approach addresses cost concerns and makes advanced AI capabilities more accessible for a wider range of applications and developers.Quote: "With distillation, you take the outputs of a large model to supervise, to teach a smaller model. And so today, we are announcing our own model distillation tools."4. Cost Reduction and Accessibility:* OpenAI highlighted its commitment to lowering the cost of AI models, making them more accessible for diverse use cases.* Announced a 90% decrease in cost per token since the release of GPT-3, emphasizing continuous efforts to improve affordability.* Introduced prompt caching, automatically providing a 50% discount for input tokens the model has recently processed.* These initiatives aim to remove financial barriers and encourage wider adoption of AI technologies across various industries.Quote: "Every time we reduce the price, we see new types of applications, new types of use cases emerge. We're super far from the price equilibrium. In a way, models are still too expensive to be bought at massive scale."Conclusion:OpenAI DevDay conveyed a strong message of developer empowerment and a commitment to pushing the boundaries of AI capabilities. With new models like O1, the introduction of the Realtime API, and a dedicated focus on accessibility and customization, OpenAI is paving the way for a new wave of innovative and impactful AI applications developed by a global community. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
Hey everyone, it's Alex (still traveling!), and oh boy, what a week again! Advanced Voice Mode is finally here from OpenAI, Google updated their Gemini models in a huge way and then Meta announced MultiModal LlaMas and on-device mini Llamas (and we also got a "better"? multimodal from Allen AI called MOLMO!) From a Weights & Biases perspective, our hackathon was a success this weekend, and then I went down to Menlo Park for my first Meta Connect conference, full of news and updates, and I will do a full recap here as well. Overall another crazy week in AI, and it seems that everyone is trying to rush something out the door before OpenAI Dev Day next week (which I'll cover as well!) Get ready, folks, because Dev Day is going to be epic! TL;DR of all topics covered: * Open Source LLMs * Meta Llama 3.2 Multimodal models (11B & 90B) (X, HF, try free)* Meta Llama 3.2 tiny models 1B & 3B parameters (X, Blog, download)* Allen AI releases MOLMO - open SOTA multimodal AI models (X, Blog, HF, Try It)* Big CO LLMs + APIs* OpenAI releases Advanced Voice Mode to all & Mira Murati leaves OpenAI * Google updates Gemini 1.5-Pro-002 and 1.5-Flash-002 (Blog)* This weeks Buzz * Our free course is LIVE - more than 3000 already started learning how to build advanced RAG++* Sponsoring tonight's AI Tinkerers in Seattle, if you're in Seattle, come through for my demo* Voice & Audio* Meta also launches voice mode (demo)* Tools & Others* Project ORION - holographic glasses are here! (link) Meta gives us new LLaMas and AI hardware LLama 3.2 Multimodal 11B and 90B This was by far the biggest open source release of this week (tho see below, it may not be the "best"), as a rumored release finally came out, and Meta has given our Llama eyes! Coming with 2 versions (well, 4 if you count the base models which they also released), these new MultiModal LLaMas were trained with an adapter architecture, keeping the underlying text models the same, and placing a vision encoder that was trained and finetuned separately on top. "LLama 90B is among the best open-source multimodal models available" — Meta team at launch These new vision adapters were trained on a massive 6 Billion images, including synthetic data generation by 405B for questions/captions, and finetuned with a subset of 600M high quality image pairs. Unlike the rest of their models, the Meta team did NOT claim SOTA on these models, and the benchmarks are very good but not the best we've seen (Qwen 2 VL from a couple of weeks ago, and MOLMO from today, beat it on several benchmarks). With text-only inputs, the Llama 3.2 Vision models are functionally the same as the Llama 3.1 Text models; this allows the Llama 3.2 Vision models to be a drop-in replacement for Llama 3.1 8B/70B with added image understanding capabilities. It seems like these models don't support multi-image or video inputs (unlike Pixtral, for example), nor tool use with images. Meta will also release these models on meta.ai and every other platform, and they cited a crazy 500 million monthly active users of their AI services across all their apps
Hey folks, Alex here, back with another ThursdAI recap – and let me tell you, this week's episode was a whirlwind of open-source goodness, mind-bending inference techniques, and a whole lotta talk about talking AIs! We dove deep into the world of LLMs, from Alibaba's massive Qwen 2.5 drop to the quirky, real-time reactions of Moshi. We even got a sneak peek at Nous Research's ambitious new project, Forge, which promises to unlock some serious LLM potential. So grab your pumpkin spice latte (it's that time again isn't it?
March 14th, 2023 was the day ThursdAI was born, it was also the day OpenAI released GPT-4, and I jumped into a Twitter space and started chaotically reacting together with other folks about what a new release of a paradigm shifting model from OpenAI means, what are the details, the new capabilities. Today, it happened again! Hey, it's Alex, I'm back from my mini vacation (pic after the signature) and boy am I glad I decided to not miss September 12th! The long rumored
Welcome back everyone, can you believe it's another ThursdAI already? And can you believe me when I tell you that friends of the pod Matt Shumer & Sahil from Glaive.ai just dropped a LLama 3.1 70B finetune that you can download and that will outperform Claude Sonnet 3.5 while running locally on your machine? Today was a VERY heavy Open Source focused show; we had a great chat w/ Niklas, the leading author of OLMoE, a new and 100% open source MoE from Allen AI, a chat with Eugene (pico_creator) about RWKV being deployed to over 1.5 billion devices with Windows updates, and a lot more. In the realm of the big companies, Elon shook the world of AI by turning on the biggest training cluster, called Colossus (100K H100 GPUs), which was built in just 122 days
Hey, for the last time during summer of 2024, welcome to yet another edition of ThursdAI, and happy skynet self-awareness day for those who keep track :) This week, Cerebras broke the world record for fastest LLama 3.1 70B/8B inference (and came on the show to talk about it), Google updated 3 new Geminis, Anthropic opened artifacts to all, 100M context windows are possible, and Qwen beats SOTA on vision models + much more! As always, this weeks newsletter is brought to you by Weights & Biases, did I mention we're doing a hackathon in SF on September 21/22 and that we have an upcoming free RAG course w/ Cohere & Weaviate? TL;DR* Open Source LLMs * Nous DisTrO - Distributed Training (X, Report)* NousResearch/hermes-function-calling-v1 open sourced - (X, HF)* LinkedIn Liger-Kernel - One line to make training 20% faster & 60% more memory efficient (Github)* Cartesia - Rene 1.3B LLM SSM + Edge Apache 2 acceleration (X, Blog)* Big CO LLMs + APIs* Cerebras launches the fastest AI inference - 447t/s LLama 3.1 70B (X, Blog, Try It)* Google - Gemini 1.5 Flash 8B & new Gemini 1.5 Pro/Flash (X, Try it)* Google adds Gems & Imagen to Gemini paid tier* Anthropic artifacts available to all users + on mobile (Blog, Try it)* Anthropic publishes their system prompts with model releases (release notes)* OpenAI has project Strawberry coming this fall (via The Information)* This weeks Buzz* WandB Hackathon hackathon hackathon (Register, Join)* Also, we have a new RAG course w/ Cohere and Weaviate (RAG Course)* Vision & Video* Zhipu AI CogVideoX - 5B Video Model w/ less than 10GB of VRAM (X, HF, Try it)* Qwen-2 VL 72B, 7B, 2B - new SOTA vision models from QWEN (X, Blog, HF)* AI Art & Diffusion & 3D* GameNgen - completely generated (not rendered) DOOM with SD1.4 (project)* FAL new LORA trainer for FLUX - trains under 5 minutes (Trainer, Coupon for ThursdAI)* Tools & Others* SimpleBench from AI Explained - closely matches human experience (simple-bench.com) Open Source Let's be honest - ThursdAI is a love letter to the open-source AI community, and this week was packed with reasons to celebrate. Nous Research DisTrO + Function Calling V1 Nous Research was on fire this week (aren't they always?) and they kicked off the week with the release of DisTrO, which is a breakthrough in distributed training. You see, while LLM training requires a lot of hardware, it also requires a lot of network bandwidth between the different GPUs, even within the same data center. Proprietary networking solutions like Nvidia NVLink, and more open standards like Ethernet, work well within the same datacenter, but training across different GPU clouds has been unimaginable until now. Enter DisTrO, a new decentralized training approach by the mad geniuses at Nous Research, in which they reduced the required bandwidth to train a 1.2B param model from 74.4GB to just 86MB (857x)! This can have massive implications for training across compute clusters, doing shared training runs, optimizing costs and efficiency and democratizing LLM training access! So don't sell your old GPUs just yet, someone may just come up with a folding@home but for training the largest open source LLM, and it may just be Nous! Nous Research also released their function-calling-v1 dataset (HF) that was used to train Hermes-2, and we had InterstellarNinja, who authored that dataset, join the show and chat about it.
This is an incredible unlock for the open source community, as function calling has become a de-facto standard now. Shout out to the Glaive team as well for their pioneering work that paved the way! LinkedIn's Liger Kernel: Unleashing the Need for Speed (with One Line of Code) What if I told you that whatever training code you run, you can add 1 line of code, and it'll run 20% faster and require 60% less memory? This is basically what LinkedIn researchers released this week with Liger Kernel. Yes, you read that right, LinkedIn, as in the website you post career-related updates on! (There's a quick sketch of what that one line looks like at the end of this recap.) "If you're doing any form of finetuning, using this is an instant win" - Wing Lian (Axolotl). This absolutely bonkers improvement in training LLMs now works smoothly with Flash Attention, PyTorch FSDP and DeepSpeed. If you want to read more about the implementation of the Triton kernels, you can see a deep dive here; I just wanted to bring this to your attention, even if you're not technical, because efficiency jumps like these are happening all the time. We are used to seeing them in capabilities / intelligence, but they are also happening on the algorithmic/training/hardware side, and it's incredible to see! Huge shoutout to Byron and team at LinkedIn for this unlock, check out their Github if you want to get involved! Qwen-2 VL - SOTA image and video understanding + open weights mini VLM You may already know that we love the folks at Qwen here on ThursdAI, not only because Junyang Lin is a frequent co-host and we get to hear about their releases as soon as they come out (they seem to be releasing them on Thursdays around the time of the live show, I wonder why!), but also because they are committed to open source, and have released 2 models, 7B and 2B, with a complete Apache 2 license! First of all, their Qwen-2 VL 72B model is now SOTA on many benchmarks, beating GPT-4, Claude 3.5 and other much bigger models. This is insane. I literally had to pause Junyang and repeat what he said: this is a 72B param model that beats GPT-4o on document understanding, on math, on general visual Q&A. Additional Capabilities & Smaller models They have added new capabilities in these models, like being able to handle arbitrary resolutions, but the one I'm most excited about is the video understanding. These models can now understand up to 20 minutes of video sequences, and it's not just "split the video into 10 frames and do image captioning", no, these models understand video progression, and if I understand correctly how they do it, it's quite genius. They embed the video's time progression into the model using a new technique called M-RoPE, which turns the time progression into rotary positional embeddings. Now, the 72B model is currently available via API, but we do get 2 new small models with an Apache 2 license and they are NOT too shabby either! The 7B parameter (HF) and 2B Qwen-2 VL (HF) models are small enough to run completely on your machine, and the 2B parameter one scores better than GPT-4o mini on OCR-bench, for example! I can't wait to finish writing and go play with these models! Big Companies & LLM APIs The biggest news this week came from Cerebras Systems, a relatively unknown company that shattered the world record for LLM inference out of the blue (and came on the show to talk about how they are doing it). Cerebras - fastest LLM inference on wafer scale chips Cerebras has introduced the concept of wafer scale chips to the world; for perspective, a regular microchip is maybe the size of a postage stamp?
GPUs are bigger, but Cerebras makes chips the size of an iPad (72 square inches), the largest commercial chips in the world. And now they've created an inference stack on top of those chips, and showed that they have the fastest inference in the world. How fast? Well, they can serve LLama 3.1 8B at a whopping 1822t/s. No really, these are INSANE speeds; as I was writing this, I copied all the words I had so far, went to inference.cerebras.ai, asked it to summarize, pasted and hit send, and I immediately got a summary! "The really simple explanation is we basically store the entire model, whether it's 8B or 70B or 405B, entirely on the chip. There's no external memory, no HBM. We have 44 gigabytes of memory on chip." - James Wang They not only store the whole model (405B coming soon), but they store it in full fp16 precision as well, so they don't quantize the models. Right now, they are serving it with an 8K token context window, and we had a conversation about their next steps being giving more context to developers. The whole conversation is well worth listening to, James and Ian were awesome to chat with, and while they do have a waitlist as they gradually roll out their release, James said to DM him on X and mention ThursdAI, and he'll put you through, so you'll be able to get an OpenAI compatible API key and be able to test this insane speed. P.S - we also did an independent verification of these speeds, using Weave, and found Cerebras to be quite incredible for agentic purposes, you can read our report here and the Weave dashboard here Anthropic - unlocking just-in-time applications with artifacts for all Well, if you aren't paying for Claude, maybe this will convince you. This week, Anthropic announced that artifacts are available to all users, not only their paid customers. Artifacts are a feature in Claude that is basically a side pane (and from this week, a drawer in their mobile apps) that allows you to see what Claude is building, by rendering the web application almost on the fly. They have also trained Claude in working with that interface, so it knows about the different files etc. Effectively, this turns Claude into a web developer that will build mini web applications (without a backend) for you, on the fly, for any task you can think of. Drop in a design, and it'll build a mock of it; drop some data in a CSV and it'll build an interactive one-time dashboard visualizing that data; or just ask it to build an app helping you split the bill between friends by uploading a picture of a bill. Artifacts are share-able and remixable, so you can build something and share it with friends. So here you go, an artifact I made by dropping my notes into Claude and asking for a Magic 8 Ball that will spit out a random fact from today's editing of ThursdAI. I also provided Claude with an 8 Ball image, but it didn't work due to restrictions, so instead I just uploaded that image to Claude and asked it to recreate it with SVG! And voila, a completely unnecessary app that works! Google's Gemini Keeps Climbing the Charts (But Will It Be Enough?) Sensing a disturbance in the AI force (probably from that Cerebras bombshell), Google rolled out a series of Gemini updates, including a new experimental Gemini 1.5 Pro (0827) with sharper coding skills and logical reasoning. According to LMSys, it's already nipping at the heels of ChatGPT 4o and is number 2! Their Gemini 1.5 Flash model got a serious upgrade, vaulting to the #6 position on the arena.
And to add to the model madness, they even released a Gemini 1.5 Flash 8B parameter version for folks who need that sweet spot between speed and size. Oh, and those long-awaited Gems are finally starting to roll out. But get ready to open your wallet – this feature (preloading Gemini with custom context and capabilities) is a paid-tier exclusive. But hey, at least Imagen-3 is cautiously returning to the image generation game! AI Art & Diffusion Doom Meets Stable Diffusion: AI Dreams in 20FPS Glory (GameNGen) The future of video games is, uh, definitely going to be interesting. Just as everyone thought AI would be conquering Go or Chess, it seems we've stumbled into a different battlefield: first-person shooters.
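Since "add one line of code" sounds too good to be true, here's roughly what that one Liger Kernel line looks like in practice. Treat this as a hedged sketch based on my reading of their README: the `apply_liger_kernel_to_llama` entry point and the checkpoint name are assumptions you should double-check against the repo for your model family and version.

```python
# A minimal sketch of patching a Llama finetune with Liger Kernel.
# Assumption: liger-kernel exposes apply_liger_kernel_to_llama() under
# liger_kernel.transformers (check the repo README for your version / model family).
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama()  # the "one line": swaps HF Llama modules for fused Triton kernels

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",   # placeholder checkpoint, use whatever you're finetuning
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# ...then train exactly as you did before; FSDP, DeepSpeed and Flash Attention keep working.
```

The patch has to run before the model is instantiated, which is why it sits above `from_pretrained`; everything downstream (your Trainer loop, Axolotl config, etc.) stays untouched.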
Hey there, Alex here with an end of summer edition of our show, which did not disappoint. Today is the official anniversary of Stable Diffusion 1.4, can you believe it? It's the second week in a row that we have an exclusive LLM launch on the show (after Emozilla announced Hermes 3 on last week's show), and spoiler alert, we may have something cooking for next week as well! This edition of ThursdAI is brought to you by W&B Weave, our LLM observability toolkit, letting you evaluate LLMs for your own use-case easily. Also this week, we've covered both ends of AI progress, a doomerist CEO saying "Fck Gen AI" vs an 8yo coder, and I continued to geek out on putting myself into memes (I promised I'll stop... at some point), so buckle up, let's take a look at another crazy week: TL;DR* Open Source LLMs * AI21 releases Jamba 1.5 Large / Mini hybrid Mamba MoE (X, Blog, HF)* Microsoft Phi 3.5 - 3 new models including MoE (X, HF)* BFCL 2 - Berkeley Function Calling Leaderboard V2 (X, Blog, Leaderboard)* NVIDIA - Mistral Nemo Minitron 8B - Distilled / Pruned from 12B (HF)* Cohere paper proves - code improves intelligence (X, Paper)* MOHAWK - transformer → Mamba distillation method (X, Paper, Blog)* AI Art & Diffusion & 3D* Ideogram launches v2 - new img diffusion king
Look, these crazy weeks don't seem to stop, and though this week started out a bit slower (while folks were waiting to see how the speculation about certain red-berry-flavored conspiracies would shake out) the big labs are shipping! We've got space uncle Elon dropping an "almost-GPT4" level Grok-2 that's uncensored, has access to real time data on X and can draw all kinds of images with Flux, OpenAI announced a new ChatGPT 4o version (not the one from last week that supported structured outputs, a different one!) and Anthropic dropped something that makes AI Engineers salivate! Oh, and for the second week in a row, ThursdAI live spaces were listened to by over 4K people, which is very humbling, and awesome because, for example, today Nous Research announced Hermes 3 live on ThursdAI before the public heard about it (and I had a long chat w/ Emozilla about it, very well worth listening to) TL;DR of all topics covered: * Big CO LLMs + APIs* xAI releases Grok-2 - frontier level Grok, uncensored + image gen with Flux (
Hold on tight, folks, because THIS week on ThursdAI felt like riding a roller coaster through the wild world of open-source AI - extreme highs, mind-bending twists, and a sprinkle of "wtf is happening?" conspiracy theories for good measure.
Starting Monday, Apple released iOS 18.1 with Apple Intelligence, then Meta dropped SAM-2 (Segment Anything Model 2) and then Google first open sourced Gemma 2 2B and now (just literally 2 hours ago, during the live show) released Gemini 1.5 0801 experimental, which takes #1 on the LMSys arena across multiple categories; to top it all off we also got a new SOTA image diffusion model called FLUX.1 from ex-Stability folks and their new Black Forest Labs. This week on the show, we had Joseph & Piotr Skalski from Roboflow talk in depth about Segment Anything, and as the absolute experts on this topic (Skalski is our returning vision expert), it was an incredible deep dive into the importance of dedicated vision models (not VLMs). We also had Lukas Atkins & Fernando Neto from Arcee AI talk to us about their new DistillKit and explain model distillation in detail, & finally we had Cristiano Giardina, who is one of the lucky few that got access to OpenAI advanced voice mode, + his new friend GPT-4o came on the show as well! Honestly, how can one keep up with all this? By reading ThursdAI of course, that's how. But ⚠️ buckle up, this is going to be a BIG one (I think over 4.5K words, which will mark this as the longest newsletter I've penned, I'm sorry, maybe read this one on 2x?
Holy s**t, folks! I was off for two weeks, last week OpenAI released GPT-4o-mini and everyone was in my mentions saying, Alex, how are you missing this?? And I'm so glad I missed that last week and not this one, because while GPT-4o-mini is incredible (a GPT-4o level distill with incredible speed and almost 99% cost reduction from 2 years ago?) it's not open source. So welcome back to ThursdAI, and buckle up because we're diving into what might just be the craziest week in open-source AI since... well, ever! This week, we saw Meta drop LLAMA 3.1 405B like it's hot (including updated 70B and 8B), Mistral joining the party with their Large V2, and DeepSeek quietly updating their coder V2 to blow our minds. Oh, and did I mention Google DeepMind casually solving math Olympiad problems at silver medal level
Hey all, Alex here… well, not actually here, I'm scheduling this post in advance, which I haven't yet done, because I'm going on vacation! That's right, next week is my birthday
Hey everyone! Happy 4th of July to everyone who celebrates! I celebrated today by having an intimate conversation with 600 of my closest X friends
Hey everyone, sending a quick one today, no deep dive, as I'm still in the middle of AI Engineer World's Fair 2024 in San Francisco (in fact, I'm writing this from the incredible floor 32 presidential suite that the team here got for interviews, media and podcasting, and hey to all the new folks who I've just met during the last two days!) It's been an incredible few days meeting so many ThursdAI community members, listeners and folks who came on the pod! The list honestly is too long, but I've got to meet friends of the pod Maxime Labonne, Wing Lian, Joao Moura (Crew AI), Vik from Moondream, Stefania Druga, not to mention the countless folks who came up and gave high fives, introduced themselves, it was honestly a LOT of fun. (And it's still not over, if you're here, please come and say hi, and let's take a LLM judge selfie together!) On today's show, we recorded extra early because I had to run and play dress up, and boy am I relieved now that both the show and the talk are behind me, and I can go and enjoy the rest of the conference
Hey, this is Alex. Don't you just love it when assumptions about LLMs hitting a wall get shattered left and right, and we get new incredible tools that leapfrog previous state-of-the-art models we'd barely gotten used to from just a few months ago? I SURE DO! Today is one such day. This week was already busy enough, I had a whole 2-hour show packed with releases, and then Anthropic decided to give me a reason to use the #breakingNews button (the one that makes the news-show-like sound on the live show, you should join next time!) and announced Claude Sonnet 3.5, which is their best model, beating Opus while being 2x faster and 5x cheaper! (Also beating GPT-4o and Turbo, so... new king! For how long? ¯\_(ツ)_/¯) Critics are already raving, it's been half a day and they are raving! Ok, let's get to the TL;DR and then dive into Claude 3.5 and a few other incredible things that happened this week in AI!
Happy Apple AI week everyone (well, those of us who celebrate, some don't) as this week we finally got told what Apple is planning to do with this whole generative AI wave and presented Apple Intelligence (which is AI, get it? they are trying to rebrand AI!) This week's pod and newsletter main focus will be Apple Intelligence of course, as it was for most people, judging by how the market reacted ($AAPL grew over $360B in a few days after this announcement) and how many people watched each live stream (10M at the time of this writing watched the WWDC keynote on YouTube, compared to 4.5M for the OpenAI GPT-4o announcement and 1.8M for Google IO). On the pod we also geeked out on new eval frameworks and benchmarks, including a chat with the authors of MixEval, which I wrote about last week, and a new benchmark called LiveBench from Abacus and Yann LeCun. Plus a new video model from Luma and finally SD3, let's go!
Hey everyone, Alex here! Can you believe it's already the end of May? And that 2 huge AI company conferences are behind us (Google IO, MSFT Build) and Apple's WWDC is just ahead, in 10 days! Exciting! I was really looking forward to today's show, had quite a few guests today, I'll add all their socials below the TL;DR so please give them a follow, and if you're only in reading mode of the newsletter, why don't you give the podcast a try
Hello hello everyone, this is Alex, typing these words from beautiful Seattle (really, it only rained once while I was here!) where I'm attending Microsoft's biggest developer conference, BUILD. This week we saw OpenAI get in the news from multiple angles, none of them positive, and Microsoft clapped back at Google from last week with tons of new AI product announcements (CoPilot vs Gemini) and a few new PCs with NPUs (Neural Processing Units) that run alongside the CPU/GPU combo we're familiar with. Those NPUs allow local AI to run on these devices, making them AI native devices! While I'm here I also had the pleasure to participate in the original AI Tinkerers, thanks to my friend Joe Heitzberg who operates and runs aitinkerers.org (of which we are a local branch in Denver), and it was amazing to see tons of folks who listen to ThursdAI + read the newsletter and talk about Weave and evaluations with all of them! (Btw, on the left is Vik from Moondream, which we covered multiple times.) Ok let's get to the news: TL;DR of all topics covered: * Open Source LLMs * HuggingFace commits 10M in ZeroGPU (X)* Microsoft open sources Phi-3 mini, Phi-3 small (7B), Medium (14B) and vision models w/ 128K context (Blog, Demo)* Mistral 7B 0.3 - Base + Instruct (HF)* LMSys created a "hard prompts" category (X)* Cohere for AI releases Aya 23 - 3 models, 101 languages, (X)* Big CO LLMs + APIs* Microsoft Build recap - New AI native PCs, Recall functionality, Copilot everywhere * Will post a dedicated episode to this on Sunday* OpenAI pauses GPT-4o Sky voice because Scarlett Johansson complained* Microsoft AI PCs - Copilot+ PCs (Blog)* Anthropic - Scaling Monosemanticity paper - about mapping the features of an LLM (X, Paper)* Vision & Video* OpenBNB - MiniCPM-Llama3-V 2.5 (X, HuggingFace)* Voice & Audio* OpenAI pauses Sky voice due to ScarJo hiring legal counsel* Tools & Hardware* Humane is looking to sell (blog) Open Source LLMs Microsoft open sources Phi-3 mini, Phi-3 small (7B), Medium (14B) and vision models w/ 128K context (Blog, Demo) Just in time for Build, Microsoft has open sourced the rest of the Phi family of models, specifically the small (7B) and the medium (14B) models, on top of the mini one we already knew as Phi-3. All the models have a small context version (4K and 8K) and a large one that goes up to 128K (tho they recommend using the small one if you don't need that whole context), and all can run on device super quick. These models have an MIT license, so use them as you will, and they deliver incredible performance relative to their size on benchmarks. Phi-3 mini received an interesting split in the vibes: it was really good for reasoning tasks, but not very creative in its writing, so some folks dismissed it, but it's hard to dismiss these new releases, especially when the benchmarks are that great! LMSys just updated their arena to include a hard prompts category (X), which selects for complex, specific and knowledge-based prompts and scores the models on those. Phi-3 mini actually gets a big boost in ELO ranking when filtered on hard prompts and beats GPT-3.5
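If you want to kick the tires on that "runs on device super quick" claim, here's a minimal sketch using the Hugging Face transformers pipeline. The checkpoint name is what I believe Microsoft published for the 128K-context mini model, but double-check the model card; the prompt format and trust_remote_code requirement are assumptions that have shifted between releases.

```python
# Hedged sketch: running Phi-3-mini locally with transformers.
# Assumption: "microsoft/Phi-3-mini-128k-instruct" is the 128K-context mini checkpoint;
# verify the exact repo id and whether trust_remote_code is still required on the model card.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-128k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Give me three reasons small on-device models matter."}]
out = pipe(messages, max_new_tokens=200, do_sample=False)
print(out[0]["generated_text"][-1]["content"])  # last message in the returned chat is the reply
```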
Wow, holy s**t, insane, overwhelming, incredible, the future is here!, "still not there", there are many more words to describe this past week. (TL;DR at the end of the blogpost) I had a feeling it was going to be a big week, and the companies did NOT disappoint, so this is going to be a very big newsletter as well. As you may have read last week, I was very lucky to be in San Francisco the weekend before Google IO, to co-host a hackathon with the Meta LLama-3 team, and it was a blast; I will add my notes on that in This weeks Buzz section. Then on Monday, we all got to watch the crazy announcements from OpenAI, namely a new flagship model called GPT-4o (we were right, it previously was im-also-a-good-gpt2-chatbot) that's twice as fast, 50% cheaper (in English, significantly more so in other languages, more on that later) and is Omni (that's the o), which means it is end to end trained with voice, vision, and text on the inputs, and can generate text, voice and images on the output. A true MMIO (multimodal on inputs and outputs, that's not the official term) is here and it has some very very surprising capabilities that blew us all away, namely the ability to ask the model to "talk faster" or "more sarcasm in your voice" or "sing like a pirate". Though we didn't yet get that functionality with the GPT-4o model, it is absolutely and incredibly exciting. Oh and it's available to everyone for free! That's GPT-4 level intelligence, for free, for everyone, without having to log in! What's also exciting was how immediate it was; apparently not only is the model itself faster (unclear if it's due to newer GPUs or distillation or some other crazy advancements or all of the above), but training an end to end omni-model reduces the latency to an incredibly immediate conversation partner, one that you can interrupt, ask to recover from a mistake, and it can hold a conversation very very well. So well, that indeed it seemed like the Waifu future (digital girlfriends/wives) is very close for some folks who would want it. While we didn't get to try it (we got GPT-4o but not the new voice mode, as Sam confirmed), OpenAI released a bunch of videos of their employees chatting with Omni (that's my nickname, use it if you'd like) and many online highlighted how thirsty / flirty it sounded. I downloaded all the videos for an X thread and I named one girlfriend.mp4, and well, just judge for yourself why. Ok, that's not all that OpenAI updated or shipped, they also updated the Tokenizer, which is incredible news to folks all around, specifically, the rest of the world. The new tokenizer reduces the previous "foreign language tax" by a LOT, making the model way way cheaper for the rest of the world as well (there's a quick token-count sketch just below if you want to see the difference yourself). One last announcement from OpenAI was the desktop app experience, and this one I actually got to use a bit, and it's incredible. MacOS only for now, this app comes with a launcher shortcut (kind of like Raycast) that lets you talk to ChatGPT right then and there, without opening a new tab, without additional interruptions, and it can even understand what you see on the screen, help you understand code, or jokes, or look up information. Here's just one example I just had over at X. And sure, you could always do this with another tab, but the ability to do it without a context switch is a huge win.
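To make that "foreign language tax" point concrete, here's a quick, hedged sketch comparing the two tokenizers with the tiktoken package; "cl100k_base" is the GPT-4-era encoding, "o200k_base" is the GPT-4o one, and the Hebrew sentence is just an arbitrary example I picked.

```python
# Quick sketch: comparing GPT-4-era and GPT-4o tokenizers on non-English text.
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")   # GPT-4 / GPT-3.5 tokenizer
new_enc = tiktoken.get_encoding("o200k_base")    # GPT-4o tokenizer

sample = "שלום חברים, מה קורה השבוע בעולם הבינה המלאכותית?"  # arbitrary Hebrew example
print("cl100k_base tokens:", len(old_enc.encode(sample)))
print("o200k_base tokens: ", len(new_enc.encode(sample)))
# Fewer tokens for the same text means lower cost and more usable context,
# which is exactly why this update matters most outside of English.
```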
OpenAI had to do their demo 1 day before Google IO, but even during the excitement about Google IO, they announced that Ilya is not only alive, but is also departing from OpenAI, which was followed by an announcement from Jan Leike (who co-headed the superalignment team together with Ilya) that he left as well. This, to me, seemed like well-executed timing to dampen the Google news a bit. Google is BACK, backer than ever: Alex's Google IO recap On Tuesday morning I showed up to Shoreline Amphitheatre in Mountain View, together with a creators/influencers delegation, as we all watched the incredible firehose of announcements that Google had prepared for us. TL;DR - Google is adding Gemini and AI into all its products across Workspace (Gmail, Chat, Docs) and into other cloud services like Photos, where you'll now be able to ask your photo library for specific moments. They introduced over 50 product updates and I don't think it makes sense to cover all of them here, so I'll focus on what we do best. "Google will do the Googling for you" Gemini 1.5 Pro is now their flagship model (remember Ultra? where is that?