A daily show about AI made by AI: news, announcements, and research from arXiv, mixed in with some fun. Hosted by Giovani Pete Tizzano, an overly hyped AI enthusiast; Robert, an often unimpressed analyst; Olivia, an overly online reader; and Belinda, a witty research expert.
This episode dives into OpenAI's promising new model, Strawberry, which could revolutionize interactions in ChatGPT. We explore the financial envy Nvidia employees inspire in their Google and Meta counterparts due to lucrative stock options. Google's new Pipe SQL syntax aims to simplify data querying, while concerns about research accessibility are raised. Finally, we discuss the BaichuanSEED and Dolphin models, which highlight advancements in extensible data collection and energy-efficient processing, paving the way for enhanced AI capabilities.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:40 OpenAI Races to Launch Strawberry
03:07 Google, Meta workers envy Nvidia staffers' fat paychecks: ‘Bought a 100K car … all cash'
05:01 Google's New Pipe SQL Syntax
06:12 Fake sponsor
07:47 BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline
09:20 Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
11:09 Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
12:50 Outro
OpenAI's 'Strawberry' AI tackles complex math and programming with enhanced reasoning, while Cerebras claims to have launched the fastest AI inference, enabling real-time applications at competitive prices. The GenCA model revolutionizes avatar creation with photo-realistic, controllable 3D avatars, and the "Build-A-Scene" paper introduces interactive 3D layout control for text-to-image generation, enhancing creative fields with dynamic object manipulation.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
02:02 OpenAI Shows ‘Strawberry' AI to the Feds and Uses It to Develop ‘Orion'
03:23 Cerebras Launches the World's Fastest AI Inference
05:07 Diffusion Models Are Real-Time Game Engines
06:15 Fake sponsor
08:06 The Mamba in the Llama: Distilling and Accelerating Hybrid Models
09:42 GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars
11:16 Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation
13:04 Outro
Grok-2's advancements in speed and accuracy position it as a leading AI model, particularly in math and coding. OpenAI's backing of California's AI bill highlights the critical need for transparency in synthetic content, especially during an election year. The episode features groundbreaking research on the SwiftBrush diffusion model and K-Sort Arena for generative model evaluation. Additionally, the LlamaDuo pipeline offers a practical solution for migrating from cloud-based LLMs to local models, tackling privacy and operational challenges.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:55 grok-2 is Faster and Better
03:32 OpenAI supports California AI bill requiring 'watermarking' of synthetic content
04:53 Fake sponsor
06:45 SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
08:10 SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
09:40 K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
11:24 LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs
13:26 Outro
This episode dives into Salesforce's innovative AI sales agents that automate tasks but risk losing the human touch, NVIDIA's compact yet powerful language model that promises efficiency, groundbreaking research showing how optimized computation can enhance model performance, and insights into compound inference systems revealing the delicate balance in maximizing language model effectiveness.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:49 Salesforce's New Sales AI Agents
03:09 Lightweight Champ: NVIDIA Releases Small Language Model With State-of-the-Art Accuracy
04:52 avante.nvim
05:56 Fake sponsor
07:45 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
09:22 Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
11:15 Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems
13:10 Outro
This episode dives deep into the future of coding, challenging the belief that AI will render developers obsolete. It highlights Meta's stock surge, attributing it to Zuckerberg's compelling AI narrative that captivates investors. The discussion also covers groundbreaking research like Transfusion, which merges text and image processing, and the innovative approach of automated design for intelligent agents. Lastly, it emphasizes the xGen-MM framework's commitment to safety in AI, showcasing the critical need to mitigate harmful behaviors in advanced models.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:28 Amazon cloud chief: Devs may stop coding when AI takes over
02:53 Meta Shares Are Flying High as Zuckerberg Sells His AI Vision
04:34 I've Built My First Successful Side Project, and I Hate It
05:41 Fake sponsor
07:35 Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
09:16 Automated Design of Agentic Systems
10:56 xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
12:44 Outro
OpenAI's SearchGPT is launching with limited access for only 10,000 users, raising questions about trust and the potential risks of generative search products. A comprehensive analysis challenges the belief that Vision Transformers are inefficient, suggesting they can handle higher resolutions effectively. The introduction of Automated Design of Agentic Systems (ADAS) could revolutionize how intelligent agents are created, outperforming traditional hand-designed models. The xGen-MM framework aims to enhance multimodal AI capabilities while prioritizing safety measures to mitigate harmful behaviors.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:43 OpenAI is fresh out of SearchGPT
02:50 From ChatGPT to Gemini: how AI is rewriting the internet
04:32 On the speed of ViTs and CNNs
05:49 Fake sponsor
07:49 JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
09:34 Automated Design of Agentic Systems
11:12 xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
13:01 Outro
This episode dives into the Grok-2 Beta Release, highlighting its advanced reasoning capabilities and competitive edge. We explore Apple's ambitious plans for a $1,000 tabletop robotic home device, set to transform smart home technology. The introduction of ChemVLM marks a breakthrough in chemistry research, effectively integrating chemical images and text. Lastly, InfinityMATH presents a scalable dataset that enhances language models' mathematical reasoning, showcasing impressive performance improvements.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:37 Grok-2 Beta Release
02:58 Apple Aiming to Launch Tabletop Robotic Home Device as Soon as 2026 With Pricing Around $1,000
04:29 Gemlite: Towards Building Custom Low-Bit Fused CUDA Kernels
05:34 Fake sponsor
07:16 Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM
08:55 Generative Photomontage
10:26 InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning
12:22 Outro
This episode dives into Gemini Live's interactive AI capabilities, OpenAI's improved coding benchmark for reliable evaluations, LongWriter's breakthrough in generating ultra-long outputs, and SlotLifter's advancements in 3D object-centric learning. Each topic highlights significant innovations and their implications in the AI landscape.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:48 Gemini makes your mobile device a powerful AI assistant
03:08 New OpenAI Coding Benchmark
04:52 Things I learned from teaching
05:59 Fake sponsor
07:38 Imagen 3
09:05 LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
10:46 SlotLifter: Slot-guided Feature Lifting for Learning Object-centric Radiance Fields
12:22 Outro
Google Meet's new AI note-taking feature could change meeting dynamics, while Trump's claims about Kamala Harris reveal the political implications of AI. The exploration of AI's role in scientific research raises ethical concerns, and cutting-edge papers on ControlNeXt, rStar, and FruitNeRF showcase advancements in image generation, reasoning capabilities, and fruit counting accuracy.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:43 Google Meet will soon be able to take notes for you
02:56 Trump falsely claims Kamala Harris ‘AI'd' her rally crowd size
04:23 The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
05:35 Fake sponsor
07:15 ControlNeXt: Powerful and Efficient Control for Image and Video Generation
08:47 Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
10:41 FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework
12:41 Outro
OpenAI's mysterious "Strawberry" AI model is causing a buzz in the tech world, with rumors of advanced reasoning capabilities. Meta is trying to improve its AI assistants by enlisting celebrities like Awkwafina to give them a more relatable and entertaining vibe. Google DeepMind's research on building a robot capable of playing table tennis at a human level is a remarkable exploration of robotics and sports. Rounding out the episode are two groundbreaking papers: UC Berkeley and Google DeepMind's work on optimally scaling LLM test-time compute, and Harbin Institute of Technology's research on a general-purpose AI agent capable of completing long-horizon tasks.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:35 Sam Altman teases project Strawberry
03:06 Meta courts celebs like Awkwafina to voice AI assistants ahead of Meta Connect
04:58 Achieving Human Level Competitive Robot Table Tennis
06:11 Fake sponsor
08:15 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
09:55 Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks
11:41 UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
13:30 Outro
This episode dives into Nvidia's stock struggles amid rising competition, while also unpacking Meta's AI blunders and the implications of "hallucinations" in tech. We explore cutting-edge superconducting microprocessors that promise unprecedented energy efficiency and highlight groundbreaking AI research, including eavesdropping techniques and advancements in reinforcement learning.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:50 Nvidia Sank Again Today -- Time to Buy the Artificial Intelligence (AI) Growth Stock Hand Over Fist?
03:09 Meta blames hallucinations after its AI said Trump rally shooting didn't happen
04:52 Superconducting Microprocessors? Turns Out They're Ultra-Efficient
06:07 Fake sponsor
07:48 Deep-TEMPEST: Using Deep Learning to Eavesdrop on HDMI from its Unintended Electromagnetic Emanations
09:22 SAPG: Split and Aggregate Policy Gradients
10:45 MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
12:44 Outro
This episode dives into Google's Gemma 2, which claims to outperform GPT-3.5 while tackling responsible AI practices. We explore Black Forest Labs' Flux model, featuring 12 billion parameters and tailored versions for various users. Olivia sheds light on the ethical concerns surrounding the resurgence of pseudoscience in machine learning, particularly physiognomy. Lastly, Belinda reviews critical research on AI safety, advocating for clearer metrics to prevent misleading claims about safety advancements.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:37 Google's tiny AI model bests GPT-3.5
02:48 Announcing Flux by Black Forest Labs: The Next Leap in Text-to-Image Models
04:28 The reanimation of pseudoscience in machine learning and its ethical repercussions
06:06 Fake sponsor
08:04 MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
09:55 Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models
11:41 Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
13:33 Outro
Apple's delay in releasing AI features until October could affect iPhone 16 sales and customer excitement. The tech giant's choice to use Google's TPU chips instead of Nvidia's marks a significant shift in AI hardware competition. Meta's SAM 2 introduces groundbreaking real-time object segmentation with zero-shot generalization, revolutionizing visual content interaction. Additionally, Sony AI's research presents a cost-effective approach to training diffusion models, democratizing access to advanced AI technology.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:54 Apple Intelligence Won't Be Released Until October
03:09 Apple used Google's chips to train two AI models, research paper shows
04:44 A Visual Guide to Quantization
05:38 Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images
06:41 Fake sponsor
08:46 Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
10:28 Theia: Distilling Diverse Vision Foundation Models for Robot Learning
12:27 Outro
OpenAI's new prototype, SearchGPT, promises to combine AI smarts with real-time web information to make search easier. AI has achieved silver-medal standards at the International Mathematical Olympiad, raising questions about the future of mathematics and the role of AI in solving complex problems. The reliability of AI existential risk probabilities is called into question in a thought-provoking article, challenging the authority we often assign to these forecasts and calling for more scrutiny. Three fascinating papers from UNC Chapel Hill, Google DeepMind, and a collaboration between Caltech and NVIDIA explore advancements in theorem proving, balancing fast and slow planning, and aligning large language models with Best-of-N distillation. These papers could transform the way we approach complex problems with language models and streamline the development of LLMs.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:54 OpenAI Announces SearchGPT
03:15 AI achieves silver-medal standard solving International Mathematical Olympiad problems
04:55 AI existential risk probabilities are too unreliable to inform policy
06:25 Fake sponsor
08:21 LeanDojo: Theorem Proving with Retrieval-Augmented Language Models
10:10 System-1.x: Learning to Balance Fast and Slow Planning with Language Models
12:01 BOND: Aligning LLMs with Best-of-N Distillation
13:43 Outro
Mistral Large 2 release with advanced features and multilingual support. Elon Musk's announcement of the Memphis Supercluster for creating the world's most powerful AI. Discussion of emergence in complex systems and the MINT-1T dataset for training large multimodal models. Introduction of OpenDevin, an open platform for developing AI agents, and MOMAland, a benchmark framework for multi-objective multi-agent reinforcement learning.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:39 Mistral Large 2 Release
03:01 Elon Musk Announces Memphis Supercomputer
04:48 The Puzzle of How Large-Scale Order Emerges in Complex Systems
06:22 Fake sponsor
08:37 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
10:16 OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
11:53 MOMAland: A Set of Benchmarks for Multi-Objective Multi-Agent Reinforcement Learning
13:31 Outro
This episode features the introduction of Llama 3.1, Meta's cutting-edge AI model with remarkable flexibility and extensive language support. We delve into Alphabet's impressive 14% revenue growth, highlighting the increasing demand for AI infrastructure in cloud computing. The System-1.x Planner is explored, demonstrating its innovative balance between fast and slow planning modes, leading to enhanced performance. Finally, we discuss MovieDreamer, a groundbreaking model that elevates video generation by ensuring narrative coherence and high visual quality in long-form content.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:39 Introducing Llama 3.1: Our most capable models to date
02:59 Alphabet revenue jump shows no sign of AI denting search business
04:36 Open Source AI Is the Path Forward
05:40 Fake sponsor
07:41 System-1.x: Learning to Balance Fast and Slow Planning with Language Models
09:31 KAN or MLP: A Fairer Comparison
11:08 MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
12:53 Outro
Meta's upcoming Llama 3.1 models could outperform the current state-of-the-art closed-source LLM, OpenAI's GPT-4o. OpenAI is planning to develop its own AI chip to optimize performance and potentially supercharge its progress towards AGI. Apple's SlowFast-LLaVA is a new training-free video large language model that captures both detailed spatial semantics and long-range temporal context in video without exceeding the token budget of commonly used LLMs. Google's Conditioned Language Policy (CLP) is a general framework that builds on techniques from multi-task training and parameter-efficient finetuning to develop steerable models that can trade off multiple conflicting objectives at inference time.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:28 LLAMA 405B Performance Leaked
03:01 OpenAI Wants Its Own AI Chips
04:25 Towards more cooperative AI safety strategies
06:01 Fake sponsor
07:35 SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
09:17 AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
10:56 Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning
12:46 Outro
Claude for Android is now available, bringing AI-powered assistance to a wider audience. MIT researchers have developed a new machine-learning framework that can predict materials' thermal properties up to 1,000 times faster than other AI-based techniques, potentially improving energy efficiency. TinkerBird, a vector database designed for efficient storage and retrieval of high-dimensional vectors, is disrupting traditional RAG workflows and eliminating the roundtrip delays associated with client-server models. ChatQA 2, a Llama3-based model from NVIDIA, bridges the gap between open-access LLMs and leading proprietary models in long-context understanding and retrieval-augmented generation. Meanwhile, Stable Audio Open, an open-weights text-to-audio model from Stability AI, showcases the potential for high-quality stereo sound synthesis at 44.1 kHz.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:34 Claude for Android is here
02:50 AI method radically speeds predictions of materials' thermal properties
04:44 TinkerBird
06:10 Fake sponsor
08:10 ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
09:54 Stable Audio Open
11:28 Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
13:54 Outro
OpenAI has released its newest model, GPT-4o mini, which is more cost-efficient and excels in mathematical reasoning and coding tasks. Mistral AI and NVIDIA's Mistral NeMo 12B is a state-of-the-art language model with unprecedented accuracy and enterprise-grade support. Transcribro, a new private, on-device speech recognition keyboard and service for Android, has also been released. Research papers explore the impact of vocabulary size on language model scaling, the use of large datastores for retrieval-based language models, and a method for generating long, consistent sequences of street-level views of a cityscape using autoregressive video diffusion.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:40 OpenAI Announces GPT 4o mini
03:11 Mistral AI and NVIDIA Unveil Mistral NeMo 12B, a Cutting-Edge Enterprise AI Model
05:28 Transcribro: Private and on-device speech recognition keyboard and service for Android
06:43 Fake sponsor
08:49 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
10:19 Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
11:49 Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion
13:26 Outro
Apple, Nvidia, Anthropic, and Salesforce caught using content without creators' consent for AI training. Mistral AI launches two new open-source models, Codestral Mamba and Mathstral, with impressive capabilities. NVIDIA transitions to fully open-source GPU kernel modules, offering new capabilities and easy switching for users. Exciting research papers include Ref-AVS for multimodal object segmentation, Qwen2-Audio for large-scale audio-language modeling, and DiT-MoE for scaling diffusion transformers for image generation.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:27 Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI
02:46 Mistral's New Open Source Models
04:09 NVIDIA Transitions Fully Towards Open-Source GPU Kernel Modules
05:37 Fake sponsor
07:15 Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
08:47 Qwen2-Audio Technical Report
10:49 Scaling Diffusion Transformers to 16 Billion Parameters
12:21 Outro
AI expert Andrej Karpathy is starting an AI+Education company called Eureka Labs, which aims to create an ideal experience for learning something new by leveraging recent progress in generative AI. YouTube Music is rolling out new AI-powered features, including 'Sound Search', which lets users search YouTube's catalog of over 100 million songs by singing, humming, or playing a tune, and 'AI-generated conversational radio', which creates a tailored playlist based on natural language prompts. "xLSTMTime: Long-term Time Series Forecasting with xLSTM" explores the use of extended LSTM (xLSTM) for improving long-term time series forecasting, demonstrating superior forecasting capabilities compared to other state-of-the-art models. "Still-Moving: Customized Video Generation without Customized Video Data" introduces a framework that seamlessly integrates the spatial prior of a customized text-to-image (T2I) model with the motion prior of a T2V model, achieving impressive results on personalized, stylized, and conditional video generation tasks.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:57 Andrej Karpathy starting an AI Education Company
03:27 YouTube Music sound search rolling out, AI ‘conversational radio' in testing
05:12 GB200 Hardware Architecture & Component Supply Chain & BOM
06:50 Fake sponsor
08:34 xLSTMTime: Long-term Time Series Forecasting With xLSTM
10:04 Still-Moving: Customized Video Generation without Customized Video Data
11:47 NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?
13:32 Outro
OpenAI's mysterious code name 'Strawberry' and its potential to revolutionize AI capabilities. The appearance of new models in the LMSYS Chatbot Arena, potentially hinting at a new mini-GPT release from OpenAI. The introduction of EM-LLM, a new model that integrates human episodic memory and event cognition into LLMs, improving their ability to process extensive contexts. The creation of SPIQA, a large-scale question answering dataset specifically designed to interpret complex figures and tables within the context of scientific research articles.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:44 Exclusive: OpenAI working on new reasoning technology under code name ‘Strawberry'
03:25 Mysterious Models in LMSys Arena
04:51 Run Cuda on AMD GPUs
06:14 Fake sponsor
07:52 Human-like Episodic Memory for Infinite Context LLMs
09:56 Toto: Time Series Optimized Transformer for Observability
11:28 SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
13:21 Outro
Google DeepMind's latest research on robot navigation using Gemini 1.5 Pro. OpenAI's new five-tier system to track progress towards artificial general intelligence. The development of AI-powered memory aids and the ethical implications of relying on technology for memory recall. Microsoft's innovative encoding framework, SheetCompressor, for large language models to understand and reason about spreadsheets.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:43 Can Gemini 1.5 Pro's long context window help robots navigate the world?
03:00 OpenAI Scale Ranks Progress Toward ‘Human-Level' Problem Solving
04:47 Inside the AI memory machine
05:56 MambaVision: A Hybrid Mamba-Transformer Vision Backbone
07:01 Fake sponsor
09:09 Still-Moving: Customized Video Generation without Customized Video Data
10:41 SpreadsheetLLM: Encoding Spreadsheets for Large Language Models
12:51 Outro
Microsoft and Apple drop OpenAI board plans due to increased regulatory scrutiny in the AI sector. Research papers on enhancing the mathematical reasoning capabilities of large language models and improving mathematical problem-solving in visual contexts using multi-modal large language models (MLLMs). FlashAttention-3, an algorithm that speeds up the attention mechanism in large language models, running up to 2 times faster than previous versions while maintaining accuracy with lower-precision numbers. Adaptive In-Context Learning, a technique that simplifies the overall machine learning pipeline, making it accessible to more organizations.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
02:01 Microsoft, Apple Drop OpenAI Board Plans as Scrutiny Grows
03:41 Reproducing GPT-2 in C and CUDA
04:48 Adaptive In-Context Learning
06:25 FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
07:53 Fake sponsor
10:06 Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On
11:35 MAVIS: Mathematical Visual Instruction Tuning
13:38 Outro
a16z's Oxygen initiative to provide AI startups with access to GPUs at below-market rates in exchange for equity, potentially reshaping the AI VC landscape. OpenAI's decision to block users in China from accessing its tools and services, potentially accelerating the development of homegrown models by Chinese AI companies. AMD's acquisition of Silo AI for $665M to enhance its AI chip capabilities and compete against industry leader Nvidia. Three exciting AI papers on the limitations of vision language models, a new approach for video instruction tuning, and a novel framework for multi-agent collaboration called the Internet of Agents.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:35 a16z is trying to keep AI alive with Oxygen initiative
03:14 Chinese developers scramble as OpenAI blocks access in China
04:52 AMD to acquire Finnish startup Silo AI for $665M to step up in AI race
06:18 Fake sponsor
08:00 Vision language models are blind
09:30 Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
11:17 Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
13:37 Outro
The MMLU-Pro dataset is a more robust and challenging massive multi-task language understanding dataset, tailored to more rigorously benchmark large language models' capabilities. The Composable Interventions framework allows researchers to study the effects of applying multiple interventions to a language model, where the order in which interventions are applied can have a significant impact on their effectiveness. The MJ-Bench benchmark evaluates the effectiveness of different types of multimodal judges in providing feedback for text-to-image generation models, and the experiments reveal that closed-source VLMs generally provide better feedback. The Associative Recurrent Memory Transformer (ARMT) combines transformer self-attention for local context with segment-level recurrence for storage of task-specific information distributed over a long context, setting a new performance record on the recent BABILong multi-task long-context benchmark.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:32 MMLU-Pro Release on HuggingFace Datasets
03:48 Extrinsic Hallucinations in LLMs
04:53 RouteLLM
06:13 Fake sponsor
08:14 Composable Interventions for Language Models
09:45 MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
11:31 Associative Recurrent Memory Transformer
13:30 Outro
OpenAI's internal messaging systems were breached last year and sensitive details about the company's technology were stolen, raising questions about OpenAI's security protocols and how it handles such incidents. Despite the popularity of generative AI chatbots like ChatGPT, Google Search's market dominance is actually growing, which may not sit well with antitrust regulators. "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" proposes a new class of sequence modeling layers called Test-Time Training (TTT) layers, which have the potential to be a powerful tool for sequence modeling. "Finding Visual Task Vectors" explores visual prompting, a technique with potential applications in areas such as robotics and autonomous systems that provides an alternative to traditional supervised learning.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:38 A Hacker Stole OpenAI Secrets
03:05 ChatGPT might rule the AI chatbots — but it can't beat Google Search
04:52 MobileLLM
06:17 Fake sponsor
08:20 Learning to (Learn at Test Time): RNNs with Expressive Hidden States
10:17 Many-Shot In-Context Learning
11:53 Finding Visual Task Vectors
13:38 Outro
Moshi, the first real-time AI voice assistant with 70 different emotions and speaking styles, has been unveiled by French startup Kyutai. ElevenLabs' Reader App now features "Iconic Voices", which uses AI-generated voices of late Hollywood stars to read text content within the app. Google DeepMind's paper "On scalable oversight with weak LLMs judging strong LLMs" explores scalable oversight protocols using large language models (LLMs) to enable humans to supervise superhuman AI. "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" proposes a new approach to sequence modeling using Test-Time Training (TTT) layers, which make the hidden state a machine learning model itself.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:32 Unveiling of Moshi: the first voice-enabled AI openly accessible to all
02:38 ElevenLabs Iconic Voices
04:12 Your guide to AI: July 2024
05:25 Fake sponsor
07:17 On scalable oversight with weak LLMs judging strong LLMs
08:54 Reasoning in Large Language Models: A Geometric Perspective
10:17 Learning to (Learn at Test Time): RNNs with Expressive Hidden States
12:12 Outro
HuggingFace has upgraded the Open LLM Leaderboard to v2, adding new benchmarks and improving the evaluation suite for easier reproducibility. Gemma 2, a new addition to the Gemma family of lightweight open models, delivers the best performance for its size and offers competitive alternatives to models that are 2-3× bigger. SeaKR is a new model that re-ranks retrieved knowledge based on the LLM's self-aware uncertainty, outperforming existing adaptive RAG methods in generating text with relevant and accurate information. Step-DPO is a new method that enhances the robustness and factuality of LLMs by learning from human feedback, achieving impressive results in long-chain mathematical reasoning.

Contact: sergi@earkind.com

Timestamps:
00:34 Introduction
01:21 HuggingFace Updates Open LLM Leaderboard
03:19 Gemma 2: Improving Open Language Models at a Practical Size
04:16 From bare metal to a 70B model: infrastructure set-up and scripts
05:21 Fake sponsor
07:11 SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
08:47 Simulating Classroom Education with LLM-Empowered Agents
10:16 Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
12:31 Outro
OpenAI's advanced Voice Mode for ChatGPT Plus users has been delayed, but the company is taking a cautious approach to ensure safety and reliability. ESM3 is a language model that can simulate 500 million years of evolution, making biology programmable and opening up possibilities for medicine, biology research, and clean energy. R2R is an open-source project on GitHub that offers a comprehensive and state-of-the-art retrieval-augmented generation system for developers, making it accessible to anyone who wants to try it out. MG-LLaVA is a new multi-modal large language model that enhances visual processing capabilities by incorporating a multi-granularity vision flow, including low-resolution, high-resolution, and object-centric features. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:36 OpenAI Delays ChatGPT Voice Mode 03:27 ESM3 Simulating 500 million years of evolution with a language model 04:38 Rag to Riches 06:00 Fake sponsor 08:11 MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning 09:49 Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon 11:13 Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models 13:02 Outro
Apple and Meta's failed partnership due to privacy concerns. IBM's integration of AI technology into quantum computing. Record labels suing AI startups for training on copyrighted material. Research papers on improving multimodal understanding, reinforcement learning, and automated software engineering. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 02:07 Apple shelved the idea of integrating Meta's AI models over privacy concerns, report says 03:25 IBM Develops The AI-Quantum Link 05:25 Record Labels Sue Two Startups for Training AI Models on Their Songs 06:50 Fake sponsor 08:42 Long Context Transfer from Language to Vision 10:27 WARP: On the Benefits of Weight Averaged Rewarded Policies 12:11 BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions 13:55 Outro
Safe Superintelligence Inc. has launched with the goal of building a safe superintelligent AI that won't turn on humanity. Dell, Nvidia, and Super Micro Computer are partnering with xAI and Elon Musk to build a massive supercomputer that could use up to 100,000 Nvidia H100 GPUs, potentially making it 4x larger than the biggest existing AI clusters. Anthropic has launched Claude 3.5 Sonnet, their latest model family, which outperforms competitor models and even their own Claude 3 Opus on a wide range of evaluations. The papers discussed in this episode explore the decision boundaries of large language models, auto-optimized training hyperparameters for IR models, and thinking step-by-step across modalities using whiteboard-of-thought. These findings could have important implications for the future development of AI. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:40 Ilya Sutskever Launches Safe Superintelligence Inc. 03:04 Dell joins forces with Nvidia, Grok, xAI and Elon Musk 04:23 Anthropic Launches Claude 3.5 Sonnet 06:10 Fake sponsor 08:16 Probing the Decision Boundaries of In-context Learning in Large Language Models 09:47 Prompts as Auto-Optimized Training Hyperparameters: Training Best-in-Class IR Models from Scratch with 10 Gold Labels 11:05 Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities 12:54 Outro
Google DeepMind's new AI tool that generates video soundtracks by combining text prompts with visual content. Challenges of building large training AI clusters, including power, network topology, and reliability. How large language models acquire factual knowledge during pretraining and their probabilistic reasoning capabilities. LLARVA's vision-action instruction tuning that enhances robot learning. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:47 Google DeepMind's new AI tool uses video pixels and text prompts to generate soundtracks 03:31 100,000 H100 Clusters: Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing 05:22 Large language model data pipelines and Common Crawl (WARC/WAT/WET) 06:47 Fake sponsor 08:20 How Do Large Language Models Acquire Factual Knowledge During Pretraining? 10:01 What Are the Odds? Language Models Are Capable of Probabilistic Reasoning 11:22 LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning 13:06 Outro
TikTok is expanding its Symphony ad suite with AI-generated avatars of creators and paid actors, as well as a global translation tool for multi-language support. NVIDIA has released an open synthetic data generation pipeline for training large language models, which could benefit industries that rely on natural language processing. Cohere's latest generative models, Command R and R+, can automate and streamline complex business workflows, saving time and increasing efficiency. XLand-100B is a large-scale dataset for in-context reinforcement learning, providing a challenging benchmark for researchers in the field. CountGen addresses the challenge of controlling the number of depicted objects in text-to-image generation, while MM-NIAH is the first benchmark specifically designed to test how well existing multimodal large language models comprehend long multimodal documents. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:23 TikTok ads may soon contain AI-generated avatars of your favorite creators 02:59 NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models 04:43 Automating Complex Business Workflows with Cohere: Multi-Step Tool Use in Action 06:17 Fake sponsor 08:22 XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning 10:23 Make It Count: Text-to-Image Generation with an Accurate Number of Objects 11:58 Needle In A Multimodal Haystack 13:37 Outro
Meta has paused its plans to train AI models on EU users' Facebook and Instagram posts due to concerns about privacy violations and lack of transparency. McDonald's is ending its AI drive-thru ordering partnership with IBM, but is confident that a voice-ordering solution for drive-thru will be part of their restaurants' future. "Creativity Has Left the Chat: The Price of Debiasing Language Models" explores the trade-off between consistency and creativity when selecting the appropriate model for creative tasks such as copywriting and ad creation. "VideoGUI: A Benchmark for GUI Automation from Instructional Videos" highlights the need for better models and benchmarks to advance GUI automation. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:35 Meta won't train AI on Euro posts after all, as watchdogs put their paws down 03:10 McDonald's will stop testing AI to take drive-thru orders, for now 04:52 An Interview with AMD CEO Lisa Su About Solving Hard Problems 05:53 Fake sponsor 07:52 Creativity Has Left the Chat: The Price of Debiasing Language Models 09:23 VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs 11:19 VideoGUI: A Benchmark for GUI Automation from Instructional Videos 12:58 Outro
Samsung showcases new manufacturing roadmap and AI chipmaking platform to compete with TSMC. OpenAI CTO addresses Elon Musk's criticism and reveals that their internal models aren't far ahead of what's available for free. Meta's MLow low-bitrate audio codec improves audio quality for slow-speed connections. Google DeepMind's TransNAR model combines Transformers with neural algorithmic reasoners for better algorithmic reasoning. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:42 Samsung Showcases AI-Era Vision and Latest Foundry Technologies at SFF 2024 02:59 OpenAI CTO Speaks About Elon Musk and Future Models 04:42 MLow: Meta's low bitrate audio codec 05:51 Fake sponsor 08:08 Depth Anything V2 09:46 Transformers meet Neural Algorithmic Reasoners 11:14 Can't Hide Behind the API: Stealing Black-Box Commercial Embedding Models 12:54 Outro
Microsoft has retired its Copilot GPT Builder feature, citing a shift in focus towards enterprise and commercial applications. TextGrad is a framework that performs automatic "differentiation" via text, using natural language feedback from large language models to optimize variables in computation graphs. "What If We Recaption Billions of Web Images with LLaMA-3?" is a paper that recaptioned 1.3 billion images from a web-crawled dataset using LLaMA-3, resulting in enhanced zero-shot performance in cross-modal retrieval tasks and improved alignment with users' text instructions for generative models. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:38 Elon Musk withdraws lawsuit against OpenAI 02:47 Microsoft Kills Copilot GPT Builder 04:18 Uncensor any LLM with abliteration 05:58 Fake sponsor 07:52 TextGrad: Automatic "Differentiation" via Text 09:41 Simple and Effective Masked Diffusion Language Models 10:52 What If We Recaption Billions of Web Images with LLaMA-3? 12:31 Outro
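The TextGrad loop described above can be illustrated with a toy sketch. Note the assumptions: the `critic` and `apply_feedback` functions below are hypothetical stubs standing in for LLM calls, not the paper's actual API — in the real framework, both steps are performed by a language model producing and applying natural-language feedback.

```python
# Toy sketch of the TextGrad idea: variables are strings, "gradients" are
# natural-language critiques, and each optimizer step rewrites the variable.
# Both functions below are stand-ins for what would be LLM calls.

def critic(variable: str, target: str) -> str:
    """Return a textual 'gradient': feedback on how to improve the variable."""
    missing = [w for w in target.split() if w not in variable.split()]
    if not missing:
        return "no change needed"
    return "add the words: " + " ".join(missing)

def apply_feedback(variable: str, feedback: str) -> str:
    """Apply the textual gradient (a real system would ask an LLM to rewrite)."""
    prefix = "add the words: "
    if feedback.startswith(prefix):
        return variable + " " + feedback[len(prefix):]
    return variable

prompt = "Summarize the paper"
target = "Summarize the paper in two sentences"
for _ in range(3):  # a few "optimization" steps
    fb = critic(prompt, target)
    if fb == "no change needed":
        break
    prompt = apply_feedback(prompt, fb)

print(prompt)  # prints: Summarize the paper in two sentences
```

The point of the sketch is the control flow: feedback flows backward through a computation graph as text rather than as numeric gradients.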
Apple's unique approach to AI development, focusing only on personal devices and prioritizing user privacy. The ARC Prize competition pushing the boundaries of AI development towards AGI, incentivizing open-source research. "Improve Mathematical Reasoning in Language Models by Automated Process Supervision" paper proposing a novel approach to improving mathematical reasoning performance of large language models. "The Prompt Report: A Systematic Survey of Prompting Techniques" paper establishing a structured understanding of prompts for GenAI systems. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:56 Apple execs explain why its AI is different from competitors 03:11 ANNOUNCING ARC PRIZE 04:56 Introducing Apple's On-Device and Server Foundation Models 06:17 Fake sponsor 08:30 Improve Mathematical Reasoning in Language Models by Automated Process Supervision 10:22 Simple and Effective Masked Diffusion Language Models 11:59 The Prompt Report: A Systematic Survey of Prompting Techniques 13:45 Outro
Perplexity, an AI startup, has been accused of plagiarism by news outlets like Forbes and CNBC, raising concerns about the erosion of trust in media and the impact of AI on journalism. The article "TechScape: How cheap, outsourced labour in Africa is shaping AI English" from The Guardian highlights the impact of outsourcing AI training to anglophone knowledge workers in parts of the global south, and raises questions about effects on language, culture, and identity. The paper "Show, Don't Tell: Aligning Language Models with Demonstrated Feedback" from Stanford University introduces a method called DITTO that uses a small number of demonstrations to customize language models, showing promising results in fine-grained style and task alignment. "WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild" from the Allen Institute for AI and the University of Washington introduces an automated evaluation framework designed to benchmark large language models on challenging real-world user queries, providing a more reliable and interpretable evaluation of models' performance. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:36 AI startup Perplexity accused of ‘directly ripping off' news outlets like Forbes, CNBC without proper credit 03:32 TechScape: How cheap, outsourced labour in Africa is shaping AI English 04:34 Thread: an AI jupyter notebook 05:29 Fake sponsor 07:34 Show, Don't Tell: Aligning Language Models with Demonstrated Feedback 08:56 WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild 10:46 Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? 12:28 Outro
The impending launch of the real Siri by Apple, with improvements in reliability and integration inside apps. The Mixture-of-Agents approach to leverage the collective strengths of multiple large language models, achieving state-of-the-art performance. The Proofread feature in Google's Gboard, using a large language model to provide sentence-level and paragraph-level corrections with a single tap. The Comprehensive RAG Benchmark, shedding light on the limitations of current question answering models and laying the groundwork for a KDD Cup 2024 challenge. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 02:13 Is Apple about to finally launch the real Siri? 04:03 WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI 05:10 Claude's Character 06:47 Fake sponsor 08:45 Mixture-of-Agents Enhances Large Language Model Capabilities 10:18 Proofread: Fixes All Errors with One Tap 11:53 CRAG -- Comprehensive RAG Benchmark 14:03 Outro
AI used to predict potential new antibiotics in groundbreaking study. Stable Audio Open: an open source model that allows users to create short audio samples and sound effects from text prompts. The ethical responsibilities of AI researchers when it comes to warning about the dangers of advanced artificial intelligence. Cutting-edge research on AI and robotics, including large-scale simulations, in-context learning, and skill composition in modular arithmetic tasks. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:21 AI used to predict potential new antibiotics in groundbreaking study 02:40 Introducing Stable Audio Open - An Open Source Model for Audio Samples and Sound Design 04:22 A Right to Warn about Advanced Artificial Intelligence 05:23 Fake sponsor 07:10 RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots 08:56 Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks 10:36 Guiding a Diffusion Model with a Bad Version of Itself 12:24 Outro
Amazon's new AI system to detect damaged or incorrect items before they ship. Elon Musk's controversial decision to prioritize X and xAI over Tesla for AI chips. "To Believe or Not to Believe Your LLM" paper on uncertainty quantification in Large Language Models. "Guiding a Diffusion Model with a Bad Version of Itself" paper on improving image generation with diffusion models. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:47 Learn how Amazon uses AI to spot damaged products before they're shipped to customers 03:17 Elon Musk ordered Nvidia to ship thousands of AI chips reserved for Tesla to X and xAI 05:08 FineWeb: decanting the web for the finest text data at scale 06:16 Fake sponsor 08:05 To Believe or Not to Believe Your LLM 09:33 Guiding a Diffusion Model with a Bad Version of Itself 11:06 Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks 12:54 Outro
Microsoft is investing $3.2 billion in Sweden for cloud and AI infrastructure, deploying 20,000 advanced graphics processing units and training 250,000 Swedes with AI skills over three years. "Grokfast" is a new algorithm that accelerates generalization under the grokking phenomenon in machine learning by amplifying the slow-varying component of gradients, improving performance on tasks like image classification. "Zipper" is a multi-tower decoder architecture that uses cross-attention to flexibly compose multimodal generative models from independently pre-trained unimodal decoders, showcasing superior performance in tasks like speech-to-text generation. "MetRag" is a new framework for retrieval augmented generation that combines similarity and utility-oriented models, using an LLM as a task adaptive summarizer to generate knowledge-augmented text and outperforming existing models on knowledge-intensive tasks like finance and medicine. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:49 Microsoft to invest $3.2 bln in Swedish cloud, AI 03:42 State Space Duality (Mamba-2) Part I - The Model 04:47 Sam Altman, Lately 06:08 Fake sponsor 08:39 Grokfast: Accelerated Grokking by Amplifying Slow Gradients 10:11 Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities 11:38 Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts 13:52 Outro
Nvidia unveils plans to accelerate the advance of artificial intelligence, partnering with companies and countries to build AI factories and releasing Nvidia ACE generative AI. Finnish startup Binit develops an AI gadget that tracks household waste to encourage recycling, with potential benefits in improving recycling efficiency. "The Intellectual Obesity Crisis" article discusses how we've become addicted to useless information, just like we evolved to crave sugar because it was a scarce source of energy. Three AI research papers are discussed, including a method to compress second-order optimizer states to lower bitwidths, the first-ever full-spectrum, multi-modal evaluation benchmark of MLLMs in video analysis, and a theoretical connection between Transformers and state-space models leading to a faster and more efficient alternative to existing models. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:21 AI hardware firm Nvidia unveils next-gen products at Taiwan tech expo 02:48 Binit is bringing AI to trash 04:39 The Intellectual Obesity Crisis 06:22 Fake sponsor 08:18 4-bit Shampoo for Memory-Efficient Network Training 10:01 Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis 11:51 Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality 13:30 Outro
Google details improvements to AI Overviews after last week's errors, aiming to provide more accurate and helpful information. Nvidia's new embedding model, NV-Embed-v1, ranks number one on the Massive Text Embedding Benchmark. Matryoshka Query Transformer (MQT) offers flexibility to Large Vision-Language Models (LVLMs) by encoding an image into a variable number of visual tokens during inference. Contextual Position Encoding (CoPE) improves the position encoding method in Large Language Models (LLMs) and solves tasks where popular position embeddings fail. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:35 AI Overviews: About last week 03:58 Nvidia Releases Embedding Model NV-Embed-v1 04:53 Multi-camera YOLOv5 on Zynq UltraScale+ with Hailo-8 AI Acceleration 06:31 Fake sponsor 08:28 Matryoshka Query Transformer for Large Vision-Language Models 10:24 Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts 11:51 Contextual Position Encoding: Learning to Count What's Important 13:30 Outro
OpenAI announces new content and product partnerships with Vox Media and The Atlantic, making their reporting and stories more discoverable to millions of OpenAI users. Mistral AI releases Codestral, a 22B parameter, open-weight model that specializes in coding tasks, beating out its code-focused rivals across top benchmarks. MAP-Neo is the first fully open-sourced bilingual LLM that provides all the details needed to reproduce the model, improving transparency in large language models. Self-Exploring Language Models (SELM) is a promising approach to improving the alignment of LLMs to human intentions through online feedback collection. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:39 A content and product partnership with The Atlantic 02:59 Mistral Releases Codestral, a Code-focused Model 04:34 How Dell Is Beating Supermicro 05:50 Fake sponsor 08:06 MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series 09:44 Self-Exploring Language Models: Active Preference Elicitation for Online Alignment 11:16 Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF 13:18 Outro
OpenAI has formed a new safety team to address concerns about AI safety and ethics, led by CEO Sam Altman and board members Adam D'Angelo and Nicole Seligman. Jan Leike, a leading AI researcher, has left OpenAI and joined Anthropic's Superalignment team, which is focused on AI safety and security. The latest version of Sentence Transformers v3 has been released, allowing for finetuning of models for specific tasks like semantic search and paraphrase mining. Exciting new research papers have been published, including MoEUT, a shared-layer Transformer design that outperforms standard Transformers on language modeling tasks, and EM Distillation, a new distillation method for diffusion models that efficiently distills them to one-step generator models without sacrificing perceptual quality. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:32 OpenAI has a new safety team — it's run by Sam Altman 03:18 Jan Leike (ex OpenAI) joins Anthropic's Superalignment Team 05:04 Sentence Transformers v3 Released 06:06 Fake sponsor 08:19 MoEUT: Mixture-of-Experts Universal Transformers 10:10 Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models 11:48 EM Distillation for One-step Diffusion Models 13:42 Outro
xAI, founded by Elon Musk, raises $6 billion in funding to accelerate the research and development of future technologies in the AI race. Google's new 'AI Overviews' search feature causes uproar with bizarre and inaccurate responses, potentially eroding trust in Google's search results. "Transformers Can Do Arithmetic with the Right Embeddings" proposes a solution to transformers' struggles with arithmetic tasks, achieving up to 99% accuracy on 100 digit addition problems. "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering" introduces SWE-agent, an autonomous system that uses a language model to interact with a computer to solve software engineering tasks, with potential to revolutionize the field. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:27 Elon Musk's xAI raises $6 billion to fund its race against ChatGPT and all the rest 02:51 Google's A.I. Search Errors Cause a Furor Online 04:17 ir-measures Documentation 05:15 Fake sponsor 07:12 Transformers Can Do Arithmetic with the Right Embeddings 08:23 Matryoshka Multimodal Models 09:58 SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering 11:47 Outro
OpenAI drama: Leaked documents and a resignation from a policy researcher. DeepSeek-Prover: A new approach to formal theorem proving using synthetic data. Dense Connector for MLLMs: A plug-and-play vision-language connector that enhances existing models. Thermodynamic Natural Gradient Descent: A new algorithm for training neural networks using natural gradient descent. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:29 On OpenAI's Sky Voice 03:04 Successful language model evals 03:58 Generative Molecular Design Isn't As Easy As People Make It Look 05:21 Fake sponsor 07:30 DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data 09:15 Dense Connector for MLLMs 10:43 Thermodynamic Natural Gradient Descent 12:37 Outro
Cohere's Aya model and dataset for multilingual AI in 101 languages through open science. "Mapping the Mind of a Large Language Model," a post from Anthropic's blog, providing a detailed look inside a modern, production-grade model. "ReVideo: Remake a Video with Motion and Content Control" paper introducing a new approach to video editing. "Dense Connector for MLLMs" paper introducing the Dense Connector, a plug-and-play vision-language connector that significantly enhances existing MLLMs. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:32 Cohere Launches Aya 03:32 Mapping the Mind of a Large Language Model 05:05 The Batch Newsletter 06:08 Fake sponsor 07:38 ReVideo: Remake a Video with Motion and Content Control 09:14 Not All Language Model Features Are Linear 10:55 Dense Connector for MLLMs 12:30 Outro
Nvidia's Q1 revenue up 262% to $26.0B, beating estimates. OpenAI's News Corp deal licenses content from WSJ, New York Post and more. PyramidInfer compresses KV cache to save memory during inference for Large Language Models. Your Transformer is Secretly Linear challenges our existing understanding of transformer architectures. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 01:55 Nvidia's Q1 revenue up 262% to $26.0B, beating estimates 03:23 OpenAI's News Corp deal licenses content from WSJ, New York Post, and more 04:57 Systematically Improving Your RAG 06:18 Fake sponsor 08:17 PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference 09:49 Reducing Transformer Key-Value Cache Size with Cross-Layer Attention 11:48 Your Transformer is Secretly Linear 13:26 Outro
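To see why KV-cache compression work like PyramidInfer and cross-layer attention matters, a back-of-envelope calculation helps. The sketch below uses illustrative, roughly Llama-2-7B-like hyperparameters chosen for this example, not figures from either paper:

```python
# Back-of-envelope KV-cache size for a decoder-only transformer: the cache
# stores one key and one value vector per layer, head, and token position.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elt=2):
    # 2x for keys and values; bytes_per_elt=2 assumes fp16/bf16 storage
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elt

size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                      seq_len=4096, batch=8)
print(f"{size / 2**30:.1f} GiB")  # prints: 16.0 GiB
```

At a batch of 8 and a 4k context, the cache alone consumes 16 GiB before compression — which is why halving it (or better) translates directly into higher serving throughput.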