Podcasts about VRAM

  • 57 podcasts
  • 111 episodes
  • 1h 13m average duration
  • 1 episode every other week
  • Latest: May 1, 2025

POPULARITY

(popularity chart, 2017–2024)


Best podcasts about VRAM

Latest podcast episodes about VRAM

The Hardware Unboxed Podcast
8GB GPUs Are Very Bad Now, Is The RX 9060 XT in Trouble?

Apr 25, 2025 · 73:36


Episode 69: The GeForce RTX 5060 Ti 8GB is really bad, there are many problems with it (especially at the price), so is the upcoming AMD Radeon RX 9060 XT 8GB in trouble? We discuss all of that in today's episode, and yes, we're getting into VRAM yet again.

CHAPTERS
00:00 - Intro
00:33 - 8GB GPUs are Dead on Arrival
13:54 - The Main Problem is the Name
34:56 - Can it Use the Advertised Features?
41:18 - AMD Radeon RX 9060 XT Rumor Talk
59:16 - Updates From Our Boring Lives

SUBSCRIBE TO THE PODCAST
Audio: https://shows.acast.com/the-hardware-unboxed-podcast
Video: https://www.youtube.com/channel/UCqT8Vb3jweH6_tj2SarErfw

SUPPORT US DIRECTLY
Patreon: https://www.patreon.com/hardwareunboxed

LINKS
YouTube: https://www.youtube.com/@Hardwareunboxed/
Twitter: https://twitter.com/HardwareUnboxed
Bluesky: https://bsky.app/profile/hardwareunboxed.bsky.social

Hosted on Acast. See acast.com/privacy for more information.

Radio Sentai Castranger
Radio Sentai Castranger [530] Come On & Vram

Apr 12, 2025 · 152:16


The full seven casters are here this week as we continue Akiba April! Shouma and Rakia get to some Space Jam-nanigans, Doctor Who-rannoRanger teaches the Gozyugers how to be a Sentai, and then for Akibaranger, cosplay goes out of control, Yumeria's mom comes to visit, and the team visits the Toei filming lot.   Casters Present:  Blue Gray Yellow Orange Green North Red  Show Notes: https://www.patreon.com/posts/126479116 Required Viewing: Kamen Rider Gavv 29,  No.1 Sentai Gozyuger 7,  Hikonin Sentai Akibaranger 4-6  Watch on YouTube: https://www.youtube.com/watch?v=nHrmRB-B3hg   Hungry? Get CA$15 off your first 3 UberEats orders of CA$20 or more! https://ubereats.com/feed?promoCode=eats-christopherm5931ue Get $5 off your first order with SkipTheDishes! https://www.skipthedishes.com/r/6YaJc65HKg

Broken Silicon
303. AMD 9070 XTX 32GB, RX 9060 XT & RTX 5070 VRAM, FSR 4, Zen 6 | Hardware Unboxed

Mar 30, 2025 · 156:50


Tim joins to discuss the GPU market, AMD Zen 6 Medusa, and Nvidia shafting PC Gamers…

[SPON: Use "brokensilicon" at CDKeyOffer for $23 Win11 Pro: https://www.cdkeyoffer.com/cko/Moore11 ]
[SPON: Get a $10 coupon for Flex PCBs at JLCPCB: https://shorturl.at/mkloy ]
[SPON: Save BIG on the MINISFORUM BD795 Series Motherboards: https://amzn.to/43Oy6P1 ]

0:00 Hardware Unboxed's role in the Techtuber Space
3:35 SteamOS coming to Desktop - How will HUB handle this?
7:10 0.1% Lows and Good Testing Practices
15:45 What made Steve decide to review the RTX 5070 from his roof?
17:44 Is the 5070 12GB worse than the 3070 8GB for its time?
30:42 RX 9060 XT VRAM & Nvidia's Design Decisions
46:35 FSR 4 vs DLSS 4
1:01:45 Porting FSR 4 to RDNA 3
1:15:53 Does AMD even need MFG?
1:19:46 RDNA 5 Strategy, RX 9070 XTX, Future of RADEON
1:30:37 How badly has Nvidia damaged their Mindshare w/ Blackwell?
1:51:57 AMD Zen 6 on TSMC N2X
1:58:59 Medusa Halo & Nvidia APUs
2:06:18 Biggest mistakes made by Intel, AMD, XBOX
2:20:55 AI Sucks

Subscribe to the HUB Podcast: https://www.youtube.com/@TheHardwareUnboxedPodcast
Subscribe to Monitors Unboxed: https://www.youtube.com/@monitorsunboxed
HUB RT Noise video: https://youtu.be/9ptUApTshik
HUB FSR 4 vs DLSS 4 Review: https://youtu.be/H38a0vjQbJg
HUB 9070 XT vs 5070 Ti: https://youtu.be/tHI2LyNX3ls
HUB Pricing Analysis: https://youtu.be/eGx_T8zCkWc
MUB 27" OLED Review: https://youtu.be/tBjB5ZUAfAE
MLID Zen 6 Leak: https://youtu.be/970JyCapx8A
MLID Sound Wave Leak: https://youtu.be/9lEsAA6zVjo
MLID 9070 / 5070 Launch Analysis: https://youtu.be/huy65HPPLSY
https://www.tomshardware.com/pc-components/gpus/lisa-su-says-radeon-rx-9070-series-gpu-sales-are-10x-higher-than-its-predecessors-for-the-first-week-of-availability

Geekshow Podcast
Geekshow Helpdesk: Artificial Intelligence and Hearts

Mar 20, 2025 · 60:54


- RTX Pro 6000: Nvidia's RTX Pro 6000 has 96GB of VRAM and 600W of power
- Bambu big printer!!! https://www.tomshardware.com/3d-printing/bambu-lab-announces-new-printer-h2d#
- The first Sodium Ion battery for the masses: https://www.theverge.com/news/631357/elecom-power-bank-battery-sodium-ion
- Stranded astronauts make it back finally: https://www.npr.org/2025/03/18/nx-s1-5331907/nasa-astronauts-return-long-space-station-suni-williams-butch-wilmore
- AI search engines are wrong 60% of the time: "AI search engines give incorrect answers at an alarming 60% rate, study says"
- E2EE is coming for RCS messaging on iOS and Android: "RCS Messaging Adds End-to-End Encryption Between Android and iOS"
- Idiocracy has begun: "Have Humans Passed Peak Brain Power?"
- PEBBLE IS BACK! With actual products now: "The first new Pebble smartwatches are coming later this year"
- Alexa+: https://www.aboutamazon.com/news/devices/new-alexa-generative-artificial-intelligence
- Oh Roku… not you too. I may switch to Apple (barf) TV… https://arstechnica.com/gadgets/2025/03/roku-says-unpopular-autoplay-ads-are-just-a-test/
- Reduce your Surgery Risk: https://gizmodo.com/why-surgeries-on-fridays-are-riskier-2000571312
- Artificial Hearts are cool! https://gizmodo.com/patient-with-artificial-heart-smashes-survival-record-2000574948

Die Simulanten
Episode 110 - Jäger der verlorenen FPS

Mar 16, 2025 · 98:46


Lossless Scaling? Duck? Dog? TLOD? OLOD? Frame Generation? 4K? VRAM? When chasing high frame rates in the flight simulator, there is a pile of jargon to throw around. So your three Simulanten want to take a closer look at the topic, with the question: what is the best way to get the sim running smoothly? Anything goes, nothing is required? Tune in and find out!

Pixel Perfect Videojuegos
E104 - Death Stranding 2, Split Fiction, Nintendo Switch 2, Gráficas AMD, Portatil Xbox, Doom Dark Ages

Mar 13, 2025 · 141:45


Pixel Perfect Videojuegos, the radio show from no radio station, presents episode 104 (13/03/2025). We're on YouTube (@elpixelpodcast), TikTok and Patreon (audio at 320 kbps before anyone else)! We thank our amazing community and welcome our new patron Germán. Subscribe and hit the bell.

In Lo Más Fresco we follow the Nintendo Switch 2 rumors closely. After the first official look, technical leaks reveal NFC support in the Joy-Cons, Wi-Fi 6 and even a mouse function. Built-in voice chat? A patent confirms it. Meanwhile, in Made in Japan we analyze the launch of AMD's RX 9070 and 9070 XT. At last, real competition for NVIDIA: 16 GB of VRAM, matching rasterization, and FSR4 (developed with Sony), which ties with DLSS thanks to AI. And they are actually available in stores! That said, the recommended price did not last long. Although they lose in productivity and Frame Generation, for gaming they are a solid option. The shake-up the market needed?

As always, news from across the industry: Split Fiction (from the creators of It Takes Two) shakes up co-op with its Friend Pass: play with a buddy without them having to buy the game. Death Stranding 2 debuts a 10-minute trailer: epic treks, vehicle combat and a very Kojima villain set to sweep through in June. Microsoft confirms Xbox Next (2027-2028) with a GPU equivalent to an RTX 5070 and 32 GB of RAM, while planning a handheld with Asus to compete with the Steam Deck. And in Terminator 2D: No Fate, Bitmap Bureau's pixel art revives the T-800 in a retro beat 'em up that will make the arcades weep.

And to wrap up, in Quemando Controles: Dani gets lost in Arcade Planet 3.0, one of the biggest arcades in Europe (House of the Dead, Time Crisis, even Luigi's Mansion), while Nacho is still deep into Kingdom Come Deliverance 2 (update 1.2 with barbers and a hardcore mode coming in April). As always, we close by reading your comments and reacting live. Thanks to our patrons and to everyone who makes this show possible! See you in episode 105.

Music: NFS Underground 2, Super Mario Bros Remix, FIFA 05, Streets of Rage 2, Metal Gear Solid, eFootball

Radio Sentai Castranger
Radio Sentai Castranger [525] The Return of Raven

Mar 8, 2025 · 141:19


It's Castranger's 11th Anniversary, and we're celebrating by welcoming back into the fold one of the OG Castrangers - Red Caster ChouRaven! This week we talk about how an evil vtuber gets foiled by Gavv and Vram while Hanto continues to explore his Venom arc, the popularity contest and a terrifying new incarnation of Don Momotaro, and how Dan Kuroto wishes himself human again in Outsiders. Casters Present:  Blue Gray Yellow Orange North Red  Show Notes: https://www.patreon.com/posts/123895280 Required Viewing: Kamen Rider Gavv 25,  No.1 Sentai Gozyuger 3,  Kamen Rider Outsiders 5  Watch on YouTube: https://www.youtube.com/watch?v=vwYGZD538Is   Hungry? Get CA$15 off your first 3 UberEats orders of CA$20 or more! https://ubereats.com/feed?promoCode=eats-christopherm5931ue Get $5 off your first order with SkipTheDishes! https://www.skipthedishes.com/r/6YaJc65HKg

Les Cast Codeurs Podcast
LCC 322 - Maaaaveeeeen 4 !

Feb 9, 2025 · 77:13


Arnaud and Emmanuel discuss this month's news. They talk about JVM integrity, JDBC fetch size, MCP, prompt engineering, DeepSeek of course, but also Maven 4 and Maven repository proxies. And a few other things besides; happy listening. Recorded February 7, 2025. Download the episode (LesCastCodeurs-Episode-322.mp3) or watch the video on YouTube.

News

Languages

The JVM's evolution toward stronger integrity: https://inside.java/2025/01/03/evolving-default-integrity/
An article on why framework authors and users are tearing their hair out, and why the JVM team will keep guaranteeing the integrity of code and data by removing historically available APIs: dynamic agents, setAccessible, Unsafe, JNI. The article explains the risks as perceived by the JVM maintainers. Frankly, the article is a bit light on the underlying causes; it reads like self-promotion.

JavaScript Temporal, at last a clean, modern API for handling dates in JS: https://developer.mozilla.org/en-US/blog/javascript-temporal-is-coming/
JavaScript Temporal is a new object designed to replace the flawed Date object. It fixes problems such as missing time zone support and mutability. Temporal introduces concepts such as instants, wall-clock (civil) times and durations. It provides classes to handle various date/time representations, both time-zone-aware and not. Temporal makes it easier to work with different calendars (for example Chinese or Hebrew). It includes methods for comparing, converting and formatting dates and times. Browser support is experimental, with Firefox Nightly having the most complete implementation. A polyfill is available to try Temporal in any browser.

Libraries

An article on JDBC fetch size and its impact on your applications: https://in.relation.to/2025/01/24/jdbc-fetch-size/
Who actually knows their driver's default fetch size? Depending on your use cases it can be devastating: for example, an app that returns 12 rows against Oracle's default fetch size of 10 makes two round trips for nothing, and once 50 rows come back the database, not Java, becomes the limiting factor. So raising the fetch size pays off: you spend Java memory to avoid latency (a small JDBC sketch appears a few items below).

Quarkus announces the MCP servers project, to gather MCP servers written in Java: https://quarkus.io/blog/introducing-mcp-servers/
Anthropic's MCP: a JDBC database introspector, a filesystem reader, drawing with JavaFX, all easy to start with jbang and tested with Claude Desktop, goose and mcp-cli. It lets your AI tap into the power of Java libraries. Incidentally, Spring has released version 0.6 of its MCP support: https://spring.io/blog/2025/01/23/spring-ai-mcp-0

Infrastructure

Apache Flink on Kubernetes: https://www.decodable.co/blog/get-running-with-apache-flink-on-kubernetes-2
A very thorough two-part article on running Flink on Kubernetes: installation and setup, but also checkpointing, HA and observability.

Data and Artificial Intelligence

10 prompt engineering techniques: https://medium.com/google-cloud/10-prompt-engineering-techniques-every-beginner-should-know-bf6c195916c7
If you want to go further, the article references a very good white paper on prompt engineering: https://www.kaggle.com/whitepaper-prompt-engineering
The techniques covered:
Zero-Shot Prompting: ask the AI a question directly, with no prior example. Like asking someone a question without any context.
Few-Shot Prompting: give the AI one or more examples of the task you want done. Like showing someone how to do something before asking them to do it.
System Prompting: define the general context and goal of the task. Like giving the AI overall instructions about what it should do.
Role Prompting: assign the AI a specific role (teacher, journalist, etc.). Like asking someone to play a specific part.
Contextual Prompting: provide additional information or context for the task. Like giving someone everything they need to answer a question.
Step-Back Prompting: ask a general question first, then use the answer to ask a more specific one. Like asking an open question before a closed one.
Chain-of-Thought Prompting: ask the AI to show, step by step, how it reaches its conclusion. Like asking someone to explain their reasoning.
Self-Consistency Prompting: ask the same question several times and compare the answers to find the most consistent one. Like checking an answer by asking it in different ways.
Tree-of-Thoughts Prompting: let the AI explore several reasoning paths at once. Like considering all the options before making a decision.
ReAct Prompting: let the AI interact with external tools to solve complex problems. Like handing someone the tools they need to solve a problem.
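
Following up on the JDBC fetch-size item above, here is a minimal Java sketch. The connection URL, credentials, table and the value 100 are placeholders, not from the episode; the only point is that setFetchSize controls how many rows the driver fetches per database round trip.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchSizeDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; the fetch-size mechanism is the same for any JDBC driver.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//db.example.com:1521/ORCL", "app_user", "secret")) {

            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id, label FROM items WHERE category = ?")) {
                ps.setString(1, "hardware");

                // With Oracle's default fetch size of 10, a 12-row result needs two round trips.
                // Raising the value trades a bit of JVM memory for fewer round trips,
                // which usually matters more than the memory.
                ps.setFetchSize(100);

                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id") + " " + rs.getString("label"));
                    }
                }
            }
        }
    }
}
```
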
Thoughtworks' GenAI patterns: https://martinfowler.com/articles/gen-ai-patterns/
Very introductory and pre-RAG. Direct prompting, a straight call to the LLM: limited knowledge and limited control over the experience. Evals: evaluating an LLM's output with several techniques, but fundamentally a function that takes the request and the response and yields a numeric score; evaluation by an LLM (the same one or another), or human evaluation; running the evaluations from the build pipeline but also live, since LLMs can evolve. It also describes embeddings, notably of images but also of text, with the notion of context.

DeepSeek and the end of NVIDIA's dominance: https://youtubetranscriptoptimizer.com/blog/05_the_short_case_for_nvda
An article on why NVIDIA will be challenged on its margins (90% margins, after all), which rest on having the biggest GPUs and on CUDA being proprietary. But more efficient alternative hardware approaches exist (TPUs and giant wafers); Google, Microsoft and others are building their own alternative GPUs; and CUDA is less and less the lingua franca, with Apple, Google, OpenAI and others investing in alternative intermediate languages. The article also covers DeepSeek, which landed a real blow in the LLM world: they built a competitor to GPT-4o and o1 for about 5 million dollars, with impressive reasoning capabilities. The key was a pile of optimization tricks, the biggest being neural weights on 8 bits instead of 32 for everyone else, quantizing on the fly and during training; plus a lot of innovative reinforcement learning and Mixture of Experts. The result is roughly 50x cheaper than OpenAI, so there is less need for GPUs with tons of VRAM, and DeepSeek is open source. A SemiAnalysis article shifts the narrative a bit: the DeepSeek paper says a lot through its omissions. For example, the famous ~6M covers just the GPU compute, not the research costs and the various trials and errors; by comparison, Claude Sonnet cost around 10M on that same measure. DeepSeek has a lot of hardware, acquired pre-ban and some post-ban, valued at around 5 billion in investment. Their advances and their openness remain extremely interesting.

An intro to Apache Iceberg: http://blog.ippon.fr/2025/01/17/la-revolution-des-donnees-lavenement-des-lakehouses-avec-apache-iceberg/
Born from the limits of the unstructured data lake and of data warehouses, which are limited in the diversity and volume of data. Enter the lakehouse, and in particular Apache Iceberg, which came out of Netflix: schema management, but flexible; copy-on-write vs merge-on-read depending on your needs; guarantees of atomicity, consistency, isolation and durability; time travel and rollback; hidden partitions (which abstract away the partition and its transformations) and partition evolution; compatible with compute engines such as Spark, Trino, Flink, etc. The article explains the structure of the metadata and of the data.

Guillaume has fun generating short science-fiction stories by programming AI agents with LangChain4j, and also with workflows:
https://glaforge.dev/posts/2025/01/27/an-ai-agent-to-generate-short-scifi-stories/
https://glaforge.dev/posts/2025/01/31/a-genai-agent-with-a-real-workflow/
He built an automated science-fiction short-story generator using Gemini and Imagen, in Java with LangChain4j, on Google Cloud. Every night the system generates stories, completes them with illustrations created by the Imagen 3 model, and publishes them to a website. A self-reflection step uses Gemini to pick the best image for each chapter. The agent uses an explicit workflow, driven by Java code, where the steps are predefined in the code rather than relying on LLM-based planning. The code is available on GitHub and the application is deployed on Google Cloud. The article contrasts explicit-workflow agents with autonomous agents, highlighting the trade-offs of each approach, because autonomous AI agents that manage their own plan sometimes hallucinate a bit too much: they fail to establish a proper plan, do not follow it correctly, or even hallucinate "function calls". The project uses Cloud Build, Cloud Run jobs, Cloud Scheduler, Firestore as the database, and Firebase for deploying and automating the frontend. In the second article the approach is different: Guillaume uses a workflow tool rather than driving the plan with Java code. The imperative approach uses explicit Java code to orchestrate the workflow, giving precise control and parallelization. The declarative approach uses a YAML file to define the workflow, specifying the steps, inputs, outputs and execution order. The workflow includes the steps to generate a story with Gemini 2, create an image prompt, generate images with Imagen 3, and save the result in Cloud Firestore (a NoSQL database). The main advantages of the imperative approach are precise control, explicit parallelization and familiar programming tools. The main advantages of the declarative approach are workflow definitions that are arguably easier to understand (even if it is YAML, ugh!), visualization, scalability and simplified maintenance (you can just change the YAML in the console, like in the good old days of PHP in production). The drawbacks of the imperative approach include the need for programming knowledge, potential maintenance challenges and container management. The drawbacks of the declarative approach include painful YAML authoring, limited control over parallelization, no local emulator and less intuitive debugging. The choice between the approaches depends on the project's requirements, the declarative one being suited to simpler workflows. The article concludes that declarative planning can help AI agents stay focused and predictable.

Tooling

Vulnerabilities in Maven proxy repositories: https://github.blog/security/vulnerability-research/attacks-on-maven-proxy-repositories/
Whatever the language or technology, it is strongly advised to put repository managers in place as proxies, to better control the dependencies that go into your products. Michael Stepankin of the GitHub Security Lab set out to check whether those managers are not themselves a source of vulnerabilities, studying a few CVEs in products such as JFrog Artifactory, Sonatype Nexus and Reposilite. Some flaws come from the products' UIs, which render artifacts (e.g. put JavaScript inside a POM file) and even let you browse inside them (e.g. view the contents of a jar/zip, then abuse the API to read or even modify server files outside the archives). Artifacts can also be compromised by playing with proprietary URL parameters or with naming and encodings.
In short, nothing is simple at this level. Every system adds complexity, and it is important to keep them up to date. You have to actively monitor your distribution chain through several means and not bet everything on the repository manager. The author gave a talk on the subject: https://www.youtube.com/watch?v=0Z_QXtk0Z54

Apache Maven 4... soon, promise... so what will be in it? https://gnodet.github.io/maven4-presentation/
And also https://github.com/Bukama/MavenStuff/blob/main/Maven4/whatsnewinmaven4.md
Apache Maven 4: slowly but surely, that is the principle of the project. Maven 4.0.0-rc-2 is available (December 2024). Maven is more than 20 years old and is widely used across the Java ecosystem. Backward compatibility has always been a priority, but it has limited flexibility. Maven 4 introduces significant changes, notably a new build schema and code improvements.

POM changes
Separation of the Build-POM and the Consumer-POM. Build-POM: contains information specific to the build (e.g. plugins, configuration). Consumer-POM: contains only the information needed by consumers of the artifacts (e.g. dependencies).
New model version 4.1.0: used only for the Build-POM, while the Consumer-POM stays at 4.0.0 for compatibility. It introduces new elements and marks some as deprecated.
Modules renamed to subprojects: "modules" become "subprojects" to avoid confusion with Java modules. The new element replaces the old one (which is still supported).
New "bom" (Bill of Materials) packaging type: distinguishes parent POMs from dependency-management BOMs, and supports exclusions and classifier-based imports.
Explicit declaration of the root directory: lets you define the project's root directory explicitly, removing any ambiguity about where project roots are located.
New directory variables: ${project.rootDirectory}, ${session.topDirectory} and ${session.rootDirectory} for better path handling, replacing the old unofficial workarounds and deprecated internal variables.
Support for alternative POM syntaxes: introduction of a ModelParser SPI allowing alternative syntaxes for the POM; the Apache Maven Hocon Extension is an early example of this feature.

Improvements for subprojects
Automatic parent versioning: it is no longer necessary to set the parent version in each subproject. This works with model version 4.1.0 and extends to dependencies within the project.
Full support for CI-friendly variables: the Flatten Maven Plugin is no longer required; variables such as ${revision} are supported for versioning and can be set via maven.config or the command line (mvn verify -Drevision=4.0.1).
Reactor improvements and fixes: a bug fix improves handling of --also-make when resuming builds; a new --resume (-r) option restarts from the last failed subproject, and subprojects that already built successfully are skipped on resume; subfolder-aware builds make it possible to run tools on selected subprojects only. Recommendation: use mvn verify rather than mvn clean install.
Other improvements: consistent timestamps for all subprojects in packaged archives, and improved deployment: deployment only happens if all subprojects build successfully.

Workflow, lifecycle and execution changes
Java 17 required to run Maven: Java 17 is the minimum JDK for running Maven 4; older Java versions can still be targeted for compilation via Maven Toolchains. Java 17 was preferred over Java 21 because of its longer long-term support.
Plugin updates and application maintenance: removal of obsolete features (e.g. Plexus Containers, ${pom.} expressions); the Super POM is updated, changing the default plugin versions. Builds may behave differently, so pin plugin versions to avoid unexpected changes; Maven 4 prints a warning if default versions are used.
New "Fail on Severity" setting: the build can fail if log messages reach a given severity level (e.g. WARN), usable via --fail-on-severity WARN or -fos WARN.
Maven Shell (mvnsh): previously every mvn run required a full Java/Maven restart; Maven 4 introduces Maven Shell (mvnsh), which keeps a single resident Maven process open across commands, improving performance and reducing build times. Alternative: use Maven Daemon (mvnd), which manages a pool of resident Maven processes.

Architecture

An article on feature flags with Unleash: https://feeds.feedblitz.com//911939960/0/baeldungImplement-Feature-Flags-in-Java-With-Unleash
For A/B testing and faster development cycles, to "test in production". It shows how to run Unleash under Docker and add the library to Java code to test a feature flag (a minimal Java sketch appears at the end of these show notes).

Security

Keycloak 26.1: https://www.keycloak.org/2025/01/keycloak-2610-released.html
Node detection by probing the database instead of network exchanges, virtual threads for Infinispan and JGroups, OpenTelemetry tracing support, and plenty of security features.

Law, society and organization

The big pieces of a conference's costs and revenues, here for BDX I/O (http://bdx.io): https://bsky.app/profile/ameliebenoit33.bsky.social/post/3lgzslhedzk2a
Revenue: 44% tickets, 52% sponsors. Costs: 38% venue rental, 29% catering and coffee, 12% booth building, 5% speaker expenses (so not all of them).

Ask Me Anything

Julien de Provin: I really like Quarkus's "continuous testing" mode, and I was wondering whether an alternative exists outside Quarkus, or failing that, resources on how it works. I would love an agnostic tool I could use on the non-Quarkus projects I work on, even if it takes a bit of elbow grease (or knuckle grease, in this case).
https://github.com/infinitest/infinitest/

Conferences

The list of conferences, from the Developers Conferences Agenda/List by Aurélie Vache and contributors:
February 6-7, 2025: Touraine Tech - Tours (France)
February 21, 2025: LyonJS 100 - Lyon (France)
February 28, 2025: Paris TS La Conf - Paris (France)
March 6, 2025: DevCon #24: 100% IA - Paris (France)
March 13, 2025: Oracle CloudWorld Tour Paris - Paris (France)
March 14, 2025: Rust In Paris 2025 - Paris (France)
March 19-21, 2025: React Paris - Paris (France)
March 20, 2025: PGDay Paris - Paris (France)
March 20-21, 2025: Agile Niort - Niort (France)
March 25, 2025: ParisTestConf - Paris (France)
March 26-29, 2025: JChateau Unconference 2025 - Cour-Cheverny (France)
March 27-28, 2025: SymfonyLive Paris 2025 - Paris (France)
March 28, 2025: DataDays - Lille (France)
March 28-29, 2025: Agile Games France 2025 - Lille (France)
April 3, 2025: DotJS - Paris (France)
April 3, 2025: SoCraTes Rennes 2025 - Rennes (France)
April 4, 2025: Flutter Connection 2025 - Paris (France)
April 4, 2025: aMP Orléans 04-04-2025 - Orléans (France)
April 10-11, 2025: Android Makers - Montrouge (France)
April 10-12, 2025: Devoxx Greece - Athens (Greece)
April 16-18, 2025: Devoxx France - Paris (France)
April 23-25, 2025: MODERN ENDPOINT MANAGEMENT EMEA SUMMIT 2025 - Paris (France)
April 24, 2025: IA Data Day 2025 - Strasbourg (France)
April 29-30, 2025: MixIT - Lyon (France)
May 7-9, 2025: Devoxx UK - London (UK)
May 15, 2025: Cloud Toulouse - Toulouse (France)
May 16, 2025: AFUP Day 2025 Lille - Lille (France)
May 16, 2025: AFUP Day 2025 Lyon - Lyon (France)
May 16, 2025: AFUP Day 2025 Poitiers - Poitiers (France)
May 24, 2025: Polycloud - Montpellier (France)
May 24, 2025: NG Baguette Conf 2025 - Nantes (France)
June 5-6, 2025: AlpesCraft - Grenoble (France)
June 5-6, 2025: Devquest 2025 - Niort (France)
June 10-11, 2025: Modern Workplace Conference Paris 2025 - Paris (France)
June 11-13, 2025: Devoxx Poland - Krakow (Poland)
June 12-13, 2025: Agile Tour Toulouse - Toulouse (France)
June 12-13, 2025: DevLille - Lille (France)
June 13, 2025: Tech F'Est 2025 - Nancy (France)
June 17, 2025: Mobilis In Mobile - Nantes (France)
June 24, 2025: WAX 2025 - Aix-en-Provence (France)
June 25-26, 2025: Agi'Lille 2025 - Lille (France)
June 25-27, 2025: BreizhCamp 2025 - Rennes (France)
June 26-27, 2025: Sunny Tech - Montpellier (France)
July 1-4, 2025: Open edX Conference - 2025 - Palaiseau (France)
July 7-9, 2025: Riviera DEV 2025 - Sophia Antipolis (France)
September 18-19, 2025: API Platform Conference - Lille (France) & Online
October 2-3, 2025: Volcamp - Clermont-Ferrand (France)
October 6-10, 2025: Devoxx Belgium - Antwerp (Belgium)
October 9-10, 2025: Forum PHP 2025 - Marne-la-Vallée (France)
October 16-17, 2025: DevFest Nantes - Nantes (France)
November 4-7, 2025: NewCrafts 2025 - Paris (France)
November 6, 2025: dotAI 2025 - Paris (France)
November 7, 2025: BDX I/O - Bordeaux (France)
November 12-14, 2025: Devoxx Morocco - Marrakech (Morocco)
January 28-31, 2026: SnowCamp 2026 - Grenoble (France)
April 23-25, 2026: Devoxx Greece - Athens (Greece)
June 17, 2026: Devoxx Poland - Krakow (Poland)

Contact us

To react to this episode, come discuss on the Google group: https://groups.google.com/group/lescastcodeurs
Contact us on X/Twitter https://twitter.com/lescastcodeurs or Bluesky https://bsky.app/profile/lescastcodeurs.com
Submit a crowdcast or a crowdquestion
Support Les Cast Codeurs on Patreon: https://www.patreon.com/LesCastCodeurs
All the episodes and all the info at https://lescastcodeurs.com/
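
Picking up the Unleash feature-flag item from the Architecture section above, here is a minimal sketch of the pattern, assuming the io.getunleash Java client and a locally running Unleash server. The app name, URL, auth header and flag name are placeholders, and builder methods can differ slightly between client versions.

```java
import io.getunleash.DefaultUnleash;
import io.getunleash.Unleash;
import io.getunleash.util.UnleashConfig;

public class FeatureFlagDemo {
    public static void main(String[] args) {
        // Hypothetical server URL and API token for a local Unleash instance (e.g. started with Docker).
        UnleashConfig config = UnleashConfig.builder()
                .appName("checkout-service")
                .instanceId("checkout-1")
                .unleashAPI("http://localhost:4242/api/")
                .customHttpHeader("Authorization", "<client-api-token>")
                .build();

        Unleash unleash = new DefaultUnleash(config);

        // The toggle is evaluated at runtime: flipping it in the Unleash UI changes behaviour
        // without a redeploy, which is what makes "testing in prod" and A/B runs cheap.
        if (unleash.isEnabled("new-checkout-flow", false)) {
            System.out.println("Serving the new checkout flow");
        } else {
            System.out.println("Serving the legacy checkout flow");
        }
    }
}
```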

Radio Sentai Castranger
Radio Sentai Castranger [520] Send Gra-nudes

Feb 1, 2025 · 76:01


We took a week off because Ichi was in the hospital, but he dragged his way home so we wouldn't miss another week. Lane's injured his knee, Gar ate a bad tangerine, but on the plus side, Emily's back! We talk about a very plot-heavy batch of episodes for both shows. In Gavv, Vram's full backstory is revealed and he begins his change to the side of good, but Hanto's angry. In Boonboomger, things are dire as Taiya tries to revive Boondorio, and Spindo's true allegiances begin to show. And/or explode. Casters Present:  Blue Gray Orange North  Show Notes: https://www.patreon.com/posts/121242800 Required Viewing: Kamen Rider Gavv 19-20,  Bakuage Sentai Boonboomger 45-46  Watch on YouTube: https://www.youtube.com/watch?v=QevyuVgX5Gk   Hungry? Get CA$15 off your first 3 UberEats orders of CA$20 or more! https://ubereats.com/feed?promoCode=eats-christopherm5931ue Get $5 off your first order with SkipTheDishes! https://www.skipthedishes.com/r/6YaJc65HKg

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Sponsorships and applications for the AI Engineer Summit in NYC are live! (Speaker CFPs have closed.) If you are building AI agents or leading teams of AI Engineers, this will be the single highest-signal conference of the year for you.

Right after Christmas, the Chinese Whale Bros ended 2024 by dropping the last big model launch of the year: DeepSeek v3. Right now on LM Arena, DeepSeek v3 has a score of 1319, right under the full o1 model, Gemini 2, and 4o latest. This makes it the best open weights model in the world in January 2025.

There has been a big recent trend of Chinese labs releasing very large open weights models, with Tencent releasing Hunyuan-Large in November and Hailuo releasing MiniMax-Text this week, both over 400B in size. However, these extra-large language models are very difficult to serve.

Baseten was the first of the inference neocloud startups to get DeepSeek V3 online, because of their H200 clusters, their close collaboration with the DeepSeek team, and early support of SGLang, a relatively new vLLM alternative that is also used at frontier labs like X.ai. Each H200 has 141 GB of VRAM with 4.8 TB per second of bandwidth, meaning that you can use 8 H200s in a node to inference DeepSeek v3 in FP8, taking KV cache needs into account (a back-of-the-envelope sketch of that budget follows these notes).

We have been close to Baseten since Sarah Guo introduced Amir Haghighat to swyx, and they supported the very first Latent Space Demo Day in San Francisco, which was effectively the trial run for swyx and Alessio to work together! Since then, Philip Kiely also led a well-attended workshop on TensorRT LLM at the 2024 World's Fair. We worked with him to get two of their best representatives, Amir and Lead Model Performance Engineer Yineng Zhang, to discuss DeepSeek, SGLang, and everything they have learned running mission critical inference workloads at scale for some of the largest AI products in the world.

The Three Pillars of Mission Critical Inference

We initially planned to focus the conversation on SGLang, but Amir and Yineng were quick to correct us that the choice of inference framework is only the simplest, first choice of three things you need for production inference at scale:

"I think it takes three things, and each of them individually is necessary but not sufficient:

* Performance at the model level: how fast are you running this one model running on a single GPU, let's say. The framework that you use there can matter. The techniques that you use there can matter. The MLA technique, for example, that Yineng mentioned, or the CUDA kernels that are being used. But there's also techniques being used at a higher level, things like speculative decoding with draft models or with Medusa heads. And these are implemented in the different frameworks, or you can even implement it yourself, but they're not necessarily tied to a single framework. But using speculative decoding gets you massive upside when it comes to being able to handle high throughput. But that's not enough. Invariably, that one model running on a single GPU, let's say, is going to get too much traffic that it cannot handle.

* Horizontal scaling at the cluster/region level: And at that point, you need to horizontally scale it. That's not an ML problem. That's not a PyTorch problem. That's an infrastructure problem. How quickly do you go from a single replica of that model to 5, to 10, to 100? And so that's the second pillar that is necessary for running these mission critical inference workloads. And what does it take to do that? It takes, some people are like, oh, you just need Kubernetes and Kubernetes has an autoscaler and that just works. That doesn't work for these kinds of mission critical inference workloads. And you end up catching yourself wanting to bit by bit rebuild those infrastructure pieces from scratch. This has been our experience. And then going even a layer beyond that, Kubernetes runs in a single cluster, tied to a single region. And when it comes to inference workloads and needing GPUs more and more, you know, we're seeing this, that you cannot meet the demand inside of a single region. A single cloud's a single region. In other words, a single model might want to horizontally scale up to 200 replicas, each of which is, let's say, 2 H100s or 4 H100s or even a full node; you run into limits of the capacity inside of that one region. And what we had to build to get around that was the ability to have a single model have replicas across different regions. So, you know, there are models on Baseten today that have 50 replicas in GCP East and 80 replicas in AWS West and Oracle in London, etc.

* Developer experience for Compound AI Systems: The final one is wrapping the power of the first two pillars in a very good developer experience to be able to afford certain workflows like the ones that I mentioned, around multi-step, multi-model inference workloads, because more and more we're seeing that the market is moving towards those, that the needs are generally in these sort of more complex workflows."

We think they said it very well.

Show Notes
* Amir Haghighat, Co-Founder, Baseten
* Yineng Zhang, Lead Software Engineer, Model Performance, Baseten

Full YouTube Episode
Please like and subscribe!

Timestamps
* 00:00 Introduction and Latest AI Model Launch
* 00:11 DeepSeek v3: Specifications and Achievements
* 03:10 Latent Space Podcast: Special Guests Introduction
* 04:12 DeepSeek v3: Technical Insights
* 11:14 Quantization and Model Performance
* 16:19 MOE Models: Trends and Challenges
* 18:53 Baseten's Inference Service and Pricing
* 31:13 Optimization for DeepSeek
* 31:45 Three Pillars of Mission Critical Inference Workloads
* 32:39 Scaling Beyond Single GPU
* 33:09 Challenges with Kubernetes and Infrastructure
* 33:40 Multi-Region Scaling Solutions
* 35:34 SGLang: A New Framework
* 38:52 Key Techniques Behind SGLang
* 48:27 Speculative Decoding and Performance
* 49:54 Future of Fine-Tuning and RLHF
* 01:00:28 Baseten's V3 and Industry Trends

Baseten's previous TensorRT LLM workshop:
Get full access to Latent Space at www.latent.space/subscribe
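
To make the VRAM sizing claim above concrete, here is a rough back-of-the-envelope sketch. It assumes DeepSeek v3's published total of roughly 671B parameters and one byte per parameter at FP8 (figures not stated in these notes); real deployments also need room for activations and framework overhead, so the leftover number is an upper bound on what remains for the KV cache.

```java
public class VramBudget {
    public static void main(String[] args) {
        // Assumptions (not from the episode itself): ~671e9 total parameters, 1 byte/param at FP8.
        double weightsGb = 671.0 * 1.0;                // ~671 GB of FP8 weights

        double gpuVramGb = 141.0;                      // H200 HBM capacity, as stated above
        int gpusPerNode = 8;
        double nodeVramGb = gpuVramGb * gpusPerNode;   // 1128 GB per 8-GPU node

        double leftoverGb = nodeVramGb - weightsGb;    // ~457 GB for KV cache and overhead
        System.out.printf("Weights: %.0f GB, node VRAM: %.0f GB, left for KV cache/overhead: %.0f GB%n",
                weightsGb, nodeVramGb, leftoverGb);
    }
}
```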

Breaking Change
v29 - Super Switch

Jan 19, 2025 · 185:16


In this episode: Justin goes to a birthday party, drives a Tesla, and configures your BIOS. The compliments department is, as always, available at podcast@searls.co. Have some URLs: This is the combination air fryer / grill I bought Microsoft dropped support for non-SecureBoot PC updates last month Aaron's puns, ranked Nobody Cares Things we learned about LLMs in 2024 Judge ends man's 11-year quest to dig up landfill and recover $765M in bitcoin The Consensus on Havana Syndrome Is Cracking (News+) Meta kills diversity programs, claiming DEI has become “too charged” Google kills JavaScript-free searches Sonos still seems kinda fucked 5090s seem kind of like a scam The official Elder Scrolls: Oblivion remake leaked Switch 2 was unveiled Guy with 200bpm heart rate complains his watch isn't working (before admitting his heart isn't working) The Diplomat Conclave Severance Season 2 is out Marvel Rivals is a hit (with the Thirstlords) Indiana Jones and the Great Circle P.T. A Short Hike Transcript: [00:00:29] Well, good morning, everyone. If it's evening, where you are, well, it's not here. So that's just what you get. You get a good morning. You can save it for later, put it in your pocket, and then the next time the sun comes up, you can just remember, ah, yes, someone did wish me a good morning today. [00:00:48] You are currently, your ears are residing inside of Breaking Change, which is an audio production. Not to be confused with Breaking Bad, certainly not Breaking Good, just broken. [00:01:03] You know, now that officially, officially or unofficially, TikTok is down. It's unreachable in the U.S. Aaron has reported, our Seattle correspondent, for the broadcast, that even over his VPN, he can't get to TikTok. [00:01:24] His arms are itchy. He's scratching. He, ah, I hope, wherever you are, I hope that you and your loved ones and your teenagers are okay. [00:01:33] But yeah, anyway, now the TikTok is down. Maybe some of you are here, because you've got nothing else to do, and you need something to fill that void. So thank you for joining. [00:01:45] Something that I've been meaning to do at the beginning of this, of the show, for the last, well, seven versions, has been to kindly ask that you go into your podcast player of choice, and you rate and review the show. [00:02:02] I would prefer five stars on a five-star scale, but if it was a ten-star scale, you know, ten stars would be better. [00:02:10] Thumbs up, or whatever. Write a little review explaining why the fuck somebody would want to listen to an explicit language, you know, tech-adjacent programmer-ish gaming movie, whatever the fuck this is. [00:02:23] Dialogue, uh, because, uh, I have found that breaking change is a really hard pitch, you know, when, when, when, when explaining to people, it's like, oh, this is me talking, just like drive-time AM radio used to be, except instead of talking about a bunch of politically charged propaganda, uh, we're just hanging out, uh, and instead of having a commute, you know, you're walking a dog, or you're doing the dishes. [00:02:50] Although, I guess, you know, maybe you listen on a commute. [00:02:53] I, I, I've heard, I've heard from, from listeners on road trips, listening to entire episodes all in one stretch, and that's something else. [00:03:03] Uh, I have not heard from a lot of commuters, so if you listen to this while you're commuting, shout out at podcast at searles.co, uh, you know, if you're driving, don't, don't try to rate and review, you know, in a distracted fashion. 
[00:03:16] But, but next time you think of it, you know, you, you, you slam that five-star button. [00:03:20] You know what, it's, it's, I got a lot of subversive elements, you know, in my cadre of people, because I am a total piece of shit, and I attract, I attract the good and the bad, everyone in between. [00:03:32] But some of us, you know, we, we, we appreciate a good troll. [00:03:35] There is no better way to stick it to the man and, and confuse the hell out of people than for all of you to go and give this five stars in, in, in iTunes and, in your podcast player. [00:03:46] And then have a whole bunch of people, you know, have it surface in the algorithm for others. [00:03:51] And then they listen to this, and then they're like, what, what, what the fuck is going on to my ears right now? [00:03:55] Uh, I am very confused. [00:03:57] And if that's you, hell, you know what? [00:03:59] Oh, shoot. [00:03:59] But I'm, I'm speaking from the past. [00:04:01] Maybe this is the, the future where this is a lot of five-star reviews and some, some, some rando outside of Argentina is, is, is getting this put into their feed for them. [00:04:11] And now they're like, four minutes have passed. [00:04:14] What am I doing with my life? [00:04:15] Well, hello. [00:04:16] You are also welcome. [00:04:17] Good morning to you as well. [00:04:18] Uh, by the time you're listening to this, you know, I'm recording Sunday morning. [00:04:24] First thing, uh, I know from experience that it can be hard to pretend to work during a Trump inauguration. [00:04:33] So, uh, I figured that instead of pretending to work, you could be here with me instead if you're listening on Monday. [00:04:41] And if you're, if you're fortunate enough to have Monday off, um, you know, I guess one difference between the, uh, uh, the previous Trump inauguration. [00:04:51] And this one is that the, you know, inclusivity backlash against the Trump admin, you know, that has now recently receded. [00:05:02] If you're to believe the Bezos and billionaire class, uh, uh, has resulted in way, way more people who don't work at post offices getting MLK junior day off. [00:05:13] So I suppose many of us are not working on Monday, but regardless, this is a version 29 of the program titled super switch. [00:05:24] Which, you know, depending on the audience, I think a lot of, you know, probably what I mean by that. [00:05:29] We'll, we'll talk about it later. [00:05:30] Uh, in life news, it feels like it's been a way more than two weeks since I talked to y'all. [00:05:37] Uh, uh, uh, when you live in a theme park, there's just a lot going on. [00:05:42] People coming and going stuff to do, uh, uh, stimulation overload. [00:05:49] That's why I sound so just, you know, demure downbeat chill here is because I am exhausted permanently all the time. [00:06:02] Cause every time I leave the house, I am, I am just overstimulated. [00:06:05] Uh, last night we went to a birthday party of a friend, uh, in the, uh, Orlando proper part of Orlando, [00:06:12] whereas we live in theme park, Orlando. [00:06:14] So we had to, uh, drive over the, uh, the treacherous terrain known as I four, the deadliest stretch of highway in the United States in terms of, uh, only in terms of the number of people who die on it. [00:06:26] And the party was, uh, it was funny cause our, our friends, uh, they're building a house on this beautiful lake, huge property. [00:06:34] It's, it's absolutely gorgeous. [00:06:36] It's going to, the house is a custom build. 
[00:06:39] And a couple of years ago, uh, the one who's, whose birthday ended up being said, you know, we're going to have my 45th birthday party here at the house. [00:06:47] After it opens the water slide, you're going to DJs. [00:06:50] We're going to have, it's going to be a big blowout fest. [00:06:52] It's going to be awesome. [00:06:53] And then his husband was like, you know, it's, it's not going to be ready yet. [00:06:57] Don't get your hopes up. [00:06:58] And, uh, uh, sure enough, uh, both things came to pass. [00:07:04] The house is nowhere near ready. [00:07:05] It is an active construction site. [00:07:07] And they trolled us hard. [00:07:08] They said, Hey, come to this hotel. [00:07:09] We're going to have, you know, uh, uh, free valet or whatever. [00:07:12] And then like, like we go into like a normal kind of like typical ballroom thing and you get a cocktail. [00:07:19] And then these construction workers show up and they, they, they, they heard us into buses. [00:07:24] Uh, and so people are in their cocktail attire, you know, Becky wore, uh, I don't know if you'd call them heels, [00:07:32] but elevated shoes for, for first time in a while, more of a flats person, which I respect. [00:07:39] Cause I'm also a flats person and, uh, we all get into the bus and everyone's dressed up. [00:07:44] And then, uh, they, they, they drive us to, uh, the active construction site. [00:07:47] That is our friend's house. [00:07:49] And, uh, they had, uh, the events planners and everyone like, like actually just decorate the shit out of, you know, what, what is a lot of concrete block first floor of most homes around here is concrete. [00:08:01] And so the bones of the house are up and they just decorated it with kind of construction paraphernalia, orange cones. [00:08:07] All of the staff had, uh, you know, orange vests on, uh, we were all given hard hats. [00:08:11] Uh, the theming was truly on point. [00:08:15] Weather was perfect. [00:08:16] Uh, and, uh, you know, it was a big raucous affair, raucous raucous, you know what I mean? [00:08:23] So that was great. [00:08:24] Uh, we didn't even stay out that late, but I feel like I got hit by a truck, uh, this morning. [00:08:29] Uh, I, I kept it to a two drink maximum, which is my new go-to rule of thumb. [00:08:34] Uh, uh, cause I always end up barely regretting the third from a, from a, an ability to sleep perspective. [00:08:43] Afterwards, uh, other life stuff, you know, like the logistics following the death of my father. [00:08:48] First of all, thank you very much for many of you wrote in to express sympathies, uh, probably don't, don't need to put them all in the mailbag. [00:08:55] Cause that after a certain point, it started reads like, you know, reading birthday cards on air, uh, in terms of they all, you know, not to diminish anyone's, uh, extension of grief, uh, or, or, or sharing their own stories. [00:09:08] But there's a certain, you know, beginning, middle and end format to, to, to, to, to, no one knows what the fuck to say. [00:09:15] I don't know what to thank you. [00:09:18] Um, but yeah, like I know just sort of like finances and, and forensics front of trying to figure out how to tease out all the complexities of his life that he never really told anyone about and didn't certainly didn't document, uh, that the work continues still trying to help my mom consolidate her situation. [00:09:36] It's been, you know, just a lot of very procedural. 
[00:09:42] All right, find all the stuff, organize the stuff, come up with a to-do list, figure out how to like approach this, make all the phone calls that you need to make to all these institutions to, to, to, to, to iron it out and to, to continue fact finding or to, to, to give, you know, furnish whatever documentation they need. [00:09:57] And, and, and because it's been so, uh, I guess transactional wrote, like not to say it's colored my perception of dad or anything, you know, one way or another. [00:10:11] Uh, but it's definitely, when I look back on this era of my life, of course, his passing is going to stand out in sharp relief, but like, that was like a week of stuff. [00:10:21] And then the rest of it is going to be like three months of stuff. [00:10:25] Uh, and so I wonder how that's going to affect how I, how I, how I look back on it. [00:10:28] But one of the things I noticed is a lot of different service providers, uh, like banks, for example, that have, uh, uh, you know, bills coming up, you know, you got a credit card bill and let's say it's due. [00:10:45] Uh, I, I don't know why I'm blanking, but January 25th and then January 18th comes around and it says, Hey, you have a statement due January 25th. [00:10:54] Or you got an upcoming bill or you, your bill is ready to be paid. [00:10:58] And when I get an email like that, so I just got one from dad or, you know, for dad's account from us bank. [00:11:05] And I was like, shit. [00:11:07] Cause I know he didn't have auto pay set up in a lot of places. [00:11:09] Uh, and like, do I have that login? [00:11:12] Like, you know, do I have to coordinate with mom to get the SMS thing? [00:11:15] Like I get into it. [00:11:16] And then sure enough, like, cause I thought I'd set up auto pay. [00:11:19] I even had a to-do list that said, set up auto pay for this. [00:11:21] And, uh, auto pay was set up. [00:11:23] It was just emailing me unnecessarily anyway. [00:11:25] You know, if you're going to have a recurring payment or an auto payment set up, it, you know, it's, it's okay to notify the customer that there's another bill coming, but it would be really sweet. [00:11:36] If like auto pay is enabled, just so you know, you're going to, you're set to auto pay this on X and X date, uh, because if you got, you know, as many cards as some people have, uh, it can get kind of exhausting to, to just worry about, uh, well, I hope that's, that's all set up. [00:11:53] So it's, uh, things like that are just like random nonsense stressors and the amount of context switching, because you're constantly getting emails and calls from different, from all corners. [00:12:03] I normally screen my calls really aggressively, but you know, this month I've got a pretty much [00:12:08] answer it no matter who's calling, which is not my favorite. [00:12:10] And I've, I've found myself falling into something that I never thought I would do. [00:12:17] Uh, maybe it's cause I turned 40 this week, but I'm, uh, I've always associated this with like [00:12:24] an old, a generational thing. [00:12:26] When somebody asks me a yes, no question, I've started saying yes or no. [00:12:31] Like the literal word, yes. [00:12:33] And that might sound mundane to you, but in my family growing up, the word, yes, always felt [00:12:41] violent because everyone always had more to say, or they had a compulsion to soften it, you know, [00:12:49] like, yeah, sounds a lot, um, neutral, accepting, open, soft. [00:12:58] Then yes, there's a certain like hardness to yes. 
[00:13:01] You ask a yes, no question. [00:13:02] The person says, yes, it feels like there's a period at the end of that. [00:13:05] And when you say, yeah, or okay, or all right, or, you know, you give some sort of like, you know, [00:13:11] like an invitation to either continue with a follow-up question or, you know, be, be open to maybe a retort or something. [00:13:20] And so I had a colleague once who is, you know, the previous generation who is my superior. [00:13:25] And, uh, his name was Daryl. [00:13:28] Daryl's a lovely person. [00:13:29] But every time I asked Daryl a question and I was asking him a lot of questions because I didn't know shit about fuck. [00:13:34] And he knew a lot of things about everything he would, he would answer every yes, no question with just the word yes or the word no. [00:13:43] And it felt so stifling and cruel and like, you know, like, why is he shutting me down like this? [00:13:51] Even though he's literally answering in the affirmative, there's something about the word yes. [00:13:55] When unadorned with any sort of softeners or explanation or exposition or, or, or, or, or justification or, or invitation to, to, to follow up that feel there's the finality of it feels just rude, even though it is very literally fine. [00:14:12] So I caught myself doing that and I guess I've become a yes man. [00:14:16] Other life stuff. [00:14:22] Our ninja, we have a, uh, we seem to have like every ninja kitchen appliance, um, just in some sort of rotation around, uh, you know, our, our kitchen and it feels to me like every modern home that every year, the, there's like a, a counter surface inflation where the counters keep getting bigger. [00:14:44] The kitchen islands keep getting bigger. [00:14:46] And then the, almost a, um, sort of like how a, a gas will expand to fill its container. [00:14:54] Like ninja appliances will continue getting invented to fill all available counter space in every home. [00:14:59] Uh, and the reason that ninjas been so successful is that unlike Hamilton beach and Cuisinart and stuff like their, their products are actually pretty good and do what they say on the tin. [00:15:09] But we had a, uh, one of the air fryer units that can also, you know, pretend to be a grill, even though like all that's really happening is a hairdryer is blowing downward onto your food and any sort of heating element underneath is indirect. [00:15:20] Uh, we had one of those and, you know, it just kind of got grody and gross from lots of oil and, and repeat washings and, you know, food stuck to the basket. [00:15:31] And it was, it was, it was no longer, you know, how sometimes you use one of these appliances, you don't clean it as intentionally or as frequently as maybe the instruction manual tells you to. [00:15:42] And eventually your food starts tasting like, you know, the bottom of the, uh, the, the, the, the, the deep fryer at, at McDonald's, like, just like that oil tarry kind of like, you know, afterglow. [00:15:55] Which makes, it takes, it really takes the shine off of, uh, whatever the omega threes that you're trying to get out of your fishes. [00:16:00] Uh, so, so we, we bought a new one and what I really wanted out of a new one was one with like multiple heating elements. [00:16:08] Like where, where there was an actual grill that could sear stuff and cook from the bottom up, but also a convection oven that could crisp it up and, and, and, and sort of dehumidify. [00:16:18] And amazingly, Ninja does sell this product. 
[00:16:22] Uh, it was called, uh, see if I can link to it. [00:16:25] The Ninja convection plus grill. [00:16:27] Oh no, that wasn't it. [00:16:28] It's, it's got a name. [00:16:29] Uh, something, something, grid IG 651. [00:16:35] Okay. [00:16:35] There you go. [00:16:35] I'll put a link in the show notes. [00:16:37] Uh, so the IG 651, whatever, it's got like a barbecue griddle on it. [00:16:41] It seems, it seems nice. [00:16:43] Uh, and it does exactly that. [00:16:46] It's got like a big wide surface element. [00:16:48] You can, you, you plug it in. [00:16:49] It's a very complicated, unnecessarily. [00:16:51] So a complicated thing where it's, it looks like you kind of take a George Foreman style griddle. [00:16:55] It's angled forward, meaning like it's got, you know, uh, I said griddle at just like the slabby kind of, of, of metal slats, slats, you know, where you, you put the burger on it. [00:17:07] And then it's like, you know, remember the George Foreman marketing? [00:17:10] I'm sure you do like, you know, like it's like at the, like, like the, the squeezing iconography to, to indicate like the fat is coming out and then that will make this healthier, even though the fat is often the best part. [00:17:20] Uh, so it's, it's got that it plugs into some like electrical, you know, electrode input thing with two little donguses. [00:17:28] I don't know why I'm even trying to explain this. [00:17:30] It's fine. [00:17:30] And you plug that in, you can wash it separately, but you can put a griddle on top that kind of maps to it. [00:17:36] So it'll pick up that heat. [00:17:37] And that is a flat surface, which can be nice. [00:17:40] If you're, if you're maybe, you know, toasting a sandwich or something. [00:17:46] And yeah, the thing about it, the thing about that search was that trying to answer the question of what heating elements are in this smart cooking appliance proved to be extremely difficult. [00:18:00] You go to the Amazon listing, you go to the product page. [00:18:03] I read up on every single Ninja product that does this. [00:18:06] I started looking at other products that do this. [00:18:09] I started looking at things that ran themselves as smart ovens that, you know, advertise having, uh, multiple heating elements, you know, like the June oven did this. [00:18:16] I think that's out of business now. [00:18:18] Tovala did this. [00:18:18] I think that's going out of business now where they would have, you know, like, um, maybe a microwave element plus a steam cooking element, or maybe they'd have a convection fan inside and also, um, an induction plate underneath. [00:18:31] And none of them have really taken off in the U S unfortunately, uh, such that. [00:18:39] It is a product category that the consumers are educated about, like what they're getting into in Japan. [00:18:45] There's a product called health. [00:18:46] You know, like literally like health EO, but THs are hard and it's got like the basic models have four or five different ways to heat your food. [00:18:56] And then like, it's really smart in that you, you punch in a code, like a recipe code, and it'll just do everything cradle to grave for you with the advanced sensors that it has. [00:19:04] And kind of move between whatever combination at whatever point in the cooking process, all of those heating elements need to be arranged. 
[00:19:11] And so things come out almost better than a human could do them because they never have to be removed from this hermetically sealed environment, you know, for people's hands to come in and, and, and adjust how the thing is being heated. [00:19:26] Because in Japan, that product has been so successful that the two or three different tiers of that product, not only are they all good, but like, no one needs to be explained what's there. [00:19:36] Like the, the, the, the, it could just be like the higher level of literacy and, and, and education generally in Japan. [00:19:42] But in general, like, it's just, it's really straightforward. [00:19:46] And here, it seems to be that like people just want a device that they can throw food in. [00:19:52] And then as long as they're picking off a menu and it has words like grill, they will feel good about it. [00:19:58] And no one's going to ask, where's the heat coming from? [00:20:01] How is this getting cooked? [00:20:02] Which now that I say it, of course, like Americans don't give a fuck how the thing gets accomplished or without it gets accomplished well, typically, uh, just that, uh, you know, they know what box to put the food in and then the button to hit, which is, you know, a little bit condescending, but, you know, y'all have earned it in my opinion. [00:20:20] Uh, so yeah, we got it. [00:20:22] It works. [00:20:22] Uh, uh, as far as I know, I turned it on the preheating started. [00:20:26] We have not yet, you know, broken the seal and actually cooked with it yet, but I'm glad, I'm glad to have that because I think, I think, I think. [00:20:32] Shit will turn out better, especially salmon, which is increasingly the number one thing that we were using our air fryer for, which was an inefficient, uh, use case. [00:20:40] Speaking of the parks being really busy, uh, and, and life here being overstimulating on Friday, I found myself really testing the fences on this new being 40 year old thing. [00:20:55] I, uh, got up at 5am with Becky. [00:20:59] We had a special event at Disney's Hollywood studios that started at six. [00:21:03] We got there. [00:21:04] There were other people there. [00:21:05] We went to bed early, you know, to, to, to, to be able to, to do this and not be super groggy and miserable, had a great time. [00:21:13] And then we had some friends coming into the park just about an hour after that, that, that event wrapped. [00:21:18] And so we went and visited with them for a little bit. [00:21:20] Then we came home and tried to recover some sort of a productive day by then it was noon. [00:21:25] Uh, and then that evening, cause the same friends that they had their big day, I wanted to debrief with, uh, uh, my buddy before he, uh, John, his name is John. [00:21:35] He is a listener of the program. [00:21:38] So hi, John. [00:21:38] Hello. [00:21:40] Uh, when to do debrief with him. [00:21:43] So we went over to a bar called trader Sam's, which is a grog grotto. [00:21:47] It's in the Polynesian resort hotel. [00:21:49] And it's one of my favorite bars because it's got like a lot of like little imagineering knickknacks and stage elements that, that have since become very common at Tiki bars. [00:21:58] But we got in there, we spent a couple hours and then pretty soon I realized, Oh fuck, it's midnight. [00:22:03] And I've literally been Disney it up to some extent, uh, since 6am. [00:22:10] And so, you know, I actually, I got a second wind in there, but I ultimately didn't get, get to bed until like two. 
[00:22:16] Uh, so that was a, it was a big day. [00:22:19] I feel like I did all right. [00:22:20] Uh, from an energy level perspective, I think I, I was the person that I needed to be in all of the interactions I had that day. [00:22:28] And that's probably the most I can say. [00:22:29] Uh, I'm simultaneously finding that my body is falling apart. [00:22:33] My, my, uh, left hip is pretty grumpy. [00:22:38] Uh, it's just some sort of like a constant dull discomfort, uh, feels like a dislocated shoulder, but no matter how much PT I do, [00:22:46] I, I, I seem to never fully, fully beat it. [00:22:49] Um, I need a smart, the smart oven equivalent for, for, uh, you know, muscle therapies that people do. [00:23:00] It's like, Oh, you can get some of the, it'll, it'll apply the icy hot and also, you know, drill you with a Theragun and also massage you and also use the, you know, resistant bands exercises to strengthen it. [00:23:09] Uh, just all simultaneously. [00:23:10] Cause it's like this round robin of, of attempts I've had to, to restore this fucking hip. [00:23:17] Uh, it has been great. [00:23:19] So that's been a constant thing. [00:23:21] New things are like my right knee now hurts like hell. [00:23:23] My left, my left heel, just the skin started cracking from how dry it's been here. [00:23:28] And of course it's still way more humid here than the rest of the nation, but apparently my skin is so used to the humidity, uh, that I just woke up one morning and it hurt to walk because all my skin was exposed because all my skin and my foot had cracked. [00:23:40] You know, like what the hell's going on? [00:23:42] So, uh, if you're, uh, approaching 40 and you're worried about it, good. [00:23:48] I don't know that I recommend it so far, uh, but I'm still here, still kicking. [00:23:53] Uh, uh, well, I, so far I almost didn't make it to be honest. [00:23:59] Uh, you know, well, I, if I'm going to talk about this next topic, uh, it's something that's come up in the show before. [00:24:09] And so I think that technically makes it follow up. [00:24:11] So let me hit this button right here. [00:24:13] Yeah. [00:24:20] So speaking of dying right before you turn 40, I, I'd mentioned that I four interstate four that runs east, west in, uh, through bisecting Orlando. [00:24:37] It's, uh, known to be, and I fact checked this against GPT cause I knew I'd probably end up talking about it. [00:24:45] Deadliest stretch of highway in the U S and you know, I'm a, I'm an experienced driver insofar as I've been driving for 24 years. [00:24:54] I don't like love it. [00:24:56] I'm not a car guy. [00:24:57] Uh, I, I feel like I drive fine, relatively safely, probably more on the conservative side. [00:25:05] Overall. [00:25:06] I do speed from time to time, but you know, as long as if you're in America and you're speeding, as long as you use the phrase flow of traffic, uh, you can do whatever you want. [00:25:17] And the problem is that when you live in theme park Orlando and you need literally anything that is not entertainment and hospitality related, uh, like for example, you know, I, I, and this is what puts this into the followup bucket of content. [00:25:35] Uh, I've been talking on and off about having, uh, struggling with snoring. [00:25:38] You know, I've been, uh, uh, doing that thing that a lot of middle-aged husbands start doing and deciding to interrupt their spouse's sleep by, by, by suddenly picking up this cool new habit. [00:25:49] That is just making wheezing sounds all night long. 
[00:25:53] And mine's really inconsistent. [00:25:56] It's clearly triggered by something. [00:25:57] Couldn't really tell what, you know, is it diet or whatever. [00:26:00] It's like clearly like none of the symptoms of apnea. [00:26:03] So that's probably not it. [00:26:04] Given that I feel fully rested after like four hours and I've never feeling short of breath. [00:26:08] Uh, you know, the new Apple watch has an apnea detection and it seems to not be detecting any apnea. [00:26:16] So I finally got a sleep study ordered and the doctor who is a very nice lady, she, you know, she's just like the reality of insurance right now is, uh, I will put in a request for an in, in a let in lab sleep study. [00:26:33] So we can watch you because the alternative is an at home sleep study. [00:26:36] And based on everything you're saying, there is a 0.0% chance that that at home sleep study is going to find anything. [00:26:44] Uh, and then I was like, well, then let's just do the in lab. [00:26:46] Like you're saying, well, she's like, oh, the insurance will surely deny based on what you're saying, uh, an in lab sleep study. [00:26:53] Uh, you have to do, you have to go through the motions of this at home sleep study first, and then it has to show nothing. [00:27:00] And then I can put in a script again for the in lab. [00:27:04] Uh, and, and then the prior authorization will go through and then you'll be able to do that. [00:27:09] And so I have to kind of do this performative nothing operation, just nothing like procedure, operation procedure. [00:27:18] It's over, you know, like diagnostic, you know, just to check some boxes and money is changing hands invisibly to me at every step. [00:27:27] Of course, for the most part, thanks, thanks to having health insurance. [00:27:30] So I, I, I schedule this and it's an at home sleep study. [00:27:36] Like there are services that mail these units, you know, they could ship it. [00:27:40] I could, I don't know, find a courier or something, but nope, this one, I have to drive to the other fucking side of Orlando, which is, you know, it's 20 miles, but it's like a 45 minute hour long adventure. [00:27:49] And I have to calling them the rules of the game were that I had to, uh, drive there Sunday night to pick it up, come back Tuesday night to drop it off. [00:28:00] And they, because of sleep study locations, this is like an actual, you know, testing center. [00:28:07] Uh, they literally open at 6 30 PM in the evening. [00:28:10] Uh, you know, so that's when their shift starts. [00:28:13] So I had to get there at 6 30. [00:28:15] So that means like, I'm basically fighting through rush hour into town and then pick it up and now I'm coming back home and now it's like eight. [00:28:22] So I guess I'll just eat dinner by myself or whatever. [00:28:25] Uh, and it's not like in a part of town where it's like, Hey, we can go downtown and like make a date, make a night date night out of it and go to like a fun restaurant. [00:28:33] It's like, this is a, I don't know what I, I have many times in this program suggested you should move to Orlando. [00:28:41] Orlando's great. [00:28:41] I love life in Orlando, but like whenever I leave the bubble of like theme park party time, Orlando, where everything's just really, really nice and customer service is incredible. [00:28:50] And the food's really great. [00:28:52] And, and it's just a party. [00:28:53] Uh, and I go to like real Florida. 
[00:28:56] I'm like, Oh yeah, I need to stop recommending people move to Orlando. [00:28:59] Cause this is like the median experience. [00:29:01] And I wouldn't, I would not, I can't do this for an hour. [00:29:05] I don't know how I would possibly live here. [00:29:07] No offense to Orlando, but I, uh, I went and I picked it up. [00:29:12] I drove my car there on Sunday night and traffic was pretty bad, but it's always pretty bad. [00:29:18] I had numerous cases of people jumping in front of the car on the way onto the highway. [00:29:23] Once I was on the highway, I get into the new express lanes, which do make things easier. [00:29:27] You pay a toll and you get, uh, you know, expedited traffic. [00:29:30] Um, and somebody had pulled over into the shoulder. [00:29:34] And as soon as he pulls over, he just whips open his, his driver's side door off of the shoulder. [00:29:41] And now the door is in my lane. [00:29:43] And there's of course, somebody on my left causing me to, uh, flip out and have to slam the brakes to, to the point of like, you know, bad enough that smoke is happening. [00:29:53] Right. [00:29:53] Like you can smell the burnt tire because this dude is just like, I'm on the highway. [00:29:57] I can open my door. [00:29:58] I'm a, I'm a big man. [00:29:59] I'm driving a truck. [00:30:00] So I chose not to blow his door off. [00:30:05] Uh, then on the way home, it was one of those ordeals where, uh, it's a, a sign said congestion, like eight, four miles ahead. [00:30:16] I was like, oh, four miles. [00:30:17] Okay. [00:30:17] Maybe I'll find an opportunity to take, get off the highway or I'll get onto the express lane and try to avoid it. [00:30:21] And, uh, Apple maps was saying I should turn right at the Kia center, which is like where the Orlando magic play. [00:30:27] And then take three more rights and then get back on the highway. [00:30:30] And I was like extremely convinced that this was just some sort of, you know, Apple maps fuckery. [00:30:36] Uh, and, and the nav and the computer being wrong because it often is, I was like, I'm going to stay on the highway. [00:30:42] I'm a smart guy and the instant that I passed that exit that it wanted me to take, everything became a parking lot and, and such a parking lot that it became road ragey pretty quickly with people driving and shoulders and honking and trying to edge each other out and motorcycles going between lanes. [00:30:58] And, and, and there's just a, you know, there's probably a metric that you could use for any civilization called like, uh, TTMM time to Mad Max. [00:31:10] And Florida has a very low TTMM, you know, it doesn't take long at all for every man for himself, uh, instincts to seemingly kick in. [00:31:22] So I, I did the rerouting and now, now the phone is telling me, all right, well, you know, literally it's so demoralizing. [00:31:32] You see the ETA to your home arrival move literally 40 minutes immediately because I chose not to take it's very wonky prescription of three right turns. [00:31:42] And now I realized in hindsight, the reason it wanted me to do that is there's a direct entrance onto the express lane. [00:31:47] And so not only did the ETA go up, not only do I have the regret that I didn't listen to the computer for, for telling me to do a stupid thing, but I also now am shamed by the insult on wounds here. [00:31:58] The left of me, the express lanes are wide open and there's just like five cars just having a great time going 80 miles an hour to get to where they want. 
[00:32:05] And everybody else is left in just this, this, this, this absolutely falling down style, uh, traffic jam, uh, or just after dark. [00:32:17] I did get home, I, I took a side street and it was one of those ordeals where you, you know, you take the side street, go up a couple of blocks, you go, you know, uh, turn left, kind of go, I don't know, maybe a half mile just past wherever, whatever accident was causing the congestion. [00:32:34] Then you get back on the highway. [00:32:34] And the problem was, of course, we all have automated navigation systems. [00:32:41] They all reroute us. [00:32:42] And so that was immediately backed up there that it was three traffic lights of people in the left lane, trying to, to turn onto that third traffic light. [00:32:52] And I, it would have been another 20 minutes just waiting for those light changes. [00:32:56] And so I just, you know, fortunately I had a brain and I was like, all right, I'm going to just blow past this and go in the right lane and drive forward three, three intersections and then do a U-turn turn right. [00:33:08] And then I, I successfully beat the rush and I got home and I, it merely only wasted 20 minutes of my time, but here, this story has already wasted five minutes of your time. [00:33:16] So it was death defying because even once off the highway, virtually none of those drivers had ever been on those side streets or in that neighborhood before. [00:33:27] And they were all driving like it and they were all driving like it and it was dark and there were not adequate streetlights. [00:33:31] So, uh, you know, it's not just that like Florida drivers are bad, but like you are surrounded by a certain number of frazzled dads who just picked up rental cards, cars from MCO, who are trying to get to their Disney hotel, who just had a flight delay, whose kids are screaming. [00:33:48] And nobody's happy like that is the default and that is the best case energy because like, you know, that's before you consider the, the, the capital F capital M Florida men and the tweakers and everyone else that just kind of contributes to this diverse fabric of society that we live in. [00:34:08] So, uh, that was a bad experience. [00:34:12] I, I did get home, you know, I am still with us, but by the time I got home, I was, I was so fried. [00:34:18] Like I, I, I, I, I didn't want to hang out. [00:34:22] I didn't want to talk to Becky. [00:34:22] Just wanted to like pour a whiskey and collapse. [00:34:25] Uh, the stress level is so high. [00:34:28] Like, and you can, I looked at my watch, right. [00:34:30] And I was looking at like the heart rate history and I was like, you know, I was white knuckling it. [00:34:34] Um, and that's, and that's partly on me, right? [00:34:36] Like I just, I don't, I don't like that kind of driving. [00:34:39] I don't like that stress. [00:34:39] Two days later, when I had to drop this device off, uh, the device itself was terrible, by the way, it was probably less sophisticated than my Apple watch and probably reading like less accurate, uh, heart rate. [00:34:57] And, and even the, the modern Apple watch like does track breathing. [00:35:00] That's how it does a sleep apnea thing, uh, uh, through the magic of gyroscopes. [00:35:05] And, uh, this device is a piece of shit and I'm sure somehow the rental fee for, for a one-time use was $1,500 to my insure. [00:35:12] Uh, and I'm sure it found nothing. [00:35:15] I can totally, like, I don't know how it would find anything. 
[00:35:17] Uh, it looked like it was built out of, you know, Teddy Ruxpin era, you know, technology in the mid eighties with, with the, the quality of the, the, the straps and the plastic. [00:35:29] I could just, but when I had to, when it, when time came to drop it off, I really did not want to repeat that experience on a weeknight when you, you know, traffic would be even worse. [00:35:41] And so I, I humbly asked my brother who has a Tesla, I said, Hey, uh, there's another follow-up item. [00:35:48] We, we, we, we picked it up together just in October. [00:35:51] I think, uh, I said, Hey man, like, can I swing by or you swing by drop off your Tesla? [00:35:59] He did some stuff to do at our house anyway. [00:36:01] And he's got the full self-driving like, like, uh, they keep renewing a 30 day trial for him. [00:36:09] And, uh, you know, full self-driving isn't, it is, uh, the car will drive itself. [00:36:14] You don't have to touch the wheel. [00:36:16] It, it, it, it, it's very conservative. [00:36:18] It has three modes, chill, uh, normal and hurried or hurry. [00:36:23] I've never tried hurry. [00:36:24] I don't need to try hurry. [00:36:26] I just stick on chill because at the end of the day, as long as I get to where I'm going, [00:36:29] I sort of don't care. [00:36:30] I'm not in a big rush. [00:36:32] Uh, I have the luxury of not needing to be anywhere in any particular pace. [00:36:37] As long as I leave on time, you know, I'm, and I'm going to get there by the time I promise [00:36:41] the chill is good with me and the, you have to supervise it. [00:36:48] And it was the case when the full self-driving crap and Tesla's first hit that people were, [00:36:55] you know, at first it was just like pressure testing the steering column. [00:36:58] And so people would like use like, uh, uh, weights, like, like weighted wristbands and [00:37:04] stuff to like make it trick the steering column into thinking that somebody was holding onto [00:37:08] the wheel. [00:37:08] Uh, and now they have cameras that look at you like inside the cabin and that, that camera [00:37:15] is using some amount of intelligence to determine that you're distracted or not. [00:37:19] So if you are looking a lot at the central, uh, tablet, it'll bark at you and say, Hey, pay [00:37:23] attention to the road. [00:37:25] If you're looking at your phone, it'll do the same. [00:37:26] If you're looking at a watch, you know, like I've had it even like when I'm talking to the [00:37:30] watch and looking forward, have it bark at me. [00:37:31] And as soon, as soon as it does it, it makes a beep and then it gets increasingly aggressive [00:37:36] and beeps louder. [00:37:37] You impressively. [00:37:39] I say this because like, you know, I'm sure that the reason it's like this is because Tesla [00:37:43] is trying to minimize it's like legal liability for accidents caused by its system. [00:37:47] If, if, if, if you ignore its beeps three times in a day, uh, you, you get a strike, the system [00:37:56] will disengage and you will be forced to manually drive your car like a plebeian for the rest [00:38:01] of the day. [00:38:01] At least that's how Jeremy explained it to me. [00:38:03] If you get five strikes, I want to say it is, um, you're just exited from your, you're ejected [00:38:12] from the full self-driving program. [00:38:14] And I am impressed not only that it's as aggressive as it is, like, you know, if you got to look [00:38:22] at the screen for something, you've got to adjust it. 
[00:38:23] You basically have seven or eight seconds to, you know, fix the mirrors or whatever it is [00:38:28] before you got to be looking at the road again. [00:38:29] I'm also like finding myself that when I'm driving his vehicle, I actually am significantly less [00:38:36] distracted than in my own Ford escape, which has car play. [00:38:39] And I typically don't touch the phone itself, but I, um, you know, I tune out a little bit [00:38:44] or, uh, you know, might look at something or might be tapping away at the, uh, you know, [00:38:49] the eye messages and, and, and, and whatnot seemingly longer in those cases than like what the Tesla [00:38:55] would let me get away with. [00:38:56] So I'm paying more attention to the road because the computer is telling me to, or forcing me [00:39:01] to, and I am also doing less of the driving. [00:39:05] So, you know, my foot's off the pedal, my foot, my hands are off the steering. [00:39:08] And when they say supervised, it's actually like the right word, like it is doing the [00:39:14] driving, but like the, it feels almost like a pilot co-pilot thing where I, your head's [00:39:22] on a swivel. [00:39:23] Like I can look to the left and I can look to the right and I have far greater situational [00:39:27] awareness as the car is driving. [00:39:28] Now, granted a lot of these like semi-autonomous and, and adaptive, you know, uh, uh, uh, assistance [00:39:35] in cars will for most people lull them into a false sense of security and result in further [00:39:44] driver inattentiveness and unsafety, right? [00:39:46] Like people will, you'll train them out of the vigilance that you need at all times when [00:39:52] you're the one driving a vehicle or being driven in a vehicle. [00:39:55] However, like the particular, and maybe it's just cause I'm kind of coming in and chapter [00:40:00] four of this particular saga of full self-driving and robo taxis will be here in six months as [00:40:05] Elon Musk. [00:40:06] And of course they're not there, but it seems like at least the way that I've experienced [00:40:13] full self-driving when I've used it, it seems to me like I feel a thousand times safer because [00:40:21] the combination of the car, mostly doing the right thing, mostly making the conservative [00:40:25] choice, absolute worst case. [00:40:27] It haunt, it blares at you and you need to take over, uh, combined with my own hypervigilance [00:40:35] of not, you know, I constitutionally do not trust computers and you know, Jeremy doesn't [00:40:41] either. [00:40:42] And so when we're driving these things, we're looking around all the time where we're, we're, [00:40:45] we're sort of, because we have a curiosity and how the technology works, like trying to think [00:40:49] about how is it thinking through this? [00:40:51] Like, like we have a lot of, for example, um, automated gated communities where like the, [00:40:56] the gates will open and closed when you're, when you're entering and exiting. [00:41:00] It's like, we, we look at the little like computer screens, like how does it, how does it, what [00:41:04] does it think is in front of it right now? [00:41:05] It sees that there's an obstruction. [00:41:07] Uh, and if it opens too slowly, is it thinking it's a permanent obstruction or is it going to [00:41:11] wait and then proceed after the thing opens automatically? 
[00:41:14] Like there's a lot of little moments like that, where it's actually kind of interesting [00:41:17] to see how, you know, how the car reacts and then it gets a software update and then how [00:41:22] the car reacts after that. [00:41:23] And then additionally, there's the typical ebb and flow of software updates generally where [00:41:28] there's regressions, right? [00:41:29] Like there was a version of this, uh, system that, that the ability, like it used to blow [00:41:35] past this one particular speed bump, uh, uh, near our neighborhood, uh, because it didn't [00:41:41] have sufficient paint on the road to indicate that it was a speed bump. [00:41:45] And then there was a software update and then it perfectly negotiated all four speed bumps [00:41:49] just right in a row every single time. [00:41:52] And then there was another update and now it blows past the third speed bump again. [00:41:56] And so, uh, I think that people who are technology enthusiasts who maybe follow this stuff and [00:42:05] understand how, what software is, how it works, that updates are not a pure linear, you know, [00:42:11] march of progress, I think the idea that there would be regressions in software releases or [00:42:18] even, uh, non-determinism in how the, how the computer car operates, that's totally natural [00:42:24] to me. [00:42:24] And I expect it now. [00:42:25] I, I grown at it and I think like, this is, this is probably a bad idea in aggregate and [00:42:31] at a population level. [00:42:33] I suspect that the average driver would be confused by that the same way that like the [00:42:38] average person is terrified of updating their phone or their computer because they associate [00:42:43] software updates with, uh, uh, you know, newness and unawareness and, and, and, and, and, and all [00:42:51] the things that they finally had working, no longer working. [00:42:54] And when they, but when you talk about the, the march of progress and technology, they sort [00:43:00] of have a, what it is, is whenever anything goes wrong with technology, if you're not, if [00:43:08] you're not primed to know that it's burning you is, it seems like people mostly blame themselves [00:43:13] instead of blaming the technology. [00:43:15] And if that's your, if that's the way you use your phone or your computer, uh, you [00:43:21] know, when, when the car makes a mistake, you might not realize it as a car making mistake [00:43:26] and you might not have the hypervigilance. [00:43:27] That's like, you know, a more adversarial, like, like, I feel like I'm constantly spot checking [00:43:31] it. [00:43:31] And I, and while I am surprisingly impressed with how well it's been negotiating everything [00:43:37] that we've thrown at it so far, it's made one or two mistakes and I've, I've, I've, [00:43:41] I've, I've dealt with it, but on net, like it's driving waste. [00:43:45] Way more safely than I am way. [00:43:47] And it's, it's taught me a few things. [00:43:49] It's like, Oh yeah. [00:43:49] Like whenever I do this at an intersection, like that's really dumb. [00:43:52] Like it's doing this way better. [00:43:53] Uh, I can't think of a specific example, but like, I'm pretty impressed. [00:43:58] And so I thought, well, I'll ask Jeremy to borrow the car because I've got this natural [00:44:03] experiment now, same time of day, uh, same location. [00:44:07] So I already know how to get there. 
[00:44:08] It's a, it's a little bit goofy, but like, because I was just there, I'm not going to feel [00:44:12] like I'm learning how to get, get there and also learning how to use this. [00:44:15] Auto driving system simultaneously. [00:44:17] And, uh, holy shit. [00:44:20] Like, yes, I had people jump out in front of the car. [00:44:23] It was even worse this time at the particular intersection before you get to the, to, to [00:44:27] I four and the car like saw them out of its blind spot while it was turning, right. [00:44:32] It saw them on the left camera and breaks perfectly. [00:44:37] Uh, and I, uh, my first reaction was like, I would not have caught that. [00:44:40] I probably would have cut it real close. [00:44:44] Uh, almost hitting these people. [00:44:45] Uh, you get onto the highway and then this is why I emphasize like I four is like the deadliest [00:44:51] highway in America because it's, it is, it is not like driving on the highway, wherever [00:44:59] the fuck you live like anywhere I was ever in Michigan or Ohio or anywhere else in the [00:45:04] U S or certainly anywhere I've driven in Japan. [00:45:06] Those are the only places I suppose I've driven or Canada. [00:45:09] Like, yes, sometimes it's a little stressful driving on the highway. [00:45:12] Like that's not what this is. [00:45:14] This is, you have to practice extreme defensive driving. [00:45:18] And if you actually want to get where you're going, you also have to practice offensive [00:45:21] driving. [00:45:21] Uh, so having, uh, you know, nine cameras and nine directions is just necessary for basic [00:45:28] like assurance of survival. [00:45:31] Like when I'm on I four, I, I feel constantly under threat. [00:45:35] Uh, and something happens every time. [00:45:39] So we get on the highway and that stuff does happen. [00:45:42] Uh, you know, the car on its own decided to take the express lanes by itself, which was [00:45:46] incredible, but like people were like, I was trying to merge into a lane. [00:45:50] And then as, as the things, well, it was trying to merge into a lane. [00:45:53] And as it was changing lanes, somebody who didn't even have a blinker on starts edging in [00:45:58] and the car knows I'm going to back off. [00:45:59] Uh, there was another case of somebody swerving into our lane, like very close to the car and [00:46:05] the car, you know, defensively, you know, switch to the right lane, which was wide open [00:46:11] to prevent the risk that like, you know, it might have to break. [00:46:14] Suddenly there wasn't enough distance between the cars. [00:46:16] And that was stuff that like, I only was actually even able to piece together. [00:46:19] What the fuck was it doing after the fact? [00:46:20] Like looking at the map and looking around me, it's just, it went great. [00:46:28] Got there, dropped the shit off, turned around, you know, the parking is wonderful too, because [00:46:34] it'll back into every parking spot. [00:46:36] You just tap the screen. [00:46:37] Like it'll see the parking spots. [00:46:38] You just tap which one you want and just, it handles it for you. [00:46:40] It parks way better than I park. [00:46:42] I don't know, man. 
[00:46:43] And on the ride home, not only, you know, everything around me felt like it was on fire and chaos, [00:46:50] but because I had a buddy who was doing the driving and I could just kind of be, you know, [00:46:54] patrolling and looking around, I actually got a, a low heart rate notification on my watch, [00:47:00] which I get, I get them frequently. [00:47:01] Cause I have a low resting heart rate, but like it would say, Hey, your, your heart rate's [00:47:05] been under 40 beats per minute for the last 10 minutes. [00:47:08] And, uh, which I, if that's not you, that's like, if that's not typical for you, that might [00:47:14] sound scary, but like, no, my, my resting heart rate when I'm actually like de-stressed and, [00:47:17] and just chill is like typically like 38. [00:47:20] So the fact that I could be on I4 with a heart rate under 40 feeling completely safe more than [00:47:27] anything, it's not about going fast or whatever. [00:47:29] It's like feeling like I've got a team of two that are dedicated to getting me home safely, [00:47:32] me and this computer. [00:47:34] Uh, it was a revelatory experience now that look, I realized it's a complicated situation [00:47:44] because Elon is a big old bucket of assholes and the politics of it are all fucked. [00:47:50] Uh, you know, the right time to buy a Tesla was, was when, uh, everyone agreed that, that [00:47:54] they were cool and EVs were good and the planet deserves saving. [00:47:57] Uh, but yeah, I got, I totally saw where, where my brother was coming from and all of his friends [00:48:03] who, who, who, who are similar technologists who, who have these things and who are, you [00:48:07] know, who got on board in the very recent hardware three or hardware four era of Tesla. [00:48:12] Um, particularly with like the, the, the entry level models that are higher volume and therefore [00:48:17] kind of more, uh, consistently produced, you know, the cyber truck, for example, more, most [00:48:26] expensive, but lowest volume and has the most problems. [00:48:29] The model Y at this point is pretty boring and dull, but like, you know, if, if you, if [00:48:34] you are like me and just kind of think of cars, the modern day car is just a tablet with wheels. [00:48:40] This is a, you know, and I, yes, I had, I had low expectations. [00:48:46] I had a high level of suspicion, but it went great. [00:48:48] And, uh, uh, I, I, I successfully dropped off my snoring thing. [00:48:55] I can't wait to get the results. [00:48:57] That'll tell me that, uh, you know, nothing happened. [00:48:59] Another bit of follow-up. [00:49:01] I think I'd mentioned that I, uh, I had used rocket money. [00:49:05] So, you know, it used to be called true bill and then quick and loans bought it. [00:49:08] And, uh, the, as quick and loan started branding itself as rocket and having this rocket suite [00:49:13] of products, rocket money became, it's, you know, a consumer entree into upselling it to [00:49:18] other products and rocket monies, you know, promises. [00:49:21] It's going to help you, uh, visualize all your subscriptions and even negotiate a tiny, tiny [00:49:27] sliver of those subscriptions. [00:49:28] And the one that I yielded to it was my spectrum account. [00:49:32] So my ISP had, had gradually been charging me more and more to the point where it was [00:49:36] like $145 after tax every month for the same internet program. [00:49:39] That was like a hundred dollars when I moved here. 
[00:49:41] And I was very skeptical when rocket money said, Hey, we just saved you $893 a year, uh, by, [00:49:48] by lowering your monthly bill to 70 bucks. [00:49:50] And they sent me a new modem as well. [00:49:53] And I was like, I don't need a new modem. [00:49:55] It's the, it's, it's the model number. [00:49:56] It looks almost identical. [00:49:57] And I, I was actually at UPS returning that modem. [00:50:01] And I just thought to myself, what if this modem is somehow better? [00:50:04] Cause I had not been super blown away by the performance of my current one. [00:50:09] And so I, I went to the trouble of unplugging the old one, plugging in the new one, setting [00:50:13] it up, calling to activate and it, my, my connection now is rock solid. [00:50:19] So, so just by doing this price hack thing, I now have a modem that works way better. [00:50:23] I was able to activate it myself without having some tech come over here. [00:50:25] So that's a, that's a win, but the statements were still showing up $140. [00:50:29] And I was really skeptical that like this would materialize, but sure enough, this week I got [00:50:35] a statement for $70. [00:50:36] Uh, and I guess that means I owe rocket money 35% of whatever it saved me. [00:50:42] And I don't know how that's, I don't know how that's paid or when that works. [00:50:45] I'll figure it out. [00:50:47] But if you're, if you're willing to, basically I would recommend rocket money to anyone who [00:50:52] is currently paying sticker price for whatever utilities, it's probably mostly ISPs and cell [00:51:00] phone bills. [00:51:01] If you're paying for like a normal plan that is still available and you're paying top dollar, [00:51:06] uh, call them, give it a try. [00:51:08] But if you're like, you know, like I am with T-Mobile grandfathered in on some 12 year old [00:51:13] plan that has been replaced five times. [00:51:15] And there's no like, like the most likely case then is it's going to put me on the latest plan [00:51:19] and sign me up for all of the new throttling and four ADP video and the shit that you don't [00:51:24] want, uh, in terms of limitations. [00:51:26] So check out rocket money. [00:51:30] I, I, I was extremely skeptical and now this is, this is a rocket money ad. [00:51:34] Uh, although it is unpaid. [00:51:36] If you want to be a sponsor of the program podcast at seerls.co, uh, another followup item. [00:51:47] I, let me tell you what it took to connect. [00:51:53] My Xbox controller to my, to my gaming PC. [00:51:58] So, uh, I have an Xbox series elite to whatever you call it. [00:52:04] A nice, the fancy Xbox controller that costs like $170. [00:52:07] And I like this controller. [00:52:09] It's got the little paddles in the back. [00:52:11] It's got, you know, a nicer grip, uh, interchangeable thumb sticks and D pad and stuff. [00:52:16] It's a very nice product, but it's, it's, you know, talk about low volume things that [00:52:21] aren't as reliable. [00:52:21] It has a lot of reliability issues and my right bumper button, like next to the right [00:52:27] shoulder, it had been like very, very, um, it would miss like 70% of the clicks. [00:52:36] And because the right bumper isn't the most important button in the world. [00:52:39] Like it just meant like, uh, I guess I'm just not the kind of guy to throw grenades or whatever [00:52:43] the right bumper is typically assigned to, I got a replacement relative, like a, a, a cheap [00:52:50] replacement through Microsoft support channel. 
[00:52:52] I think they charged me $70. [00:52:53] They didn't require me to ship back the old one. [00:52:55] Uh, the replacement came and I plugged it into the computer to start set up and pairing. [00:53:00] And the Xbox accessories app was like, this is too out of date to be able to configure your [00:53:06] controller, which was weird because windows update, which I checked frequently had said [00:53:10] that I was up to date, but there was a little message at the bottom saying, uh, windows is [00:53:16] up to date. [00:53:16] Important security updates have not been applied. [00:53:19] Make sure that your computer is turned on, which is weird because if I'm manually updating [00:53:22] and nothing's saying that it's like, where are these secret security updates that aren't [00:53:26] happening? [00:53:26] And when I dug into my actual windows version, it said I was on 21 H two. [00:53:32] So the naming scheme for these major windows releases seems to be the, the two digit year [00:53:39] followed by H one for first half of the year and H two for second half of the year, which [00:53:44] is, um, real dumb. [00:53:47] I'm going to say just a dumb way to name things, you know, numbers are good. [00:53:52] You know, I, I, I get it now why it's named that. [00:53:56] But 21 was, uh, if you, if you decode the version several, several numbers ago, it was [00:54:02] three, at least it was at least two H one ago. [00:54:05] And why was I on such an old version? [00:54:10] It turns out I'll share like a, an article from, from just December, the, the windows 11 [00:54:16] required computers to have secure boot enabled using the trusted platform module or TPM equivalent [00:54:22] encryption. [00:54:23] And that's to certify or to be able to attest that like the, the operating system has not [00:54:28] been tampered with and so forth. [00:54:29] And then this has all sorts of like DMCA, DR, DRM, um, uh, and, uh, HDCP, all this sort [00:54:36] of a content encryption, copyright protection, uh, ostensibly it's quote unquote security. [00:54:41] And it, and it's the, like making sure from a malware perspective that the veracity of [00:54:45] the system files are all in place and so forth. [00:54:47] But like a lot of nerds were not on board because they want to rip blue waves or whatever it is. [00:54:51] And this might make it marginally more difficult, but gaming motherboards were like the last ones [00:54:57] to the party to support secure boot. [00:54:59] And even though I built my gaming PC, well, after windows 11 launched the BIOS that it [00:55:04] shipped with did not support secure boot. [00:55:06] Um, it didn't support, uh, I don't think like booting from UEFI drives correctly either. [00:55:13] So I'd set it up just like a normal basic fucking computer and it worked for however long it [00:55:18] worked. [00:55:18] But apparently in December, Microsoft was just like, and you get no more updates at all. [00:55:22] No more security updates, no more, nothing, which is why I started getting that message. [00:55:25] Uh, if you want to be on the latest and greatest version of windows 11, you must have secure boot. [00:55:30] Problem now is like, it's been several years. [00:55:34] And so figuring out what kind of motherboard I even have, I'm too lazy to like open the case [00:55:38] up and look at it. [00:55:39] And so I, I found the particular model number in my Amazon orders. [00:55:42] So step one, you know, I figured out what was happening. 
[00:55:45] I guess step, step zero is I get this new controller and I immediately regret it. [00:55:49] Uh, step two, figure out what's happening. [00:55:52] Step three, check my Amazon orders, identify the motherboard. [00:55:55] Uh, step four, I went to the motherboard website. [00:55:58] I find that there, a BIOS update is available and it's, it adds the secure boot functionality [00:56:03] because apparently the encryption software hardware is on the device, which is great. [00:56:07] So I download the BIOS and then I start flashing it. [00:56:12] Uh, not, you know, not that kind of, get your head out of the gutter. [00:56:15] I, it, it requires, uh, you know, identifying there's a, there's a particular USB port on [00:56:23] the back of the, of the motherboard. [00:56:25] That is the only one that can flash the BIOS and you have to look for it. [00:56:30] This is like M dash flash on it. [00:56:31] So you put it in there, you know, you restart, you, uh, boot into the BIOS and I, uh, got [00:56:39] it to update that, that part was actually pretty easy. [00:56:41] Then you go into the, the BIOS and it, you know, I don't know what BIOS stands for. [00:56:45] So if you're not like a PC person, this might not make sense, but you, you, the, the, it's, [00:56:49] it's the little bit of software that runs before the computer really starts. [00:56:52] And you can typically get there by hitting a key like F12 or delete. [00:56:55] And it's, you know, if you weren't raised on windows, uh, it's, it's, it's a weird [00:56:59] under, underbelly that sometimes you have to go into. [00:57:02] It's got a lot of arcane settings. [00:57:04] None of them make any sense. [00:57:05] It's a lot of acronyms that aren't explained, even though modern BIOS systems typically have [00:57:09] tooltips, it'll be like, what is, you know, what is MDR? [00:57:12] And it's like this, this option determines whether you have MDR turned on and off. [00:57:16] And there's like room for two more paragraphs to just maybe spell out what the fuck MDR is. [00:57:20] Uh, I turned on the secure boot, figure that out. [00:57:25] Uh, chat GPT is wonderful for stuff like this. [00:57:27] Like it gave me step-by-step directions because like, there's probably 800 forum, forum posts, [00:57:31] like detailing the same thing. [00:57:33] Uh, after reboot, nothing worked and like the computer would not boot. [00:57:39] I turned on secure boot, which required turning on UEFI, which is like a related technology of [00:57:44] like a more modern boot system for computers. [00:57:46] And it turns out it's because that my drive partition map is master boot record MBR, which [00:57:51] is like from the DOS era. [00:57:53] And that was the default when I set it up in 21 or 2020. [00:57:56]
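The steps being described here — checking the Windows feature release, confirming Secure Boot, and validating an MBR-to-GPT conversion — can be sketched roughly as follows. This is a hedged illustration only, assuming a Windows 10/11 machine, an elevated prompt, and the built-in mbr2gpt.exe tool and Confirm-SecureBootUEFI cmdlet; the episode does not say these exact commands were used.

# Hedged sketch: the three checks discussed above, assuming Windows 10/11 and admin rights.
import subprocess
import winreg

def windows_feature_release() -> str:
    # "DisplayVersion" holds strings like "21H2" on recent Windows builds.
    key_path = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        return winreg.QueryValueEx(key, "DisplayVersion")[0]

def secure_boot_enabled() -> bool:
    # Confirm-SecureBootUEFI prints True/False; it errors out on legacy BIOS boot.
    out = subprocess.run(
        ["powershell", "-NoProfile", "-Command", "Confirm-SecureBootUEFI"],
        capture_output=True, text=True)
    return out.stdout.strip().lower() == "true"

def disk0_convertible_to_gpt() -> bool:
    # /validate only tests whether conversion is possible; /convert performs it.
    out = subprocess.run(
        ["mbr2gpt", "/validate", "/disk:0", "/allowFullOS"],
        capture_output=True, text=True)
    return out.returncode == 0

if __name__ == "__main__":
    print("Windows feature release:", windows_feature_release())
    print("Secure Boot enabled:", secure_boot_enabled())
    print("Disk 0 can convert MBR -> GPT:", disk0_convertible_to_gpt())

If the validate step passes, the usual path is mbr2gpt /convert followed by switching the firmware from legacy/CSM boot to UEFI with Secure Boot enabled, though the exact toggles vary by motherboard vendor.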

Radio Sentai Castranger
Radio Sentai Castranger [519] Bill Nyelv ft Chris WHAT?

Radio Sentai Castranger

Play Episode Listen Later Jan 18, 2025 110:27


It's time for emotional whiplash as we go from happy to sad with this week's double-serving of the Boonboomger movie and this week's episode. But first we discuss Vram issuing his debut beatdown to Gavv and Valen.   Casters Present:  Blue Gray Yellow North  Show Notes: https://www.patreon.com/posts/120242594 Required Viewing: Kamen Rider Gavv 18,  Boonboomger GekijoBOON Promise The Circuit,  Bakuage Sentai Boonboomger 44  Watch on YouTube: https://www.youtube.com/watch?v=9fItYrjzE1c   Hungry? Get CA$15 off your first 3 UberEats orders of CA$20 or more! https://ubereats.com/feed?promoCode=eats-christopherm5931ue Get $5 off your first order with SkipTheDishes! https://www.skipthedishes.com/r/6YaJc65HKg

Broken Silicon
291. AMD RX 9070 XT Leak, Nvidia RTX 5090 Performance, DLSS 4 Claims | Alderon Games

Broken Silicon

Play Episode Listen Later Jan 8, 2025 162:30


A Game Dev joins to discuss how realistic Nvidia's DLSS 4 claims will be, and we have RDNA 4 updates! [SPON: Use "brokensilicon“ at CDKeyOffer for $23 Win11 Pro: https://www.cdkeyoffer.com/cko/Moore11 ] [SPON: Use “brokensilicon” to get $30 OFF the Minisforum V3 3-in-1 Tablet: https://shrsl.com/4rt3x ] 0:00 Intel Raptor Lake Failures Update 13:45 Nvidia RTX 5090, 5080, 5070 Ti, and 5070 Thoughts 33:38 Will DLSS 4 work as well as stated? 42:10 Will "Neural Compression" actually fix Nvidia's VRAM issues? 47:30 FSR 4 vs DLSS 4 vs XeSS2, Intel Battlemage's Future 1:04:12 Why does Sony's PSSR references XeSS in Code? 1:09:22 (NEW LEAK) AMD RX 9070 XT & 9070 Release Date Update 1:21:45 RDNA 4 Pricing, Nvidia Marketing Trapped AMD 1:30:47 AMD vs NVIDIA Ray Tracing 1:39:12 Nintendo Switch 2 Performance Analysis 1:53:37 Windows on (Qualcomm) ARM 2:04:20 Nvidia's ARM APU could actually be REALLY good! 2:10:01 Linux Support & Anti-Cheat Issues 2:26:21 Intel's CES Keynote Last time Matt was on: https://youtu.be/rkVSgix0L38?si=KK4Szr9VVl0Bisjw https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/ https://research.nvidia.com/labs/rtr/neural_texture_compression/ https://youtu.be/07UFu-OX1yI?t=218 https://alderongames.com/

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
2024 in Post-Transformers Architectures (State Space Models, RWKV) [LS Live @ NeurIPS]

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Dec 24, 2024 43:02


Happy holidays! We'll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all, all our LS supporters who helped fund the gorgeous venue and A/V production! For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver. Of perennial interest, particularly at academic conferences, is scaled-up architecture research as people hunt for the next Attention Is All You Need. We have many names for them: “efficient models”, “retentive networks”, “subquadratic attention” or “linear attention”, but some of them don't even have any lineage with attention - one of the best papers of this NeurIPS was Sepp Hochreiter's xLSTM, which has a particularly poetic significance as one of the creators of the LSTM returning to update and challenge the OG language model architecture. So, for lack of a better term, we decided to call this segment “the State of Post-Transformers” and fortunately everyone rolled with it. We are fortunate to have two powerful friends of the pod to give us an update here: * Together AI: with CEO Vipul Ved Prakash and CTO Ce Zhang joining us to talk about how they are building Together together as a quote unquote full stack AI startup, from the lowest level kernel and systems programming to the highest level mathematical abstractions driving new model architectures and inference algorithms, with notable industry contributions from RedPajama v2, Flash Attention 3, Mamba 2, Mixture of Agents, BASED, Sequoia, Evo, Dragonfly, Dan Fu's ThunderKittens and many more research projects this year. * Recursal AI: with CEO Eugene Cheah who has helped lead the independent RWKV project while also running Featherless AI. This year, the team has shipped RWKV v5, codenamed Eagle, to 1.5 billion Windows 10 and Windows 11 machines worldwide, to support Microsoft's on-device, energy-usage-sensitive Windows Copilot usecases, and has launched the first updates on RWKV v6, codenamed Finch and GoldFinch. On the morning of Latent Space Live, they also announced QRWKV6, a Qwen 32B model modified with RWKV linear attention layers. We were looking to host a debate between our speakers, but given that both of them were working on post-transformers alternatives… Full Talk on Youtube: Please like and subscribe! Links: All the models and papers they picked: * Earlier Cited Work: * Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention * Hungry hungry hippos: Towards language modeling with state space models * Hyena hierarchy: Towards larger convolutional language models * Mamba: Linear-Time Sequence Modeling with Selective State Spaces * S4: Efficiently Modeling Long Sequences with Structured State Spaces * Just Read Twice (Arora et al) * Recurrent large language models that compete with Transformers in language modeling perplexity are emerging at a rapid rate (e.g., Mamba, RWKV). Excitingly, these architectures use a constant amount of memory during inference. 
However, due to the limited memory, recurrent LMs cannot recall and use all the information in long contexts leading to brittle in-context learning (ICL) quality. A key challenge for efficient LMs is selecting what information to store versus discard. In this work, we observe the order in which information is shown to the LM impacts the selection difficulty. * To formalize this, we show that the hardness of information recall reduces to the hardness of a problem called set disjointness (SD), a quintessential problem in communication complexity that requires a streaming algorithm (e.g., recurrent model) to decide whether inputted sets are disjoint. We empirically and theoretically show that the recurrent memory required to solve SD changes with set order, i.e., whether the smaller set appears first in-context. * Our analysis suggests, to mitigate the reliance on data order, we can put information in the right order in-context or process prompts non-causally. Towards that end, we propose: (1) JRT-Prompt, where context gets repeated multiple times in the prompt, effectively showing the model all data orders. This gives 11.0±1.3 points of improvement, averaged across 16 recurrent LMs and the 6 ICL tasks, with 11.9× higher throughput than FlashAttention-2 for generation prefill (length 32k, batch size 16, NVidia H100). We then propose (2) JRT-RNN, which uses non-causal prefix-linear-attention to process prompts and provides 99% of Transformer quality at 360M params., 30B tokens and 96% at 1.3B params., 50B tokens on average across the tasks, with 19.2× higher throughput for prefill than FA2.* Jamba: A 52B Hybrid Transformer-Mamba Language Model* We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. * Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. * This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU.* Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. * We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.* SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers* We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. 
Core designs include: * (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. * (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. * (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. * (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. * As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. * RWKV: Reinventing RNNs for the Transformer Era* Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. * We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.* Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintains constant computational and memory complexity during inference. * We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.* LoLCATs: On Low-Rank Linearizing of Large Language Models* Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. * We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitudes less memory and compute. * We base these steps on two findings. * First, we can replace an LLM's softmax attentions with closely-approximating linear attentions, simply by training the linear attentions to match their softmax counterparts with an output MSE loss ("attention transfer").* Then, this enables adjusting for approximation errors and recovering LLM quality simply with low-rank adaptation (LoRA). * LoLCATs significantly improves linearizing quality, training efficiency, and scalability. 
We significantly reduce the linearizing quality gap and produce state-of-the-art subquadratic LLMs from Llama 3 8B and Mistral 7B v0.1, leading to 20+ points of improvement on 5-shot MMLU. * Furthermore, LoLCATs does so with only 0.2% of past methods' model parameters and 0.4% of their training tokens. * Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50x larger than prior work). * When compared with prior approaches under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between linearized and original Llama 3.1 70B and 405B LLMs by 77.8% and 78.1% on 5-shot MMLU.Timestamps* [00:02:27] Intros* [00:03:16] Why Scale Context Lengths? or work on Efficient Models* [00:06:07] The Story of SSMs* [00:09:33] Idea 1: Approximation -> Principled Modeling* [00:12:14] Idea 3: Selection* [00:15:07] Just Read Twice* [00:16:51] Idea 4: Test Time Compute* [00:17:32] Idea 2: Hardware & Kernel Support* [00:19:49] RWKV vs SSMs* [00:24:24] RWKV Arch* [00:26:15] QWRKWv6 launch* [00:30:00] What's next* [00:33:21] Hot Takes - does anyone really need long context?Transcript[00:00:00] AI Charlie: We're back at Latent Space Live, our first mini conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co host. As a special treat this week, we're recapping the best of 2024 going domain by domain. We sent out a survey to the over 900 of you who told us what you wanted, and then invited the best speakers in the Latent Space Network to cover each field.[00:00:24] AI Charlie: 200 of you joined us in person throughout the day, with over 2200 watching live online. Thanks Our next keynote covers the State of Transformers alternative architectures, with a special joint presentation with Dan Fu of Together AI and Eugene Chia of Recursal AI and Featherless AI. We've featured both Together and Recursal on the pod before, with CEO Veepal Vedprakash introducing them.[00:00:49] AI Charlie: And CTO CE Zhang joining us to talk about how they are building together together as a quote unquote full stack AI startup from the lowest level kernel and systems [00:01:00] programming to the highest level mathematical abstractions driving new model architectures and inference algorithms with notable industry contributions from Red Pajama V2, Flash Attention 3, Mamba 2, Mixture of Agents.[00:01:15] AI Charlie: Based, Sequoia, Evo, Dragonfly, Danfoo's Thunder Kittens, and many more research projects this year. As for Recursal and Featherless, we were the first podcast to feature RWKV last year, and this year the team has shipped RWKV v5, codenamed Eagle, to 1. 5 billion Windows 10 and Windows 11 machines worldwide to support Microsoft's on device, end Energy Usage Sensitive Windows Copilot Use Cases and has launched the first updates on RWKV v6, codenamed Finch and Goldfinch.[00:01:53] AI Charlie: On the morning of Latent Space Live, they also announced QRdata UKv6, a QEN32B model [00:02:00] modified with RDWKV linear attention layers. Eugene has also written the most single most popular guest post on the Latent Space blog this year. Yes, we do take guest posts on what he has discovered about the H100 GPU inference NeoCloud market since the successful launch of Featherless AI this year.[00:02:20] AI Charlie: As always, don't forget to check the show notes for the YouTube link to their talk as well as their slides. Watch out and take care.[00:02:27] Intros[00:02:27] Dan Fu: Yeah, so thanks so much for having us. 
So this is going to be a little bit of a two part presentation. My name is Dan. I'm at Together AI, and I'll be joining UCSD as faculty in about a year. And Eugene, you want to introduce yourself?[00:02:46] Eugene Cheah: Eugene, I lead the art activity team, and I, I'm CEO of Featherless, and we both work on this new post transformer architecture space.[00:02:55] Dan Fu: Yeah, so yeah, so today we're really excited to talk to you a little bit [00:03:00] about that. So first I'm going to give a broad overview of kind of the last few years of progress in non post transformer architectures. And then afterwards Eugene will tell us a little bit about the latest and the greatest and the latest frontier models in this space.[00:03:16] Why Scale Context Lengths? or work on Efficient Models[00:03:16] Dan Fu: So, the story starts with Scaling. So this is probably a figure or something like this that you've seen very recently. Over the last five to six years, we've seen models really scale up in parameter size, and that's brought with it a bunch of new capabilities, like the ability to talk to you and tell you sometimes how to use your Colab screens.[00:03:35] Dan Fu: But another place where we've seen scaling especially recently is scaling in context length. So this can mean Having more text inputs for your models, but it can also mean things like taking a lot of visual token inputs image inputs to your models or generating lots of outputs. And one thing that's been really exciting over the last few months or so is that we're, we're seeing scaling, not only during training time, but also [00:04:00] during test time.[00:04:00] Dan Fu: So this is one of the, the, this is the iconic image from the OpenAI 01 release. Not only are we starting to scale train time compute, but we're also starting to scale test time compute. Now if you're familiar with our attention and our transformer architectures today, this graph on the right might look a little bit scary.[00:04:19] Dan Fu: And one of the reasons is that the implications are a little bit Interesting. So what does it mean if we want to continue having smarter and smarter models? Do we just need to start building bigger, bigger data centers, spending more flops? Is this this little Dolly 3, we need more flops, guys? Is this going to be the future of all of AI?[00:04:39] Dan Fu: Or is there a better way, another path forward? Maybe we can get the same capabilities that we've gotten used to, But for a lot less compute, a lot less flops. And one of the things that we're going to talk about today is specifically looking at that core attention operator in some of these models.[00:04:57] Dan Fu: And the reason is that so this is just some, some [00:05:00] basic you know, scaling curves, but attention has compute that scales quadratically in the context length. So that means that if you're doing something like test time compute and you want to spend a bunch of tokens thinking about what comes next, the longer that that goes the, the, the more tokens you spend on that, that compute grows quadratically in that.[00:05:19] Dan Fu: One of the questions that we're interested in is, can we take that basic sequence model, that basic sequence primitive at the bottom, and get it to scale better? Can we scale in, let's say, n to the 3 halves or n log n? So in, in the first part of the talk, so we just went over the introduction. 
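To make the quadratic-scaling point concrete, here is a minimal NumPy sketch (an illustration added for this recap, not code from the talk) of the attention operator Dan is describing: the score matrix compares every token against every other token, so compute and memory grow with the square of the sequence length.

```python
# Illustrative sketch (not from the talk): why self-attention is quadratic in sequence length.
import numpy as np

def softmax_attention(Q, K, V):
    """Naive self-attention. Q, K, V have shape (n, d). The score matrix S is (n, n),
    so memory and compute grow as n**2 in the sequence length n."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)               # (n, n): every token compared to every other token
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P = P / P.sum(axis=-1, keepdims=True)  # row-wise softmax
    return P @ V                           # output is (n, d), the same shape as the input

n, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = softmax_attention(Q, K, V)
print(out.shape, "score-matrix entries:", n * n)  # (4096, 64) score-matrix entries: 16777216
```

Doubling the context roughly quadruples the cost of this one operator, which is exactly the pressure behind the subquadratic alternatives discussed next.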
What I'm gonna do over the next few slides is just talk about some of the key advances and ideas that have shown over the past few years since maybe early 2020 to, to now that shown promise that this might actually be possible.[00:05:48] Dan Fu: That you can actually get potentially the same quality that we want while scale, while scaling better. So to do that, we're and, and basically the, the story that we're gonna look is we're gonna start to see [00:06:00] how. So this is a basic graph of just the past couple years of progress of perplexity where that blue line, that dotted blue line, is attention.[00:06:07] The Story of SSMs[00:06:07] Dan Fu: It's your basic transformer, full dense attention. And then the dots coming down are some of the methods that you'll see in this presentation today. We're going to turn the clock back all the way to 2020. So this, this, this question of can we make attention subquadratic? Basically, as soon as we said attention is all you need, People started asking this question.[00:06:28] Dan Fu: So we have this quadratic attention operator. Can we do better? I'll briefly talk about why attention is quadratic. And the basic thing that happens, if you're not familiar, is that you have these inputs, these keys and queries. And what you do in this attention matrix, this S matrix over here, is that you're using, you're comparing every token in your input to every other token.[00:06:49] Dan Fu: So when I try to do something like upload a whole book to Gemini, what happens beyond the Maybe not Gemini, because we don't necessarily know what architecture is. But let's say we upload it to LLAMA, what happens beyond [00:07:00] the scenes, behind the scenes, is that it's going to take every single word in that book and compare it to every other word.[00:07:05] Dan Fu: And this has been a really, it's, it's led to some pretty impressive things. But it's kind of a brute forcing of the way that you would try to interpret a interpret something. And what attention does in particular is the, and then what attention, sorry, don't want to. Okay, no, no laser pointer. What, what attention does afterwards is that instead of always operating in this quadratic thing, it takes a row wise softmax over this matrix, and then multiplies it by this values matrix.[00:07:32] Dan Fu: So, one of the key points to notice is that the output size is always going to be the same as the inputs, at least in standard self attention. So one of the first things that folks tried to do around 2020 is this thing called linear attention, which is just, just noticing that if we take out this softmax from here, if we take out this non linearity in the middle of the attention operation, and then if you compute the keys and the values operation first, you actually never hit this quadratic bottleneck.[00:07:57] Dan Fu: So that, that's potentially a way [00:08:00] to get a lot more computationally efficient. And there are various ways to do this by basically using feature maps or try to approximate this overall attention computation. But some of this work sort of started to hit a wall in 2020. And the basic challenges were, were two.[00:08:16] Dan Fu: So one was quality. It was back then, it was kind of hard to, to get good quality with these linear attention operators. The other one was actually hardware efficiency. So these, this feature map that was just shown by a simplify simplify here. 
Actually ends up being quite computationally expensive if you just implement it naively.[00:08:34] Dan Fu: So you started having these operators that not only were you sure, you're not really sure if they have the same quality, but also they're actually just wall clock slower. So you kind of end up getting the worst of both worlds. So this was the the stage. So that kind of sets the stage for four years ago.[00:08:49] Dan Fu: Keep this in mind because linear attention is actually going to come back in a few years once we have a better understanding. But one of the works that started kicking off this, this [00:09:00] mini revolution in post transformer architectures was this idea called states based model. So here the seminal work is, is one about our work queue in 2022.[00:09:09] Dan Fu: And this, this piece of work really brought together a few ideas from, from some long running research research lines of work. The first one was, and this is really one of the keys to, to closing the gap in quality was just using things that, that if you talk to a, a, an electrical engineer off the street, they might know off, off the, like the back of their hand.[00:09:33] Idea 1: Approximation -> Principled Modeling[00:09:33] Dan Fu: But taking some of those properties with how we model dynamical systems in signal processing and then using those ideas to model the inputs, the, the text tokens in, for example a transformer like Next Token Prediction Architecture. So some of those early states-based model papers were looking at this relatively, relatively simple recurrent update model that comes from maybe chapter one of a signal processing class.[00:09:59] Dan Fu: But then using [00:10:00] some principle theory about how you should do that recurrent update in order to really get the most that you can out of your hidden state, out of your out of your sequence. So that, that was one key idea for quality and. When this was eventually realized, you started to see a bunch of benchmarks that were pretty sticky for a few years.[00:10:20] Dan Fu: Things like long range arena, some long sequence evaluation benchmarks, There was stuff in time series, time series analysis. They started to, you started to see the quality tick up in meaningful ways. But the other key thing that What's so influential about these states based models is that they also had a key idea about how you can compute these things efficiently.[00:10:45] Dan Fu: So if you go back to your machine learning 101 class where you learned about RNNs, one thing that you may have learned is that they don't paralyze as well as detention, because if you just run them naively, you have to do this kind of sequential update to process new tokens, [00:11:00] whereas in attention, you can process all the tokens in parallel at one time.[00:11:04] Dan Fu: One of the key insights behind the S4 paper was that these recurrent models, you could take them and you could also formulate them as a convolution. And in particular, with a convolution, you could, instead of using a PyTorch conv1d operation, you can compute that with the FFT. And that would give you n log n compute in the in the sequence length n with an operator that was relatively well optimized for modern hardware.[00:11:28] Dan Fu: So those are really, I'd say, the two key ideas in 2022 that started allowing these breakthroughs to happen in these non transformer architectures. 
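As a concrete illustration of that second idea, the toy example below (written for this recap, not taken from the S4 paper) computes the same scalar linear recurrence two ways: step by step, and as a causal convolution with a precomputed kernel evaluated via the FFT, which is where the n log n scaling comes from.

```python
# Toy example: the linear recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t can be unrolled
# into a causal convolution y = k * x with kernel k_m = c * a**m * b, and that convolution
# can be evaluated with the FFT in O(n log n) instead of a strictly sequential scan.
import numpy as np

def recurrence(x, a, b, c):
    h, ys = 0.0, []
    for xt in x:                       # sequential: step t cannot start before step t-1
        h = a * h + b * xt
        ys.append(c * h)
    return np.array(ys)

def fft_convolution(x, a, b, c):
    n = len(x)
    k = c * b * a ** np.arange(n)      # precomputed convolution kernel
    L = 2 * n                          # zero-pad so circular convolution equals causal convolution
    y = np.fft.irfft(np.fft.rfft(x, L) * np.fft.rfft(k, L), L)
    return y[:n]

x = np.random.default_rng(0).standard_normal(1024)
a, b, c = 0.9, 0.5, 1.3
print(np.allclose(recurrence(x, a, b, c), fft_convolution(x, a, b, c)))  # True
```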
So, these ideas about how to principally model sorry, how to model the recurrent updates of a mo of, of a sequence in a principled way, and also these key ideas in how you can compute it efficiently by turning it into a convolution and then scaling it up with the FFT.[00:11:53] Dan Fu: Along those same lines, so afterwards we started putting out some work on specialized kernels, so just [00:12:00] like we have flash attention for transformers, we also have works like flash fft conf, and if you look at these lines of work oftentimes when, whenever you see a new architecture, you see a new primitive one of the, one of the table stakes now is, do you have an efficient kernel so that you can actually get wall clock speed up?[00:12:14] Idea 3: Selection[00:12:14] Dan Fu: So by 2022, We are starting to have these models that had promising quality primitives, but and, and also promising wall clocks. So you could actually see regimes where they were better than transformers in meaningful ways. That being said, there were, there's still sometimes a quality gap, particularly for language modeling.[00:12:33] Dan Fu: And because languages, It's so core to what we do in sequence modeling these days the, the next, the next key idea that I'm going to talk about is this idea of selection mechanisms. And this is basically an idea of, so you have this recurrent state that you're keeping around that just summarizes everything that, that came before.[00:12:50] Dan Fu: And to get a good sequence model, one of the things that you really need to be able to do is have the model learn what's the best way to pick out pieces from that recurrent [00:13:00] state. So one of the, one of the major ideas here in a line of work called H3, Hungry Hungry Hippos, and also these hyena models were One way you can do this is by just adding some simple element wise gates.[00:13:13] Dan Fu: So versions of these ideas have been around for decades. If you squint at the LSTM paper you, you can probably find, find this gating mechanism. But turns out you can take those old ideas, add them into these new. state space models, and then you can see quality start to pick up. If you've heard of the Mamba model, this also takes the selection to the next level by actually making some changes in that fundamental recurrent state space.[00:13:40] Dan Fu: So, it's not only just this gating that happens around the SSM layer, but also you can actually make The ABCD matrices of your state space model, you can make them data dependent, which will allow you to even better select out different pieces from your hidden state depending on what you're seeing. I'll also point out if you look at the [00:14:00] bottom right of this figure, there's this little triangle with a GPU SRAM, GPU HBM, and this, this is just continuing that trend of when you have a new architecture you, you, you also release it with a kernel to, to, to show that it is hardware efficient, that it, that it can be hardware efficient on modern hardware.[00:14:17] Dan Fu: The, the, one of the next cool things that happened is once we had this understanding of these are the basic pieces, these are the basic principles behind some of the sequence models linear attention actually started to come back. 
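Before moving on, the selection idea from a moment ago can be sketched in a few lines. The gate below is a deliberate simplification written for this recap (the real H3 and Mamba layers are more involved), but it shows the core mechanic: the input itself decides how much gets written into the fixed-size recurrent state.

```python
# Toy illustration of a selection/gating mechanism (a simplification, not the actual
# H3 or Mamba formulation): an input-dependent gate decides, per token and per channel,
# how much of the current token gets written into the fixed-size recurrent state.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_recurrence(x, A, B, W_gate):
    """x: (n, d_in) tokens, A: (d_state,) decay, B: (d_state, d_in) input map,
    W_gate: (d_state, d_in) gate projection. Returns the hidden state at every step."""
    h = np.zeros(A.shape[0])
    states = []
    for t in range(x.shape[0]):
        g = sigmoid(W_gate @ x[t])     # data-dependent: the token decides what to store
        h = A * h + g * (B @ x[t])     # element-wise decay plus gated write
        states.append(h.copy())
    return np.stack(states)

rng = np.random.default_rng(0)
n, d_in, d_state = 16, 8, 32
x = rng.standard_normal((n, d_in))
states = gated_recurrence(x, A=np.full(d_state, 0.95),
                          B=0.1 * rng.standard_normal((d_state, d_in)),
                          W_gate=rng.standard_normal((d_state, d_in)))
print(states.shape)  # (16, 32): one fixed-size state per token position
```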
So in earlier this year, there was a model called BASED the, from Simran Arora and, and some other folks, that combined a more principled version of linear attention that basically the, the, the, the two second summary is that it used a Taylor approximation of the softmax attention, combined that with a simple sliding window attention and was starting to able, starting to be able to expand the Pareto frontier of how much data can you recall from your sequence, versus how small is your recurrent state size.[00:14:58] Dan Fu: So those orange dots [00:15:00] are, at the top there, are just showing smaller sequences that can recall more memory.[00:15:07] Just Read Twice[00:15:07] Dan Fu: And the last major idea I think that has been influential in this line of work and is very relatively late breaking just a few months ago, is just the basic idea that when you have these models that are fundamentally more efficient in the sequence length, you maybe don't want to prompt them or use them in exactly the same way.[00:15:26] Dan Fu: So this was a really cool paper called Just Read Twice, also from Simran. That basically said, hey, all these efficient models can process tokens so much more efficiently than transformers that they can sometimes have unfair advantages compared to a simple transformer token. So, or sorry, a simple transformer model.[00:15:44] Dan Fu: So take, for example the standard, the standard use case of you have some long document, you're going to pass it in as input, and then you're going to ask some question about it. One problem you might imagine for a recurrent model where you have a fixed state size is, let's say that [00:16:00] you're. Article is very long, and you're trying to ask about some really niche thing.[00:16:04] Dan Fu: You can imagine it might be hard for the model to know ahead of time what information to put into the hidden state. But these, these, these models are so much more efficient that you can do something really stupid, like, you can just put the document write down the document, write down the question, write down the document again, and then write down the question again, and then this time, the second time that you go over that document, you know exactly what to look for.[00:16:25] Dan Fu: And the cool thing about this is, so this is, And this this results in better quality, especially on these recall intensive tasks. But the other interesting thing is it really takes advantage of the more efficient architectures that, that we're having here. So one of the other, I think, influential ideas in this line of work is if you change the fundamental compute capabilities of your model and the way that it scales, you can actually start to query it at test time differently.[00:16:51] Idea 4: Test Time Compute[00:16:51] Dan Fu: And this actually, of course, goes back to those slides on test time compute. So while everybody's looking at, say, test time compute for big transformer models, [00:17:00] I think potentially a really interesting research question is, how can you take those and how does it change with this new next generation of models?[00:17:09] Dan Fu: So the, I'll just briefly summarize what some of those key ideas were and then talk and then show you briefly kind of what the state of the art is today. 
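The "read twice" prompting trick mentioned a moment ago is simple enough to show directly; the template wording below is invented for illustration, not taken from the paper.

```python
# Sketch of the "Just Read Twice" prompting idea: repeat the document (and the question)
# so a fixed-state recurrent model knows what to look for on its second pass over the text.
# The template wording here is illustrative, not the paper's exact prompt.
def jrt_prompt(document: str, question: str, repeats: int = 2) -> str:
    block = f"Document:\n{document}\n\nQuestion: {question}\n\n"
    return block * repeats + "Answer:"

print(jrt_prompt("The warranty covers parts for 24 months...", "How long are parts covered?"))
```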
So, so the four key ideas are instead of just doing a simple linear attention approximation, instead take ideas that we know from other fields like signal processing, do a more principled approach to your modeling of the sequence.[00:17:32] Idea 2: Hardware & Kernel Support[00:17:32] Dan Fu: Another key idea throughout all these lines of work is you really want. Hardware and kernel support from day one. So, so even if your model is theoretically more efficient if somebody goes and runs it and it's two times slower one of the things that, that we've learned is that if, if you're in that situation, it's, it's just gonna be dead on arrival.[00:17:49] Dan Fu: So you want to be designing your architectures one of the key, key machine learning ideas that has been important for the quality is just making sure that you encode different ways that you can [00:18:00] select from your hidden state and, and really focus on that as a key decider of quality. And finally, I think one of the, the, the emerging new, new things for, for this line of work and something that's quite interesting is, What are the right test time paradigms for these models?[00:18:15] Dan Fu: How do they change relative to relative to what you might do for a standard transformer? I'll briefly end this section. So I've labeled this slide where we are yesterday because Eugene is going to talk about some new models that he released literally this morning. But as of yesterday, some of the really cool results out of the, these efficient alternative models were so AI2 trained this hybrid MOE called Jamba.[00:18:40] Dan Fu: That, that, that seems, that is currently the state of the art for these non transformer architectures. There's this NVIDIA and MIT put out this new diffusion model called SANA recently that one of their key key observations is that you can take a standard diffusion transformer diffusion model, replace the layers with linear [00:19:00] attention, and then that lets you scale to much larger much larger images, much, much Much larger sequences more efficiently.[00:19:07] Dan Fu: And and one thing that I don't think anybody would have called when a few years ago is that one of those gated SSM, gated states based models ended up on the cover of Science because a great group of folks went and trained some DNA models. So that's Michael Polley, Eric Yuen from from Stanford and the Arc Institute.[00:19:26] Dan Fu: So it's, we're really at an exciting time in 2024 where these non transformer, post transformer architectures are showing promise across a wide range. Across a wide range of, of modalities, of applications, and, and of tasks. And with that, I'll pass it on to Eugene, who can tell you a little bit about the latest and greatest with RWKV.[00:19:49] RWKV vs SSMs[00:19:49] Eugene Cheah: So, that's useful? Yeah. You're talking to here. Oh, I'm talking to here. Okay. So, yeah, two streams. Yeah. So, I think one common questions that we tend to get asked, right, is what's the difference between [00:20:00] RWKV and state space? So I think one of the key things to really understand, right the difference between the two groups, right, is that we are actually more like an open source, random internet meets academia kind of situation.[00:20:11] Eugene Cheah: Like, most of us never wrote any paper, but we, we basically look at RNNs and linear intention when intention is all you need came out, and then we decided to like, hey there is a quadratic scaling problem. Why don't we try fixing that instead? 
So, so, so we end up developing our own branch, but we end up sharing ideas back and forth.[00:20:30] Eugene Cheah: So, and, and we do all this actively in Discord, GitHub, etc. This was so bad for a few years, right, that basically, the average group's H index was so close to zero, right, Illuter. ai actually came in and helped us write our first paper. Great, now our H index is now three, apparently. So, so, so, but, but the thing is, like, a lot of these experiments led to results, and, and, essentially, essentially, we we took the same ideas from linear attention, [00:21:00] and we built on it.[00:21:01] Eugene Cheah: So, to take a step back into, like, how does RWKB handle its own attention mechanic and achieve the same goals of, like, O and compute, respectively, and in focus of our overall goal to make AI accessible to everyone, regardless of language, nation, or compute, that's our goal. We actually train our models primarily on over a hundred languages, which is another topic altogether.[00:21:23] Eugene Cheah: And our goal is to train to even 200 languages to cover all languages in the world. But at the same time, we work on this architecture, To lower the compute cost so that people can run it on Raspberry Pis and on anything. So, how did RWKB break the dependency of LSTM token flow? Because I think to understand architecture, right, it's probably easier to understand it from the RNN lens.[00:21:46] Eugene Cheah: Because that's where we built on. We all, we all state space kind of like try to, try to start anew and took lessons from that and say, So there's a little bit of divergence there. And AKA, this our version of linear attention. So to take step back [00:22:00] all foundation models, be it transformers or non transformers at a very high level, right?[00:22:05] Eugene Cheah: Pumps in the token. I mean, text that things into embeddings and go through a lot of layers. Generate a lot of states where the QKV cache or be iron in states or RW KB states. And outputs and embedding, they are not the same thing. And we just take more layers and more embeddings. And somehow that magically works.[00:22:23] Eugene Cheah: So, if you, if you remember your ancient RNN lessons which we, which we, which we we call best learning these days the general idea is that you have the embedding information flowing all the way up, and when, and you take that information and you flow it back down, and then you process it as part of your LSTM layers.[00:22:41] Eugene Cheah: So, this is how it generally works. Kapati is quoted saying that RNNs are actually unreasonably effective. The problem is this is not scalable. To start doing work on the second token, you need to wait for the first token. And then you need to, and likewise for the third token and fourth token, yada yada.[00:22:55] Eugene Cheah: That is CPU land, not GPU land. So, so, so, you [00:23:00] can have a H100 and you can't even use 1 percent of it. So, so that's kind of why RNNs didn't really take off in the direction that we wanted, like, billions of parameters when it comes to training. So, what did RDAP KV version 0 do? Boom. We just did the dumbest, lamest thing.[00:23:13] Eugene Cheah: Sorry, this is the bottleneck for RNN. We did the dumb thing of removing that line. And it kind of worked. It trained. It sucked, but it kind of worked. Then we were like, hey, then no one cared because the loss was crap, but how do we improve that? 
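A small sketch (again written for this recap, not RWKV code) of the bottleneck Eugene is describing: the per-token work that depends only on the current input can be batched into one large matrix multiply, but the update from the previous hidden state to the next one is inherently sequential.

```python
# Illustration of the classic RNN bottleneck: the hidden-state update is sequential in t,
# while everything that depends only on the current token can be batched across time.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_h = 1024, 64, 128
X = rng.standard_normal((n, d_in))
W = 0.05 * rng.standard_normal((d_h, d_h))
U = 0.05 * rng.standard_normal((d_h, d_in))

token_parallel = X @ U.T            # (n, d_h): GPU-friendly, all tokens at once

h = np.zeros(d_h)
outputs = []
for t in range(n):                  # sequential: step t must wait for step t-1
    h = np.tanh(W @ h + token_parallel[t])
    outputs.append(h)
print(np.stack(outputs).shape)      # (1024, 128)
```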
And that's essentially where we move forward, because if you see this kind of flow, right, you can actually get your GPU saturated quickly, where it essentially cascades respectively.[00:23:41] Eugene Cheah: So I'm just waiting for this to loop again. So it's like, once you get your first layer, your token to be computed finish. You start to cascade your compute all the way until you are, Hey, I'm using 100 percent of the GPU. So we, we worked on it, and we started going along the principle of that as long as we keep this general architecture [00:24:00] where, where we can cascade and, and be highly efficient with our architecture, nothing is sacred in our architecture.[00:24:06] Eugene Cheah: And we have done some crazy ideas. In fact, you ask us, if you ask me to explain some things in the paper, right, officially in the paper, I'll say we had this idea and we wrote it this way. The reality is someone came with a code, we tested it, it worked, and then we rationalized later. So, so the general[00:24:24] RWKV Arch[00:24:24] Eugene Cheah: The idea behind rwkbr is that we generally have two major blocks that we do.[00:24:30] Eugene Cheah: We call time mix and channel mix. And time mix generally handles handles long term memory states, where essentially, where essentially where we apply the matrix multiplication and Cilu activation functions into processing an input embedding and an output embedding. I'm oversimplifying it because this, This calculation changed every version and we have, like, version 7 right now.[00:24:50] Eugene Cheah: ChannelMix is similar to Base in the sense that it does shorter term attention, where it just looks at the sister token, or the token before it, because [00:25:00] there's a shift in the token shift matrix. I don't really want to go too much into the papers itself, because, like, we do have three papers on this.[00:25:09] Eugene Cheah: Basically, RWKB, RNN for the transformer, ERA, Ego and Pinch, RWKB, Matrix Value State. This is the updated version 5, version 6. And Goldfinch is our, is, is, is, is our hybrid model respectively. We are writing the paper already for V seven and which is, which is for R wk V seven. Called, named Goose, or architectures are named by Bird.[00:25:30] Eugene Cheah: And, I'm going to cover as well, qrwkb, and mama100k, and rwkb, and Where did that lead to? Great! Because we are all GPU poor and to be clear, like, most of this research is done, like, only on a handful H100s, which I had one Google researcher told me that was, like, his experiment budget for a single researcher.[00:25:48] Eugene Cheah: So, our entire organization has less compute than a single researcher in Google. So We, we, one of the things that we explored into was to how do we convert transformer models instead? Because [00:26:00] someone already paid that billion dollars, a million dollars onto training, so why don't we take advantage of those weights?[00:26:05] Eugene Cheah: And, and to, I believe, together AI worked on the lockets for, for the Lambda side of things, and, and we took some ideas from there as well, and we essentially did that for RWKB.[00:26:15] QWRKWv6 launch[00:26:15] Eugene Cheah: And that led to, Q RWKB6, which we just dropped today, a 32 bit instruct preview model, where we took the Quen 32 bit instruct model, freeze the feedforward layer, remove the QKB attention layer, and replace it with RWKB linear layers.[00:26:32] Eugene Cheah: So to be clear, this means we do not have the rwkv channel mix layer, we only have the time mix layer. 
But but once we do that, we train the rwkv layer. Important is that the feedforward layer needs to be frozen, so the new attention can be learned. And then we unfreeze the feedforward layer, and train all the layers together with a custom learning rate schedule, so that they can learn how to work together.[00:26:54] Eugene Cheah: The end result, surprisingly, And, to be honest, to the frustration of the R. W. [00:27:00] KV MOE team, which ended up releasing the model on the same day, was that, with just a few hours of training on two nodes, we managed to get it to be on par, kind of, with the original QUAN32B model. So, in fact, when the first run, right, that completely confused us, it was like, and I was telling Daniel Goldstein, Smirky, who kind of leads most of our research coordination, When you pitched me this idea, you told me at best you'll get the same level of performance.[00:27:26] Eugene Cheah: You didn't tell me the challenge and score and Winograd score will shoot up. I don't know what's happening there. But it did. MMLU score dropping, that was expected. Because if you think about it, when we were training all the layers, right, we were essentially Like, Frankenstein this thing, and we did brain damage to the feedforward network layer 2 with the new RWKB layers.[00:27:47] Eugene Cheah: But, 76%, hey, somehow it's retained, and we can probably further train this. We didn't even spend more than 3 days training this, so there's a lot more that can be done, hence the preview. This brings up [00:28:00] a big question, because We are already now in the process of converting to 7TB. We are now, this is actually extremely compute efficient to test our attention mechanic.[00:28:10] Eugene Cheah: It's like, it becomes a shortcut. We can, we are already planning to do our version 7 and our hybrid architecture for it. Because we don't need to train from scratch. And we get a really good model out of it. And the other thing that is uncomfortable to say is that because we are doing right now on the 70b is that if this scales correctly to 128k context length, I'm not even talking about a million 128, majority of enterprise workload today is just on 70b at under 32k context length.[00:28:41] Eugene Cheah: That means if this works and the benchmark matches it, It means we can replace the vast majority of current AI workload, unless you want super long context. And then sorry, can someone give us more GPUs? Because we do need the VRAM for super long context, sadly. So yeah, that's what we are working on, and essentially, [00:29:00] we are excited about this to just push it further.[00:29:02] Eugene Cheah: And this conversion process, to be clear, I don't think it's going to be exclusive to RWKB. It probably will work for Mamba as well, I don't see why not. And we will probably see more ideas, or more experiments, or more hybrids, or Yeah, like, one of the weirdest things that I wanted to say outright, and I confirmed this with the Black Mamba team and the Jamba team, which because we did the GoFinch hybrid model, is that none of us understand why a hard hybrid with a state based model to be R.[00:29:28] Eugene Cheah: QA state space and transformer performs better when, than the baseline of both. It's like, it's like when you train one, you expect, and then you replace, you expect the same results. That's our pitch. That's our claim. But somehow when we jam both together, it outperforms both. 
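Stepping back, the conversion recipe described above (freeze the feed-forward layers, swap in the new attention, then unfreeze and train everything) can be outlined as staged training. The code below is a toy schematic with invented module names and a stand-in layer, not the actual Qwen or RWKV code, but it captures the freeze, swap, then unfreeze sequence.

```python
# Schematic of the staged conversion recipe (toy code with made-up names; the real runs
# start from pretrained Qwen weights and use actual RWKV time-mix layers).
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        return x + self.mlp(x + a)

class StandInLinearAttention(nn.Module):
    """Placeholder for an RWKV-style time-mix layer; not the real architecture."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, q, k=None, v=None):
        return self.proj(q), None           # mimic (output, weights) so ToyBlock still runs

block = ToyBlock(d=64)                      # stand-in for a block with pretrained weights
block.attn = StandInLinearAttention(64)     # swap softmax attention for the new linear layer

# Stage 1: freeze the feed-forward weights so only the new attention layer has to adapt.
for p in block.mlp.parameters():
    p.requires_grad = False
stage1_opt = torch.optim.AdamW([p for p in block.parameters() if p.requires_grad], lr=1e-3)

# Stage 2: unfreeze everything and train all layers jointly (custom LR schedule in practice).
for p in block.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.AdamW(block.parameters(), lr=1e-4)

print(block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64]): swapped block still runs
```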
And that's like one area of emulation that, like, we only have four experiments, plus four teams, that a lot more needs to be done.[00:29:51] Eugene Cheah: But, but these are things that excite me, essentially, because that is what it's potentially we can move ahead for. Which brings us to what comes next.[00:30:00] What's next[00:30:00] [00:30:00][00:30:00] Dan Fu: So, this part is kind of just some, where we'll talk a little bit about stuff that, that we're excited about. Maybe have some wild speculation on, on what, what's, what's coming next.[00:30:12] Dan Fu: And, of course this is also the part that will be more open to questions. So, a couple things that, that I'm excited about is continued hardware model co design for, for these models. So one of the things that we've put out recently is this library called ThunderKittens. It's a CUDA library.[00:30:29] Dan Fu: And one of the things that, that we found frustrating is every time that we built one of these new architectures, and I'm sure you had the exact same experience, we'd have to go and spend two months in CUDA land, like writing these, these new efficient things. And. If we decided to change one thing in PyTorch, like one line of PyTorch code is like a week of CUDA code at least.[00:30:47] Dan Fu: So one of our goals with, with a library like Thunderkitten, so we, we just broke down what are the key principles, what are the key hardware things what are the key, Compute pieces that you get from the hardware. So for example on [00:31:00] H100 everything is really revolves around a warp group matrix multiply operation.[00:31:06] Dan Fu: So you really want your operation to be able to split into relatively small matrix, matrix multiply operations. So like multiplying two 64 by 64 matrices, for example. And so if you know that ahead of time when you're designing your model, that probably gives you you know, some information about how you set the state sizes, how you set the update, how you set the update function.[00:31:27] Dan Fu: So with Thunderkittens we basically built a whole library just around this basic idea that all your basic compute primitives should not be a float, but it should be a matrix, and everything should just be matrix compute. And we've been using that to, to try to both re implement some existing architectures, and also start to design code.[00:31:44] Dan Fu: Some new ones that are really designed with this core with a tensor core primitive in mind. Another thing that that we're, that at least I'm excited about is we, over the last four or five years, we've really been looking at language models as the next thing. But if you've been paying [00:32:00] attention to Twitter there's been a bunch of new next generation models that are coming out.[00:32:04] Dan Fu: So there, there are. So, video generation models that can run real time, that are supported by your mouse and your keyboard, that I'm told if you play with them that, you know, that they only have a few seconds of memory. Can we take that model, can we give it a very long context length so that you could actually maybe generate an entire game state at a time?[00:32:25] Dan Fu: What does that look like for the model? You're certainly not going to do a giant quadratic attention computation to try to run that. Maybe, maybe use some of these new models, or some of these new video generation models that came out. So Sora came out I don't know, two days ago now. 
But with super long queue times and super long generation times.[00:32:43] Dan Fu: So that's probably a quadratic attention operation at the, at the bottom of it. What if we could remove that and get the same quality, but a lot faster generation time? Or some of the demos that we saw from Paige earlier today. You know, if I have a super long conversation with my [00:33:00] Gemini bot, what if I wanted to remember everything that it's seen in the last week?[00:33:06] Dan Fu: I mean, maybe you don't for personal reasons, but what if I did, you know? What does that mean for the architecture? And I think, you know, that's certainly something I'm pretty excited about. I'm sure you're excited about it too. So, I think we were supposed to have some hot takes, but I honestly don't remember what our hot takes were.[00:33:21] Hot Takes - does anyone really need long context?[00:33:21] Eugene Cheah: Yeah, including the next slide. Hot takes, yes, these are our[00:33:25] Dan Fu: hot takes.[00:33:25] Eugene Cheah: I think the big one on Twitter that we saw, that we shared, was the question is like, is RAG relevant? In the case of, like, the future of, like, state based models?[00:33:38] Dan Fu: Let's see, I haven't played too much with RAG. But when I have. I'll say I found it was a little bit challenging to do research on it because we had this experience over and over again, where you could have any, an embedding model of any quality, so you could have a really, really bad embedding model, or you could have a really, really [00:34:00] good one, By any measure of good.[00:34:03] Dan Fu: And for the final RAG application, it kind of didn't matter. That's what I'll say about RAG while I'm being recorded. I know it doesn't actually answer the question, but[00:34:13] Eugene Cheah: Yeah, so I think a lot of folks are like, extremely excited of the idea of RWKB or State Space potentially having infinite context.[00:34:21] Eugene Cheah: But I think the reality is that when we say infinite context, we just mean a different kind of infinite context, or you, or as it's previously covered, you need to test the model differently. So, think of it more along the lines of the human. Like, I don't remember what I ate for breakfast yesterday.[00:34:37] Eugene Cheah: Yeah, that's the statement that I'll say. And And we humans are not quadratic transformers. If we did, if let's say we increased our brain size for every second we live, we would have exploded by the time we are 5 years old or something like that. And, and I think, I think basically fundamentally for us, right, be it whether we, regardless of whether RWKB, statespace, XLSTM, [00:35:00] etc, our general idea is that instead of that expanding state, that increase in computational cost, what if we have a fixed state size?[00:35:08] Eugene Cheah: And Information theory detects that that fixed state size will have a limit. Just how big of a limit is a question, like, we, like, RWKB is running at 40 megabytes for, for its state. Its future version might run into 400 megabytes. That is like millions of tokens in, if you're talking about mathematically, the maximum possibility.[00:35:29] Eugene Cheah: It's just that I guess we were all more inefficient about it, so maybe we hit 100, 000. And that's kind of like the work we are doing, trying to like push it and maximize it. And that's where the models will start differing, because it will choose to forget things, it will choose to remember things. 
And that's why I think that there might be some element of right, but it may not be the same right.[00:35:49] Eugene Cheah: It may be the model learn things, and it's like, hmm, I can't remember that, that article. Let me do a database search, to search. Just like us humans, when we can't remember the article in the company. We do a search on Notion. [00:36:00][00:36:00] Dan Fu: I think something that would be really interesting is if you could have facts that are, so right now, the one intuition about language models is that all those parameters are around just to store random facts about the world.[00:36:14] Dan Fu: And this intuition comes from the observation that if you take a really small language model, it can do things like talk to you, or kind of has like the The style of conversation, it can learn that, but where it will usually fall over compared to a much larger one is it'll just be a lot less factual about things that it knows or that it can do.[00:36:32] Dan Fu: But that points to all those weights that we're spending, all that SGD that we're spending to train these models are just being used to store facts. And we have things like databases that are pretty good at storing facts. So I think one thing that would be really interesting is if we could actually have some sort of outside data store that a language model can can look at that that maybe is you know, has has some sort of gradient descent in it, but but would be quite interesting.[00:36:58] Dan Fu: And then maybe you could edit it, delete [00:37:00] facts, you know, change who's president so that it doesn't, it doesn't get lost.[00:37:04] Vibhu: Can we open up Q& A and hot takes for the audience? I have a hot take Q& A. Do these scale? When, when 405B state space model, RAG exists, no one does long context, who's throwing in 2 million token questions, hot takes?[00:37:24] Dan Fu: The, the who's throwing in 2 million token question, I think, is, is a really good question. So I actually, I was going to offer that as a hot take. I mean, my hot take was going to be that long context doesn't matter. I know I just gave a whole talk about it, but you know, what, what's the point of doing research if you can't, you know, play both sides.[00:37:40] Dan Fu: But I think one of the, so I think for both of us, the reason that we first got into this was just from the first principled questions of there's this quadratic thing. Clearly intelligence doesn't need to be quadratic. What is going on? Can we understand it better? You know, since then it's kind of turned into a race, which has [00:38:00] been exciting to watch, like, how much context you can take in.[00:38:03] Dan Fu: But I think it's right. Nobody is actually putting in a two million context prompt into these models. And, and, you know, if they are, maybe we can go, go You know, design a better model to do that particular thing. Yeah, what do you think about that? So you've also been working on this. Do you think long context matters?[00:38:19] Eugene Cheah: So I'm going to burn a bit. How many of you remember the news of Google Gemini supporting 3 million contacts, right? Raise your hand.[00:38:28] Vibhu: Yeah, 2 million.[00:38:29] Eugene Cheah: Oh, it's 2 million.[00:38:31] Eugene Cheah: Yeah, how many of you actually tried that? See?[00:38:34] Vibhu: I use it a lot. You? You work for MindsTV. 
I use it a lot.[00:38:41] Eugene Cheah: So, for some people that has used, and I think, I think that's the, that's might be, like, this is where my opinion starts to differ, because I think the big labs may have a bigger role in this, because Like, even for RWKB, even when we train non contacts, the reason why I say VRAM is a problem is that because when we did the, we need to backprop [00:39:00] against the states, we actually need to maintain the state in between the tokens by the token length.[00:39:05] Eugene Cheah: So that means we need to actually roll out the whole 1 million contacts if we are actually training 1 million. Which is the same for transformers, actually, but it just means we don't magically reuse the VRAM consumption in the training time space. So that is one of the VRAM bottlenecks, and I'm neither OpenAI nor Google, so donate GPUs if you have too much of them.[00:39:27] Eugene Cheah: But then, putting it back to another paradigm, right, is that I think O1 style reasoning might be actually pushing that direction downwards. In my opinion, this is my partial hot take is that if, let's say you have a super big model, And let's say you have a 70B model that may take double the tokens, but gets the same result.[00:39:51] Eugene Cheah: Strictly speaking, a 70B, and this is even for transformer or non transformer, right? We we'll take less less resources than that 400 B [00:40:00] model, even if it did double the amount thinking. And if that's the case, and we are still all trying to figure this out, maybe the direction for us is really getting the sub 200 B to be as fast as efficient as possible.[00:40:11] Eugene Cheah: We a very efficient architecture that some folks happen to be working on to, to just reason it out over larger and larger context thing.[00:40:20] Question: Yeah. One thing I'm super interested in is. Models that can watch forever? Obviously you cannot train something on infinite context length. How are y'all thinking about that, where you run on a much longer context length than is possible to train on?[00:40:38] Dan Fu: Yeah, it's a, it's a great question. So I think when I think you guys probably had tweets along these lines, too. When we first started doing these things, because these are all recurrent models in theory you could just run it forever. You could just run it forever. And at the very least it won't, it won't like error out on your crash.[00:40:57] Dan Fu: There's another question of whether it can actually [00:41:00] use what it's seen in that infinite context. And I think there, so one place where probably the research and architectures ran faster Then another research is actually the benchmarks for long context. So you turn it on forever. You want to do everything or watch everything.[00:41:16] Dan Fu: What is it that you actually wanted to do? Can we actually build some benchmarks for that? Then measure what's happening. And then ask the question, can the models do it? Is there something else that they need? Yeah, I think that if I were to turn back the clock to 2022, that's probably one of the things I would have done differently, which would have been actually get some long context benchmarks out at the same time as we started pushing context length on all these models.[00:41:41] Eugene Cheah: I will also say the use case. So like, I think we both agree that there's no Infinite memory and the model needs to be able to learn and decide. 
I think what we have observed for, I think this also fits the state space model, is that one of the key advantages of this alternate attention mechanic that is not based on token position is that the model don't suddenly become crazy when you go past the [00:42:00] 8k training context tank, or a million context tank.[00:42:03] Eugene Cheah: It's actually still stable. It's still able to run, it's still able to rationalize. It just starts forgetting things. But some of these things are still there in latent memory. Some of these things are still somewhat there. That's the whole point of why reading twice works. Things like that. And one of the biggest pushes in this direction is that I think both Statespace and RWKB have Separate papers by other researchers where they use this architecture for time series data.[00:42:26] Eugene Cheah: Weather modeling. So, you are not asking what was the weather five days ago. You're asking what's the weather tomorrow based on the infinite length that we, as long as this Earth and the computer will keep running. So, so, and they found that it is like, better than existing, like, transformer or existing architecture in modeling this weather data.[00:42:47] Eugene Cheah: Control for the param size and stuff. I'm quite sure there are people with larger models. So, so there are things that, that in this case, right, there is future applications if your question is just what's next and not what's 10 years ago.[00:42:59] Dan Fu: Thanks so [00:43:00] much for having us. Get full access to Latent Space at www.latent.space/subscribe

Vanishing Gradients
Episode 40: What Every LLM Developer Needs to Know About GPUs

Vanishing Gradients

Play Episode Listen Later Dec 24, 2024 103:34


Hugo speaks with Charles Frye, Developer Advocate at Modal and someone who really knows GPUs inside and out. If you're a data scientist, machine learning engineer, AI researcher, or just someone trying to make sense of hardware for LLMs and AI workflows, this episode is for you. Charles and Hugo dive into the practical side of GPUs—from running inference on large models, to fine-tuning and even training from scratch. They unpack the real pain points developers face, like figuring out: - How much VRAM you actually need. - Why memory—not compute—ends up being the bottleneck. - How to make quick, back-of-the-envelope calculations to size up hardware for your tasks. - And where things like fine-tuning, quantization, and retrieval-augmented generation (RAG) fit into the mix. One thing Hugo really appreciates is that Charles and the Modal team recently put together the GPU Glossary—a resource that breaks down GPU internals in a way that's actually useful for developers. We reference it a few times throughout the episode, so check it out in the show notes below.
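In the spirit of those back-of-the-envelope calculations, here is a rough VRAM estimator for inference. The formula and constants are common rules of thumb rather than numbers quoted from the episode, and it ignores refinements like grouped-query attention that shrink the KV cache.

```python
# Rough rule-of-thumb VRAM estimate for LLM inference: weights + KV cache + overhead.
# Constants are approximations, not figures from the episode, and grouped-query attention
# (which most recent models use) would make the KV cache term smaller than shown here.
def estimate_inference_vram_gb(
    n_params_b: float,        # model size in billions of parameters
    bytes_per_param: float,   # 2 for fp16/bf16, 1 for int8, 0.5 for 4-bit quantization
    n_layers: int,
    d_model: int,
    context_len: int,
    batch_size: int = 1,
    kv_bytes: float = 2.0,    # KV cache typically kept in fp16
) -> float:
    weights = n_params_b * 1e9 * bytes_per_param
    # KV cache: two tensors (K and V) per layer, each context_len x d_model per sequence
    kv_cache = 2 * n_layers * d_model * context_len * batch_size * kv_bytes
    overhead = 0.1 * weights  # activations, fragmentation, CUDA context, etc.
    return (weights + kv_cache + overhead) / 1e9

# Example: an 8B model with 32 layers and d_model 4096, fp16 weights, 8k context
print(f"{estimate_inference_vram_gb(8, 2, 32, 4096, 8192):.1f} GB")  # roughly 22 GB
```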

The Hardware Unboxed Podcast
Intel Promises Battlemage GPU Game Fixes, Enough VRAM and Long Term Future (feat. Tom Petersen)

The Hardware Unboxed Podcast

Play Episode Listen Later Dec 9, 2024 86:42


Episode 55: Intel's Tom Petersen joins the podcast to chat about Arc Battlemage! We discuss fixing and improving game compatibility, the importance of enough VRAM, hardware design decisions including the die size, the future of the Arc division, XeSS 2 frame generation and plenty more.CHAPTERS00:00 - Intro02:12 - The Journey from Alchemist to Battlemage09:21 - Hardware Changes and Improving Game Compatibility15:13 - The Importance of VRAM20:11 - GPU Design Choices and Improvements34:18 - The Price38:12 - The Future of Arc: On the Chopping Block?48:28 - XeSS 2 Frame Generation and XeLL1:19:45 - Ray Tracing on Arc Battlemage1:25:32 - OutroSUBSCRIBE TO THE PODCASTAudio: https://shows.acast.com/the-hardware-unboxed-podcastVideo: https://www.youtube.com/channel/UCqT8Vb3jweH6_tj2SarErfwSUPPORT US DIRECTLYPatreon: https://www.patreon.com/hardwareunboxedLINKSYouTube: https://www.youtube.com/@Hardwareunboxed/Twitter: https://twitter.com/HardwareUnboxedBluesky: https://bsky.app/profile/hardwareunboxed.bsky.social Hosted on Acast. See acast.com/privacy for more information.

Technikquatsch
TQ234: Intel Arc B580 unveiled, Bitcoin above $100K, Indiana Jones and the Great Circle needs a lot of VRAM, Radeon RX 8800 XT said to be on par with the RTX 4080, possible TikTok ban in the USA soon

Technikquatsch

Play Episode Listen Later Dec 9, 2024 98:58


A wild week, and by that we don't even mean the attempted coup in South Korea, which does get a mention, though (nor Syria, that happened after recording): Intel has finally unveiled its first dedicated graphics cards with the "Battlemage" architecture, and on paper the Arc B580 with 12 GB of VRAM for a $249 MSRP (euro prices incl. VAT are not yet known) doesn't look bad at all. We'll know more from December 12. Fittingly, there are new rumors about AMD's next GPU generation: the presumably largest model, the Radeon RX 8800 XT, is said to reach the performance of an RTX 4080 (in RT as well), although by then its counterpart on the Nvidia side is likely to be the RTX 5070. An assumed 16 GB of VRAM on the RX 8800 XT should suit Indiana Jones and the Great Circle nicely, since that game needs plenty of it and also requires hardware ray tracing for global illumination. Full ray tracing, i.e. path tracing, has since been added via a patch (for Nvidia GPUs only). Bitcoin has shot past the $100,000 mark, fueled in part by the election of Donald Trump, who has become very crypto-friendly. There is also trouble around the Hawk Tuah memecoin. The short-video platform TikTok is facing a ban in the USA, but Donald Trump now likes TikTok, so the ban might be lifted again, or something along those lines. That landed us on algorithms and, naturally, on Bluesky as well. Enjoy episode 234! Speakers: Meep, Michael Kister, Mohammed Ali Dad Production: Michael Kister Cover art: Mohammed Ali Dad Image sources: Intel Recording date: 07.12.2024 Visit us on Discord https://discord.gg/SneNarVCBM on Bluesky https://bsky.app/profile/technikquatsch.de on Instagram https://www.instagram.com/technikquatsch on YouTube https://www.youtube.com/@technikquatsch (soon again) on Twitch https://www.twitch.tv/technikquatsch RSS feed https://technikquatsch.de/feed/podcast/ Spotify https://open.spotify.com/show/62ZVb7ZvmdtXqqNmnZLF5u Apple Podcasts https://podcasts.apple.com/de/podcast/technikquatsch/id1510030975 00:00:00 It's Mike's birthday and his cat wished for a second litter box 00:04:53 Topic overview 00:08:23 Bitcoin above $100K https://www.tagesschau.de/wirtschaft/finanzen/bitcoin-100000-dollar-marke-100.html 00:14:55 Hawk Tuah memecoin https://www.watson.ch/wirtschaft/digital/215025210-haliey-welch-lanciert-kryptowaehrung-hawk-tuah-coin-und-hat-nun-aerger https://www.web3isgoinggreat.com/ 00:21:55 Intel Arc B580 and B570 (Battlemage) unveiled https://www.computerbase.de/news/grafikkarten/intel-battlemage-arc-b580-und-arc-b570-fuer-249-und-219-us-dollar-vorgestellt.90519/ 00:38:26 Rumors and leaks on the AMD Radeon RX 8800 XT/RDNA 4, said to match the performance of an RTX 4080 https://www.computerbase.de/news/grafikkarten/gpu-geruechte-radeon-rx-8800-xt-soll-der-rtx-4080-die-stirn-bieten.90511/ 00:48:20 Indiana Jones and the Great Circle reviews and benchmarks https://www.computerbase.de/artikel/gaming/indiana-jones-und-der-grosse-kreis-benchmark-test.90500/ https://www.computerbase.de/news/gaming/indiana-jones-und-der-grosse-kreis-update-1-fuegt-full-rt-bei-nvidia-gpus-ab-12-gb-vram-hinzu.90586/ Indiana Jones and the Great Circle - PC Tech Review - Best Settings, DF Optimised + Xbox vs PC https://www.youtube.com/watch?v=xbvxohT032E 00:58:59 The upcoming TikTok ban in the USA https://www.heise.de/news/Urteil-Verbot-von-TikTok-in-den-USA-kommt-naeher-Trump-koennte-intervenieren-10191678.html 01:12:21
Bluesky and algorithms 01:33:14 Winding down

The Hardware Unboxed Podcast
S.T.A.L.K.E.R. 2 Punishes Your CPU, GPU and VRAM

The Hardware Unboxed Podcast

Play Episode Listen Later Nov 21, 2024 66:45


Episode 53: We've been testing games this week, including S.T.A.L.K.E.R. 2 and Flight Simulator 2024, so we discuss how these games run on PC. Loading issues, punishing CPU requirements, VRAM issues and more.CHAPTERS00:00 - Intro04:07 - Testing S.T.A.L.K.E.R. 214:51 - VRAM is An Issue Again34:18 - Floaty Controls and Frame Generation39:04 - Testing Flight Simulator 202446:51 - Updates From Our Boring LivesBluesky: https://bsky.app/profile/hardwareunboxed.bsky.socialSUBSCRIBE TO THE PODCASTAudio: https://shows.acast.com/the-hardware-unboxed-podcastVideo: https://www.youtube.com/channel/UCqT8Vb3jweH6_tj2SarErfwSUPPORT US DIRECTLYPatreon: https://www.patreon.com/hardwareunboxedFloatplane: https://www.floatplane.com/channel/HardwareUnboxedLINKSYouTube: https://www.youtube.com/@Hardwareunboxed/Twitter: https://twitter.com/HardwareUnboxed Hosted on Acast. See acast.com/privacy for more information.

The Weekly Tech Rant with Jay and Karl
Episode 183: It's Time for a VR Headset Showdown, Windows Security Tweaks, Nvidia's Powerhouse RTX 5090, and Apple's Future Leaks

The Weekly Tech Rant with Jay and Karl

Play Episode Listen Later Oct 7, 2024 42:45


VR Headset Roundup
Meta Quest 3s
Meta Quest 3
HTC Focus Vision
Pico 4 Ultra

News
Microsoft's more secure Windows Recall feature can also be uninstalled by users
Nvidia's RTX 5090 will reportedly include 32GB of VRAM and hefty power requirements

Rumours
Alleged M4 MacBook Pro packaging leak highlights a few new upgrades
Report: HomePod with display and homeOS to launch next year with Apple Intelligence

Broken Silicon
276. PS5 Pro 120Hz Leak, Intel Lunar Lake Review, Nintendo Switch 2, AMD Zen 5 Strix

Broken Silicon

Play Episode Listen Later Sep 24, 2024 100:34


We discuss Intel Arrow Lake, Lunar Lake, Zen 5, RADEON UDNA, PS5 Pro, and Nintendo Switch 2!!! [SPON: Thanks for Sponsoring the Video Odoo! Get your first App FREE here: https://www.odoo.com/r/xSwO ] [SPON: Use "brokensilicon“ at CDKeyOffer to get Win 11 Pro for $23: https://www.cdkeyoffer.com/cko/Moore11 ] 0:00 Tom messes up the beginning (Intro Banter) 3:06 XBOX Series S vs GTX 970 VRAM, XSX Disc Drives (Corrections) 9:12 Intel makes IFS a Subsidiary & Sells of Parts of the Company 17:57 AMD Strix Point vs Meteor Lake vs Hawk Point Pricing Analysis 32:20 AMD Ryzen AI Max+ 395 Blockchain GTA VI 38:53 Intel Lunar Lake Reviews - Competitive w/ Strix Point 49:47 PS5 Pro Revealed w/ Controversial $699 MSRP 50:31 PlayStation 5 Pro tested at 120Hz (Leak) 1:08:16 Lunar Lake Early Supply Leak 1:08:44 Nintendo Switch 2 Leaked 1:15:08 iPhone 16 Revealed, Launched, and Tested 1:21:02 Zen 5 CCX Latency, Arrow Lake Performance, UDNA, FSR 4 (Wrap-Up) 1:31:59 IPC Terminology, RDNA 4 Ray Tracing (Final Reader Mail) https://www.xbox.com/en-US/consoles/xbox-series-x https://www.cnbc.com/2024/09/16/intel-turns-foundry-business-into-subsidiary-weighs-outside-funding.html https://www.servethehome.com/intel-creating-foundry-subsidiary-and-announcing-a-big-aws-win/ https://www.nasdaq.com/articles/beaten-down-intel-stock-buy-foundry-spinoff-plans https://x.com/AnhPhuH/status/1837053994591735905 https://www.newegg.com/p/2S3-0006-002E9 https://www.newegg.com/p/1TS-000E-1B8Y6?Item=9SIAMRPKA16634 https://www.newegg.com/p/1TS-000E-1B481?Item=9SIAKDXK9J5011 https://www.newegg.com/p/1TS-000X-05XE2 https://weibo.com/3219724922/OxQViq3ja https://www.tomshardware.com/pc-components/cpus/amd-pushes-ryzen-to-the-max-ryzen-ai-max-300-strix-halo-reportedly-has-up-to-16-zen-5-cores-and-40-rdna-3-cus https://youtu.be/BLwwytLe4DA?si=-K2sqw0xyeaTstW8 https://www.youtube.com/live/X24BzyzQQ-8?si=L5IHsTEzmnNuisUp https://youtu.be/6HaRMiTfvks https://youtu.be/jGRxqfG7RxY https://youtube.com/live/-nhZJ1RTTsM?feature=share https://youtu.be/5qlOQg2mEsw https://www.youtube.com/watch?v=fJZ6ndDACG8 https://www.cnet.com/tech/gaming/exclusive-hands-on-i-played-sonys-all-new-ps5-pro/ https://www.tomsguide.com/gaming/playstation/playstation-30th-anniversary-collection-pre-orders-how-to-buy https://x.com/deckwizardyt/status/1836365264625058214 https://x.com/deckwizardyt/status/1837089911809183976 https://x.com/carygolomb/status/1836377056780698009 https://x.com/mooreslawisdead/status/1836548687352172868 https://youtu.be/5qlOQg2mEsw https://www.youtube.com/watch?v=UArxpvOZV5M&ab_channel=%E5%B0%8F%E5%AE%81%E5%AD%90XNZ https://www.reddit.com/r/GamingLeaksAndRumours/comments/1fjp352/photos_of_switch_2_factory_prototypes_have_leaked/ https://www.apple.com/newsroom/2024/09/apple-introduces-iphone-16-and-iphone-16-plus/ https://www.apple.com/newsroom/2024/09/apple-debuts-iphone-16-pro-and-iphone-16-pro-max/ https://finance.yahoo.com/news/apple-iphone-16-reaches-stores-004937230.html https://www.pcmag.com/news/which-iphone-16-is-fastest-a18-vs-18-pro-processors-benchmarked https://www.applemust.com/the-queues-for-iphone-16-track-emerging-economic-realities/ https://www.cbsnews.com/news/apple-iphone-16-on-sale-but-without-ai/ https://www.businessinsider.com/apple-intelligence-features-rollout-timeline-iphone-16-2024-9 https://www.youtube.com/watch?v=hp0dZEXZ_7I&ab_channel=MrMacRight

ThinkComputers Weekly Tech Podcast
ThinkComputers Podcast #413 - New Lian Li AiO, Noctua NH-D15 G2, CXL Tech & More!

ThinkComputers Weekly Tech Podcast

Play Episode Listen Later Jul 4, 2024 72:16


This week on the podcast we go over our reviews of the PNY XLR8 CS3150 Gen5 Solid State Drive, Lian Li HydroShift LCD 360S Liquid CPU Cooler, and Noctua NH-L12Sx77 CPU Cooler.  We also discuss some interesting new tech that allows for memory and SSDs to increase VRAM capacity, the new Noctua NH-D15 G2 CPU cooler, and much more!

Data Center Therapy
#091 - Broadcom's VMware Acquisition: The Fireside Followup

Data Center Therapy

Play Episode Listen Later May 23, 2024 47:09


Welcome back, long-awaiting listeners, as your favorite IT podcast with a healthy dose of empathy, Data Center Therapy, returns after a seven-month hiatus. It's been too long, we know, but we're back and we're recharged so we can bring you the first of many new episodes with exciting, topical and relevant content. Thanks for joining us!

On this episode, your intrepid hosts, Mr. Matt “been through the desert on a horse with no name” Cozzolino and Mr. Matt “Call me Heisenberg” Yette share Cozzo's adventures hiking from Supai to Havasupai Falls in the great state of Arizona, and talk all things Broadcom, given the big news of the acquisition of VMware completed since the last episode aired.

While sharing their thoughts on the whole VMware ecosystem and the changes, the DCT crew muse about:
The new tiers of subscription licensing that replace many of the old VMware SKUs,
The changes in licensing from sockets to cores,
The aborted licensing change back in the VMware days regarding vRAM,
The “second day” about face on ROBO licensing,
Viable alternatives to vSphere,
The spinning off of the Horizon technology to the new firm named Omnissa,
and much, much more.

As a reminder, IVOXY are hosting a vSphere 8 Advanced Class in just two weeks, and we're developing classes and workshops for Disaster Recovery and Aria Operations (formerly VMware vRealize Operations). If you're interested in Matt Cozzolino's Networking for Server admins Ask Me Anything on June 20th at 11AM Pacific, register at: https://ivoxy.com/ama-networkingforserveradmins. If you're looking for the DirtFish 2024 event registration, it may be found at: https://ivoxy.com/dirtfish2024. As always, if you enjoy Data Center Therapy, please tell three friends and be sure to like, share and subscribe wherever you get your podcasts. Thanks for your patience, your attention, and we eagerly look forward to sharing more in 2024 with you all on the next episode of Data Center Therapy!

Digital Foundry Direct Weekly
DF Direct Weekly #162: Xbox Studio Shutdown Crisis, Switch 2 Confirmed, ROG Ally X Specs Reaction

Digital Foundry Direct Weekly

Play Episode Listen Later May 13, 2024 131:52


There's anger, bafflement and frustration this week as John, Rich and Alex discuss Microsoft's decision to close one studio that created one of the greatest immersive games ever made and another studio that's just delivered a BAFTA-winning, critically acclaimed rhythm action game... so what is going on at Xbox right now? Beyond that, the team discuss the latest Switch 2 revelations, the new Asus ROG - or R.O.G. - Ally, Hellblade 2 PC specs and Sony's massive Helldivers 2 own-goal. 0:00:00 Introduction 0:01:38 News 01: Microsoft shutters Tango Gameworks, Arkane Austin 0:44:37 News 02: Nintendo confirms Switch sequel, potential components uncovered 1:03:42 News 03: Asus announces new ROG Ally 1:16:13 News 04: Sony backtracks on Helldivers 2 PC PSN requirements 1:25:25 News 05: Ninja Theory reveal Hellblade 2 PC specs 1:36:15 News 06: Sand Land released! 1:44:17 Supporter Q1: Could the Switch 2 connect to a TV wirelessly? 1:46:52 Supporter Q2: How much longevity will video cards with 8-10 GB of VRAM have? 1:53:41 Supporter Q3: Should next gen consoles have much larger RAM allotments? 1:57:40 Supporter Q4: Do you think mouse and keyboard support will become more common on consoles? 2:00:49 Supporter Q5: How feasible is it to update an older UE5 game to UE 5.4? 2:03:50 Supporter Q6: Could Intel beat AMD in discrete GPU sales by 2029? 2:06:40 Supporter Q7: What's your favourite immersive sim, and why is it Prey (2017)? Learn more about your ad choices. Visit megaphone.fm/adchoices

The Hardware Unboxed Podcast
Horizon Forbidden West, Dragon's Dogma 2, VRAM and Bad Optimization

The Hardware Unboxed Podcast

Play Episode Listen Later Mar 28, 2024 92:09


Episode 26: Steve and Tim have been "playing" (for work) Horizon Forbidden West and Dragon's Dogma 2 and share their thoughts on what the games are bringing to PC. Then we answer some listener mails on voting with your wallet and how the channel has been going in a relatively slow period.CHAPTERS00:00 - Intro00:31 - Testing Horizon Forbidden West and Dragon's Dogma 230:04 - Should Buyers Vote With Their Wallet?47:37 - Health of Hardware Unboxed Channel1:15:43 - Updates From Our Boring LivesSUBSCRIBE TO THE PODCASTAudio: https://shows.acast.com/the-hardware-unboxed-podcastVideo: https://www.youtube.com/channel/UCqT8Vb3jweH6_tj2SarErfwSUPPORT US DIRECTLYPatreon: https://www.patreon.com/hardwareunboxedFloatplane: https://www.floatplane.com/channel/HardwareUnboxedLINKSYouTube: https://www.youtube.com/@Hardwareunboxed/Twitter: https://twitter.com/HardwareUnboxed Hosted on Acast. See acast.com/privacy for more information.

The Hardware Unboxed Podcast
These Are The GPU Features YOU Care About

The Hardware Unboxed Podcast

Play Episode Listen Later Feb 9, 2024 80:29


Episode 21: After a brief discussion of the Ryzen 7 5700, we dive into the survey results to look at what GPUs you own, how often you upgrade, your thoughts on switching brands, and the graphics features you find important. Some very interesting results!CHAPTERS0:00 - Intro01:12 - AMD Ryzen 7 5700 Recap13:57 - Survey Results14:53 - GPU Market Share19:07 - How Much Did Your GPU Cost?21:48 - How Often Do You Upgrade?28:23 - Rasterization vs Ray Tracing Importance33:57 - How Much is Too Much for 8GB of VRAM?43:32 - Buying Same Brand Again vs Switching Brands54:55 - What GPU Features are Important1:08:43 - Updates From Our Boring LivesSUBSCRIBE TO THE PODCASTAudio: https://shows.acast.com/the-hardware-unboxed-podcastVideo: https://www.youtube.com/channel/UCqT8Vb3jweH6_tj2SarErfwSUPPORT US DIRECTLYPatreon: https://www.patreon.com/hardwareunboxedFloatplane: https://www.floatplane.com/channel/HardwareUnboxedLINKSYouTube: https://www.youtube.com/@Hardwareunboxed/Twitter: https://twitter.com/HardwareUnboxed Hosted on Acast. See acast.com/privacy for more information.

Broken Silicon
240. Nvidia RTX 4070 SUPER, Intel i9-14900KS Drama, RX 7600 XT 16GB, AMD R7 8700G

Broken Silicon

Play Episode Listen Later Jan 15, 2024 125:42


We have RTX 4070 SUPER Benchmarks, Intel i9 Leaks, and more to discuss!!! [SPON: Use "brokensilicon“ at CDKeyOffer $16 Win10: https://www.cdkeyoffer.com/cko/Moore10 ] [SPON: Get 10% off Tasty Vite Ramen with code BROKENSILICON: https://bit.ly/3wKx6v1 ] 0:00 Minnesota vs Tennessee Winters (Intro Banter) 6:17 PS3 MLAA, Intel Foundry Services, AMD Laptop Support (Corrections) 15:30 RTX 4070 SUPER Analysis 26:13 RTX 4080 SUPER & RTX 4070 Ti SUPER Announced 42:22 How Nvidia plans to push 4070 Ti Sales after SUPER "launches"... 45:34 RX 7600 XT Releases next to BAD Mobile RADEON Sales 58:56 How can AMD afford to give a $329 GPU 16GB of VRAM? 1:00:11 Ryzen 7 8700G Announced, Hawk Point gets Rapid Adoption 1:11:40 AMD Hawk Point Benchmarks vs Meteor Lake Claims 1:18:13 Lunar Lake & Arrow Lake Details (kinda) Announced 1:25:06 Did Intel Arrow Lake once have Hyper-Threading? 1:31:52 Intel i9-14900KS Drama Leak, APO comes to 13th & 12th Gen 1:38:21 4090D, 7800M, 3050 6GB, MSI Claw, ARM Windows Exclusivity (Wrap-Up) 1:49:26 AMD mandating OCuLink, Devs Thoughts on FSR, Vite Vitality (Final RM) https://videocardz.com/newz/nvidia-rtx-4070-super-ad104-gpu-features-48mb-of-l2-cache-not-36mb-as-claimed-earlier https://www.techspot.com/review/1865-geforce-rtx-super/ https://videocardz.com/newz/custom-geforce-rtx-4070-super-cards-appear-at-retailers-for-up-to-650 https://youtu.be/gA-eKbi1QWU?si=x71xqQuaxlJ5dSGy https://www.nvidia.com/en-us/geforce/news/geforce-rtx-4080-4070-ti-4070-super-gpu/ https://www.computerbase.de/2024-01/gaming-notebooks-fuer-amd-radeon-rx-7000m-war-die-ces-ein-desaster/ https://www.anandtech.com/show/21215/amd-adds-radeon-rx-7600-xt-to-product-stack-16gb-1080p-gaming-card-for-329 https://videocardz.com/newz/ayaneo-and-gpd-launch-first-handhelds-with-ryzen-7-8840u-processor https://videocardz.com/newz/gpd-to-update-all-handheld-products-with-amd-ryzen-7-8840u-apu https://www.tomshardware.com/pc-components/cpus/amd-launches-ryzen-8000g-phoenix-apus-brings-ai-to-the-desktop-pc-reveals-zen-4c-clocks-for-the-first-time https://videocardz.com/newz/intel-shows-off-lunar-lake-with-memory-on-package-reaffirms-its-2024-plans-for-lunar-arrow-lake https://twitter.com/OneRaichu/status/1744537140451844344 https://www.pcgamer.com/intel-to-roll-out-14th-gens-game-optimization-software-to-older-1213th-gen-hybrid-cpus-after-all/ https://videocardz.com/newz/alleged-intel-core-i9-14900ks-6-2-ghz-cpu-has-been-pictured https://twitter.com/9550pro/status/1742151746598944892 https://www.youtube.com/watch?v=BGZMOK9l2Dc&ab_channel=KitGuruTech https://videocardz.com/newz/nvidia-geforce-rtx-4090d-is-6-slower-than-rtx-4090-in-first-test-oc-support-limited https://videocardz.com/newz/shipping-manifests-reveal-amd-cuarzo-gpus-as-navi-3x-series-hint-at-navi-32-mobile-rx-7800m https://videocardz.com/newz/nvidia-geforce-rtx-3050-6gb-to-feature-2304-cuda-cores-and-70w-tdp https://videocardz.com/newz/msi-claw-gaming-handheld-leaked-features-intel-core-ultra-7-155h-with-arc-graphics-and-32gb-memory https://www.youtube.com/watch?v=S1R08Qx6Fvs&ab_channel=Windows https://videocardz.com/newz/amd-enables-fluid-motion-frames-afmf-for-integrated-radeon-700m-series-through-preview-driver https://www.tomshardware.com/pc-components/cpus/windows-on-arm-may-be-a-thing-of-the-past-soon-arm-ceo-confirms-qualcomms-exclusivity-agreement-with-microsoft-expires-this-year#:~:text=The%20exact%20date%20the%20exclusivity,coming%20from%20AMD%20and%20Nvidia 
https://www.bleepingcomputer.com/news/security/framework-discloses-data-breach-after-accountant-gets-phished/ https://www.youtube.com/watch?v=eONWY3kbZc0&ab_channel=DigitalTrends https://www.youtube.com/watch?v=S1R08Qx6Fvs&ab_channel=Windows https://www.howtogeek.com/what-is-oculink/ https://www.amd.com/en/product/14066

Broken Silicon
237. Nvidia 4080 SUPER, Ryzen 8000G, AMD Zen 5 Strix, RDNA 4 | Dawid Does Tech Stuff

Broken Silicon

Play Episode Listen Later Dec 26, 2023 127:14 Very Popular


Dawid joins to discuss 2023 & 2024 Hardware! [SPON: Support MLID w/ the Salad App: https://salad.com/download?utm_source=YouTube&utm_medium=video&utm_campaign=MLID ] [SPON: Go to my partner https://trymintmobile.com/mooreslaw to get premium wireless for as low as $15 a month - Buy 3 Months, Get 3 Months Free through 1/1/24!] [SPON: Get 10% off Tasty Vite Ramen with code BROKENSILICON: https://bit.ly/3wKx6v1 ] (Recorded 12/21/23) 0:00 How are RADEON sales? What's the worst product of 2023? 10:55 RTX 4080 SUPER Pricing - Has Nvidia learned the limits of milking? 25:53 AMD vs NVIDIA Software Features 39:57 What would make RDNA 4 Succeed? 52:20 Ryzen 8000G APUs on Desktop 1:00:22 8GB of VRAM in 2023, From Mining to AI Booms 1:14:50 When will RTX 4090 Prices return to normal? 1:19:22 Intel ARC 1:27:31 7800X3D Dominating Sales - Has DIY rejected E-Cores? 1:40:57 Intel's Future, Blackwell Segmentation 1:55:44 Favorite Products from 2023 Last Dawid Episode: https://youtu.be/8BGzTMz_BUg?si=JOWL1ECA0XKy7t3h First Dawid Episode: https://youtu.be/USSx-qHwWzI?si=O7VdJH_Yp3TOY-dE&t=6226 Dawid A380 Video: https://youtu.be/DirxAQtSnmA?si=E2B6c9XFtGqNxfFO https://youtu.be/yTu-IWf4LKU?si=5Rj61OT-MI4ELMoe https://youtu.be/1mE5aveN4Bo?si=0pQD68lr7tPYgDVM https://youtu.be/JIqoMyjmC5A?si=jMqUjO-YRcsYiqDy https://youtu.be/l5ZDOHI3IkM?si=Y1TjecHNoMNkhKQa https://youtu.be/3XlxEIIF22I?si=Pjp442uSParf75kB https://www.techpowerup.com/review/nvidia-geforce-rtx-4060-ti-founders-edition/42.html https://youtu.be/PqZmaMC-Kbo?si=x5HFNxo27wbjFCXl https://youtu.be/aH70c4S-XPk?si=mGn5dlH3qNLhR80w https://youtu.be/26a0Xl6eVt8?si=VuGV-ZNrr25N9Hyi https://youtu.be/AoADopUg0tA?si=3gUWACvdLYA73kpM https://www.techspot.com/review/2701-nvidia-geforce-rtx-4060/ https://youtu.be/nvjb8HPZbZU?si=X4iHODvLS2n2Z11U https://youtu.be/Isn4eLTi8lQ?si=_xC5aIFmbcL4F9Rq https://youtu.be/tmfHxJT1I3I?si=gc_SuWX5HzTSvZea https://youtu.be/zcJZrWS5xfc?si=-nzFEoqe5IB6kZqq https://youtu.be/8l6CsPzIZFw?si=Vy3FdmDW-qSH0Zb4 https://youtu.be/W5K8GM2fNDM?si=mZ1xvyoNW0sE_UF_ https://youtu.be/ek3y-rNOOc0?si=ffCLAxmHAim1ZQTs https://youtu.be/IOQxFZdTdDE?si=CW752FACreLoqT4p https://twitter.com/TechEpiphanyYT/status/1736061471333687676 https://twitter.com/TechEpiphanyYT/status/1733919739192197207 https://youtu.be/-pQEdpMCrdU?si=0QReCPKpGjF1-_uQ https://youtu.be/3FsUTYnQNOA?si=AbmMXtaMz3QtZ3dO https://youtu.be/ekCMnmD_EzA?si=0aGpSoXLggIOi9RP https://youtu.be/l44xorRKHfk?si=UreiGa1EJaCxT_dB https://youtu.be/R0x0DwdSXcE?si=JKSCbV3SUqPQk5AK https://youtu.be/siG6TJtLztc?si=wGl04abE2g9I0OXZ https://youtu.be/Ki2j11GcoU0?si=snubm3XSsixVm_vJ https://youtu.be/7cEKTr70YBY?si=rFjr6wDmJtm1br1B https://youtu.be/brgrCKKID28?si=qrjRNs-rl64TRf9x https://youtu.be/xoWckeP3YUY?si=8VIH4eOMWCKuTp2i https://youtu.be/EU0Ti9Gyayg?si=BrPBGjY5ZfpdbVzO https://youtu.be/DuNu9K7fDB8?si=nMFXnldv2WtEYYRA https://www.tomshardware.com/pc-components/cpus/pricing-leaks-for-three-amd-ryzen-8000g-desktop-am5-apus-ranging-from-dollar190-to-dollar400-new-am4-cpus-too https://videocardz.com/newz/amd-ryzen-8000g-desktop-apu-specs-leaked-by-asus-and-asrock https://www.techspot.com/review/2746-amd-radeon-7900-xtx-vs-nvidia-geforce-rtx-4080/

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
The Busy Person's Intro to Finetuning & Open Source AI - Wing Lian, Axolotl

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Dec 8, 2023 64:18 Very Popular


The Latent Space crew will be at NeurIPS on Tuesday! Reach out with any parties and papers of interest. We have also been incubating a smol daily AI Newsletter and Latent Space University is making progress.Good open models like Llama 2 and Mistral 7B (which has just released an 8x7B MoE model) have enabled their own sub-industry of finetuned variants for a myriad of reasons:* Ownership & Control - you take responsibility for serving the models* Privacy - not having to send data to a third party vendor* Customization - Improving some attribute (censorship, multiturn chat and chain of thought, roleplaying) or benchmark performance (without cheating)Related to improving benchmark performance is the ability to use smaller (7B, 13B) models, by matching the performance of larger models, which have both cost and inference latency benefits.Core to all this work is finetuning, and the emergent finetuning library of choice has been Wing Lian's Axolotl.AxolotlAxolotl is an LLM fine-tuner supporting SotA techniques and optimizations for a variety of common model architectures:It is used by many of the leading open source models:* Teknium: OpenHermes, Trismigestus, CollectiveCognition* OpenOrca: Mistral-OpenOrca, Mistral-SlimOrca* Nous Research: Puffin, Capybara, NousHermes* Pygmalion: Mythalion, Pygmalion* Eric Hartford: Dolphin, Samantha* DiscoResearch: DiscoLM 120B & 70B* OpenAccess AI Collective: Manticore, Minotaur, Jackalope, HippogriffAs finetuning is very formatting dependent, it also provides prompt interfaces and formatters between a range of popular model formats from Stanford's Alpaca and Steven Tey's ShareGPT (which led to Vicuna) to the more NSFW Pygmalion community.Nous Research MeetupWe last talked about Nous at the DevDay Recap at the e/acc “banger rave”. We met Wing at the Nous Research meetup at the a16z offices in San Francisco, where they officially announced their company and future plans:Including Nous Forge:Show NotesWe've already covered the nuances of Dataset Contamination and the problems with “Open Source” in AI, so we won't rehash those topics here but do read/listen to those if you missed it.* Axolotl GitHub and Discord* The Flan paper and dataset* StackLlama model and blogpost* Multipack paper* Our episode with Tri Dao* Mamba state space models - Tri Dao and Albert GuTimestamps* [00:00:00] Introducing Wing* [00:02:34] SF Open Source AI Meetup* [00:04:09] What is Axolotl?* [00:08:01] What is finetuning?* [00:08:52] Open Source Model Zoo* [00:10:53] Benchmarks and Contamination* [00:14:29] The Case for Open Source AI* [00:17:34] Orca and OpenOrca* [00:23:36] DiscoLM and Model Stacking* [00:25:07] Datasets and Evals over Models* [00:29:15] Distilling from GPT4* [00:33:31] Finetuning - LoRA, QLoRA, ReLoRA, GPTQ* [00:41:55] Axolotl vs HF Transformers* [00:48:00] 20x efficiency with StackLlama and Multipack* [00:54:47] Tri Dao and Mamba* [00:59:08] Roadmap for Axolotl* [01:01:20] The Open Source AI CommunityTranscript[00:00:00] Introducing Wing Lian[00:00:00] ​[00:00:00] swyx: Welcome to Latent Space, a special edition with Wing Lien, but also with our new guest host, Alex. Hello, hello. Welcome, welcome. Again, needs no introduction. I think it's like your sixth time on Latent Space already. I think so, yeah. And welcome, Wing. We just met, but you've been very prolific online. Thanks for having me.[00:00:30] Yeah. So you are in town. You're not local. You're in town. You're from Minneapolis?[00:00:35] Wing Lian: Annapolis. Annapolis. 
It's funny because a lot of people think it's Indianapolis. It's I've got Minneapolis, but I used to live out at least in the San Francisco Bay Area years ago from like 2008 to 2014. So it's fairly familiar here.[00:00:50] swyx: Yep. You're the maintainer of Axolotl now, which we'll get into. You're very, very prolific in the open source AI community, and you're also the founder of the Open Access AI Collective. Yeah. Cool. Awesome. Maybe we can go over a little bit of your backgrounds into tech and then coming into AI, and then we'll cover what[00:01:06] Wing Lian: happens and why you're here.[00:01:08] Yeah. So. Back on tech, so I started years ago, I started way back when I was scraping, Apartment websites for listings and then, and then building like SEO optimized pages and then just throwing Google AdSense on it.[00:01:24] And that got me through like college basically. Is[00:01:27] swyx: that decent money? And what year[00:01:28] Wing Lian: was this? Like 2004, 2005. Yeah, that's decent money. It's like thousand bucks a month. But as a college student, that's like. Gravy. Really good money, right? So, and then there's just too much competition It's just sort of like died off. I was writing stuff in like Perl back then using like like who nobody hosted anything on Perl anymore, right? Still did a little bit more like computer tech support and then software, and web more professionally.[00:01:54] So I spent some time working on applications in the blood industry. I came out to San Francisco for, I was at SGN, so Social Gaming Network, as a startup. They started doing, with Facebook apps, and then they pivoted into doing mobile apps. And then, from there, I spent time.[00:02:14] I've quite a few more startups since then and in the last few years I've been in the music space So like I was at United Masters for a while and then past year I've been at SoundCloud, but not doing that anymore and now that I have a lot more time It's just like all right.[00:02:30] We're going full bore on axolotl and we're gonna we're gonna crush AI So yeah,[00:02:34] SF Open Source AI Meetup[00:02:34] swyx: totally you so you're here in town for the open source. Yeah, I meet up that we had yesterday Yep, yeah, that was amazing. Yeah, it was a big collection. Olama, Noose Research, Alignment Lab, Anyone else that I missed? I mean, Jeremy Howard is his own thing.[00:02:47] Yeah.[00:02:49] And Alex, you're also there. You love to bring SF to the world. Your takes?[00:02:55] Alex Volkov: It's incredible that we recorded a Thursday Eye episode after that one. And LDJ, who's usually co hosts Thursday Eye, just like briefly mentioned, Oh yeah, I talked about it.[00:03:04] Like, I saw Karpathy, and then I talked to Jeremy Howard, and the guy from Mistral came in, and it's like, He's talking about all these, titans of industry, basically, that outside of SF, You just don't meet casually hanging out in the same space. You can't, pull somebody. He ran into the Laylow from Mistral, he ran into him while, drinking water.[00:03:20] He didn't even know he was there. It's just, that type of stuff is really hard to find outside of SF. So, absolutely, absolutely great. And also, presentations from Alignment Labs, presentations from News Research, news issues, talked about. Forge, and some of[00:03:33] swyx: the other stuff they announced. We can say now they're officially a company.[00:03:36] I met Technium.[00:03:37] He[00:03:37] Alex Volkov: came over here. He didn't want to get recorded. 
But maybe.[00:03:41] Wing Lian: We'll wear him down at some point. Yeah, I'm excited for Forge. They've positioned it as this agentic sort of framework where it's just Drag and drop things and, fill in text with where you want to inject different variables and it opens up all of these potentials for data pipelines now, right?[00:03:56] And using your own local LLMs and not relying on GPT 4 or anything like that. Yeah, yeah,[00:04:02] swyx: good stuff. Okay, so let's maybe go into the Axolotl origin story and then we have, we have some intro or background.[00:04:09] What is Axolotl?[00:04:09] swyx: To do on like the open source model universe and also on fine tuning, but maybe just, since you're talking about your personal journey, what was your personal journey into[00:04:18] Wing Lian: axolotl?[00:04:19] Yeah, so my personal journey started like back in mid March, completely unrelated to AI and axolotl. And it really started, I fell while skiing, I torqued. Great 3 MCL sprain and being sort of like an active person that can no longer be active because the two, couldn't play soccer, because that is requires to have having knees until I, it's healed.[00:04:42] So I. I decided I needed to find something to do to take up my free time. And that became, well, let's learn how to train in, these language models. It was everywhere. So I was like, all right, I'm just going to sit down, learn. I think I used like other, I think I was using like Alpacalora.[00:05:00] Cause I think the Alpaca paper had just came out, come out then. So I was like using Alpacalora repo and sort of like learning how to use like. None of us were like GPU rich back then, and none of us, most of us still we're still all GPU poor, but I was doing what was it, like 4 bit, Alpaca Lord, there was like a 4 bit version where we were doing quant, or 8, no, 8 bit quantizations, and then I think they had released QLOR a little bit later, and I think right when, before QLOR came out, I was already starting to do fine tunes, but having this need to sort of like mix data sets together, and If you've ever looked at all the various different datasets available on HuggingFace, they all have various different prompt formats, and, it's sort of a nightmare, and then I think the other piece is if you've ever tried to fine tune, at least Back then probably the ecosystem's a little better now.[00:05:54] Everybody required that you say, alright, you put your hyperparameters as command line arguments. And so it's always like, well, I now have to go copy and paste my previous thing and to change things out. And I really wanted it. to be in a YAML file because it was more portable and reproducible.[00:06:09] So I was doing that and then the QLOR paper came out. Tim Dettmer announced that and then somebody looked it up for me yesterday and it's like between that announcement it took us seven days to get that integrated into Axolotl, right? Which is like, it's not. I wouldn't say it's really fast, but in a manner that, is in a, a reusable framework, I think it was quite the accomplishment then.[00:06:33] And so we started, picking up traction with people there. And then it's just been building models, and then just iterating what my needs are. So, yeah. Excellent. Yeah. I[00:06:44] Alex Volkov: want to ask, for folks who are listening who never heard of Axolotl, now do you describe how you got there?[00:06:49] Can you, how do you summarize this for folks who maybe haven't fine tuned anything. 
They know about open source LLM exists, they maybe know like LLAML, what's XLR for somebody who doesn't know. I've never heard of a data set curation[00:07:01] Wing Lian: creation before. We sort of have to take a step back and understand that, when you've got these language models, you have what I think most people refer to as like base models, also known as like foundational models, right?[00:07:15] Where some benefactor, whether it's Meta or Mistral or whoever, has gone and spent all this money. To train these models on huge corpuses of text, right? And these, these corpuses, they're generally good across lots of different things, but they're really good at just saying, talking on and on and on, but they're not good at, following instructions or having chats or anything like that.[00:07:40] So, when you think about fine tuning, it's like Saying, all right, we have this really sort of good generalized, text completion thing, and I want to turn it into something that I can talk to or have, follow instructions. So, I think fine tuning is probably best defined in like that.[00:07:58] swyx: Okay, got it.[00:07:59] And we actually[00:08:01] What is finetuning?[00:08:01] swyx: Do want to make sure that we have like an overall introduction to fine tuning for people because again like trying to make sure that we bring everyone along in this, in this journey. We already went into Loras and QLoras without explaining what[00:08:12] Wing Lian: they are. Oh yes, yes, sorry.[00:08:14] swyx: And so I will put things in my words and you can correct me as, as, as my I'll be the village idiot here.[00:08:21] So, so fine tuning is basically sort of grabbing an open source model off the shelf, and then basically doing further training on it with a custom dataset of your own. Primarily, people use it, think about it as fine tuning for JSON output, or fine tuning for a style of response. Let's say you wanted to tell jokes, or be funny, or be short, or whatever.[00:08:43] Just the open source AI community has really fine tuned in all sorts of different manner. I think we'll go over those those things now. Let's go over those things now, and then we'll talk about fine tuning methods.[00:08:52] Open Source Model Zoo[00:08:52] swyx: So there's a universe of people who fine tune stuff. Yesterday in your slides, you had, I'll just list some of these and then we'll maybe go through some of them, right?[00:08:59] So Technium is personally leading Open Hermes, which is I think the sort of premier model out of the news. news community. There's OpenOrca, which you had a hand in. News, the news research itself also has Capybara and Puffin and all the others. There's Pygmalion, which I've never messed with.[00:09:14] Eric Hartford, I am aware of his Uncensored Models and his Samantha Models. Disco Research with Disco LM. And then you personally have done Manticore, Minotaur, Jackalope, and Hippogriff. What should people know about all these names? Being part of AI Twitter is seeing all these things and going dude, I'm being DDoS'ed by all these things and I don't know how different they are.[00:09:32] What should people know? Yeah, so[00:09:34] Wing Lian: I think on a lot of these models, generally, we like to think of those as sort of general models, so If you think about it, what is GPT 4, what is Chad GPT? 
It's a good general model, and then, One of the services I think that OpenAI offers is like these fine tunings where you're a business and you have very specific business use cases and you might fine tune for that use case.[00:10:00] All of these models are really just general use case that you can then go and maybe Fine tune another lore over it for your use cases, but they tend to be good. With good being relative, it's open source. Open source AI is still sort of is infancy. So, good is, it's pretty reasonable.[00:10:18] It's probably still better than most, high schoolers at answering questions and being able to like figure things out and, and reasoning skills and math and those sorts of things, right?[00:10:27] swyx: And also as measured on the Hugging[00:10:29] Wing Lian: Face leaderboard. Yes, well, that's like a whole other discussion, right, there's a whole other, group of people who, and I, I mostly agree with them that, benchmarks can be, are pretty bogus these days, LM says, I think they published something recently where, even if you think the dataset's not contaminated, you can go and, find contamination And maybe we should step back and say what contamination is, right?[00:10:53] Benchmarks and Contamination[00:10:53] Wing Lian: So we have all of these data, when you go and do these benchmarks, there's a specific data set where there are these questions and usually it's multiple choice. And what can happen is, well, sometimes someone It puts the question, maybe maliciously, maybe accidentally, into the training dataset, and now the, the, your model knows how to answer the test questions really well, but it doesn't, it hasn't generalized the ability to actually do that[00:11:20] Alex Volkov: right.[00:11:21] We've seen some folks competitively announce models that are like the best at that leaderboard, but then it's, it's quite obvious that, In open source? Yeah, and in that leaderboard, for Hugging Face specific, I don't know if LMCs, if that had suffered, but we, there's been some models that seem to have been competitively trained and some leakage happened into their,[00:11:41] swyx: like, supposal.[00:11:43] I understand, once there's been a credible assertion, Hugging Face actually does take them down, right? Yeah, yeah,[00:11:48] Alex Volkov: which is really hard to know, right?[00:11:50] swyx: It's really hard to know, sometimes it's like a pure accident,[00:11:52] Alex Volkov: it's oh, oops. You're going through a mixer. I think, a responsible So acknowledgement, that this kind of happened to you is also important.[00:11:58] I saw LDJ from news research can acknowledge that. Because many of these datasets are collections of other datasets. There's a bunch of people are baking, basically. It's alchemy. Right. And so sometimes you don't know. Sometimes you pull an open source dataset and they announce, oh, you know what, actually, the MMLU benchmark which we used to Specifically identify models that did go into this data set, that then went into that data set.[00:12:22] So sometimes it's actually an accident and folks take it down. But I've seen some competitive folks who want to put their name out there because people are starting to notice which is the top[00:12:30] swyx: model. For those who want a fun take on this so the file one dataset. FindOne model from Microsoft was accused of being contaminated.[00:12:37] And I saw this joke paper that was fantastic. It was called, training on the test set is all you need. 
It's a super small model that just memorizes everything. It was fantastic. So yeah, contamination, I think we've actually covered it in a previous episode before. So we're good. But again, I want to give people a map into the open source AI model, the universe.[00:12:57] And Alex, you can also jump in here because you guys have spent a lot more time with them than I have. So, what should people know about Technium? What should people know about Noose? And then we can go down the list. Yeah,[00:13:05] Wing Lian: I think so. I think if we start with, Technium. When you talk to him, he's gonna say, I think, I think his response is that he wants to build GP4 on his laptop, right?[00:13:14] So, very, very good at building general models. I think with Noose, Noose Research, they're looking at more, sort of, More, more research focused things, like their Yarn models, I don't, I don't, they didn't actually train their, they have their own trainer for their Yarn models, but So they did not use Xlato for that one?[00:13:30] They didn't use that, but like Is that, you don't have support for it? I think we do support Yarn, I think, I'd have to double check that answer. Yeah, I'm just kind of curious what you can and cannot support, and Yeah, I mean, Yarn is supportable, it's basically, I think it's just replacing, I think, the rope part of that, so Yeah, not, not a big deal.[00:13:48] Yeah, it's not a big deal, it's just I haven't gotten to it, not enough people have asked, I think a lot of people have asked for other things, so it's just, squeaky wheel, right? I think at the end of the day, people are like building these data sets and I think if you sort of map things chronologically, these make more sense because it's like, how do we incrementally improve all of these models?[00:14:07] So a lot of these models are just incremental improvements over the last thing, right? Whether it is sort of through methods of how do we, how did we curate the data set? How did we improve the quality of the data set? So, you maybe LDJ talked about it right on I think for, for Capybara and Puffin, like how those, those were very specific dataset curation techniques that he works on.[00:14:29] The Case for Open Source AI[00:14:29] Alex Volkov: So there's, folks are doing this for dataset curation. Folks are doing this for skillset building as well. Definitely people understand that open source is like very important, especially after the, the, the, the, the march, the debacle, the OpenAI weekend that we all had. And people started noticing that even after developer day in OpenAI, the APIs went out.[00:14:48] And then after that, the whole leadership of the company is swiftly changed and people, there was worries about, you know. How can people continue building AI products based on these like shaky grounds that turned attention definitely to Technium at least in open RMS I started seeing this more and more on Twitter, but also other models and many companies They're gonna start with open AI just to get there quick, and then they they think about okay Maybe I don't want to share my knowledge.[00:15:13] Maybe I don't want to sign up for Microsoft. Maybe they will change their terms and conditions so What else is out there? They turned to other companies. Up until yesterday, Google was nowhere to be found. We've talked about Gemini a little bit before in a previous And you can tune in[00:15:26] swyx: to[00:15:26] Alex Volkov: Thursday Eye.[00:15:26] Yeah, you can tune in to Thursday Eye. 
We covered the Gemini release a little bit. And but many are turning into the open source community and seeing that Meta released and continues to release and commit to open source AI. Mistral came out and the model is way smaller than LLAMA and performs Significantly better.[00:15:43] People play with OpenRMS, which is currently techniums based, news researched, sourced, axolotl trained OpenRMS, I assume, right? And then they play with this and they see that, okay, this is like GPT 3. 5 quality. We had GPT 4. 5 birthday just a week ago. A week ago, a year ago, a week ago, we never, interacted with these models of this caliber.[00:16:04] And now there's one open source, one that's on my laptop, completely offline, that, I can continue improving for my use cases. So enterprises, companies are also noticing this. And the open source community folks are building the skill set, not only the data sets. They're building the actual kind of, here's how we're going to do this, with Axelotl, with these data sets.[00:16:21] The curation pieces. Now. Interesting. There's like recipes of curation. The actual model training is kind of a competitive thing where people go and compete on these leaderboards that we talked about, the LMC arena, and that recently added open air and recently added open chat and a bunch of other stuff that are super cool.[00:16:37] The hug and face open source leaderboard. And so there's a competitive aspect to this. There's the open source. Aspect to this, like Technium says, I want GPT 4 on my laptop. There's the, let me build a skill set that potentially turns into a company, like we saw with Noose. Noose just, started organizing, a bunch of people on Discord, and suddenly, they're announcing their company.[00:16:54] It's happening across all these modalities, and suddenly all these people who saw these green pastures and a fairly quick way to, hey, here's a cool online community I can, start doing cool stuff with. You mentioned the same in the beginning, right? Like, after your accident, what's cool, let me try this out.[00:17:08] Suddenly I start noticing that there's a significant movement of interest in enterprising companies into these areas. And, this skill set, these data sets, and this community is now very Very important, important enough to create an event which pulls in Andrei Karpathy from OpenAI to come and see what's new Jeremy Howard, like the event that we just talked about, people are flying over and this is just a meetup.[00:17:28] So, definitely, the community is buzzing right now and I think Axelot is a big piece as well.[00:17:34] Orca and OpenOrca[00:17:34] Wing Lian: Cool. Maybe we can talk about like Orca real quick, Orca, OpenOrca rather, I think there was a lot of buzz when, the first Orca paper came out. And just briefly, what is Orca? Yeah, Orca was basically having traces of like chain of thought reasoning, right?[00:17:48] So they go and they, they distill sort of GPT 4. They take, they take a sampling of data from the Flan dataset. Maybe we can like add some show notes in the Flan dataset. Yeah, but we've covered it. Okay, cool. Use GPT 4 to say, all right, explain this in a step by step reasoning, right?[00:18:06] And then you take that and you, they train the model and it showed, very good improvements across a lot of benchmarks. So OpenOrca was sort of the open reproduction of that since Microsoft Research never released that particular data set. 
And going back to sort of the Hugging Face leaderboard thing, those models did really well.[00:18:23] And then I think, so sort of the follow up to that was SlimOrca, right? I think Going into and building the OpenOrca dataset, we never really went in and, validated the actual answers that GPT 4 gave us, so what we did was one from OpenChat actually cross referenced the original Flan, the original Flan response, the human responses, the correct answers with the dataset, and then I went and took it and sent all of, both of them to GPT 4 and said, is this answer mostly correct, right?[00:18:54] Yeah. And then we were able to filter the dataset from, At least of the GPT 4 only answers from like 800, 000 to like 500, 000 answers or rows and then, and then retrain the model and it had the same performance as the original model to within I think, 0. 1 percent here about, and 30 percent less data.[00:19:13] So, yeah. Okay.[00:19:15] swyx: Interesting. So, I mean, there's, there's so much there that I want to highlight, but yeah. Orca is interesting. I do want people to know about it. Putting chain of thought into the data set like it's just makes a ton of sense one thing I think it would be helpful for people to scope thing these things out is how much data are we talking about when when you When people are fine tuning and then how much time or resources or money does it take to train to fine[00:19:36] Wing Lian: tune?[00:19:37] Yeah, so I think there's a little bit of overlap there with sort of like fine tuning techniques, but let's say Orca and I think even Hermes, they're both relatively large data sets like 10 billion tokens. Yeah. So large data sets being or the original Orca was, or the original open Orca was 800,000 rows.[00:19:55] I believe it was somewhere in the ballpark of like a gigabyte of data, of gigabyte, of text data. And I, I don't. I believe, Hermes was, is like a quarter million rows of data, I don't know the actual byte size on that particular one. So, going and training a, let's, let's say everybody's training 7 billion Mistral right now, right?[00:20:15] So, to tri I, I believe to fine tune 7 billion Mistral on, let's say, 8 A6000s, which have 48 gigabytes of VRAM, I believe, It takes about 40 hours, so 40, and then that's, depending on where you get your compute, 40 times 6, so it's like 500 to fine tune that model, so, and, and that's assuming you get it right the first time, right?[00:20:44] So, you know.[00:20:45] swyx: Is, is that something that X. Lotto handles, like, getting it right the first[00:20:48] Wing Lian: time? If you talk to anybody, it's like you've probably tried at least three or four runs or experiments to like find the right hyperparameters. And after a while you sort of have a feel for like which, where you need your hyperparameters to be.[00:21:04] Usually you might do like a partial training run, do some benchmark. So I guess for Al Farouk, whether you're going by his. This is Jeremy, he's, his actual name, or his twitter handle. He released the Dharma dataset, which is basically a subset of all the benchmarks. 
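To put some numbers behind the cost estimate Wing gives a few lines above, here is a rough back-of-envelope sketch in Python. The GPU count and wall-clock time come from the transcript (8x A6000 with 48 GB of VRAM each, roughly 40 hours); the per-GPU hourly rate is an assumption of ours, not a figure from the episode, chosen only to show how a single run lands in the quoted ~$500 ballpark before you account for the three or four repeat experiments he mentions.

# Back-of-envelope fine-tuning cost, using the figures quoted in the transcript
# (8x A6000, ~40 hours) and an ASSUMED rental rate per GPU-hour.
num_gpus = 8
hours = 40
usd_per_gpu_hour = 1.50  # assumption: a typical marketplace rate for an A6000

gpu_hours = num_gpus * hours                 # 320 GPU-hours for one run
cost_per_run = gpu_hours * usd_per_gpu_hour  # ~480 USD, i.e. the ~$500 ballpark

print(f"{gpu_hours} GPU-hours, ~${cost_per_run:.0f} per run (before repeat experiments)")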
And Axolotl actually supports, you know taking that subset and then just running many benchmarks across your model every time you're doing an evaluation so you can sort of like see sort of relative it's not going to be the actual benchmark score, but you can get ideas alright, is this benchmark improving, is this benchmark decreasing, based on, you know Wait,[00:21:39] swyx: why don't you run the full benchmark?[00:21:41] What, what, what The[00:21:42] Wing Lian: full benchmarks take Take a long time. Significant, yeah, significant amount of time. Yeah. And Okay, so that's like[00:21:48] swyx: mini MMLU. Yeah. Like,[00:21:49] Wing Lian: mini BigBench or whatever. Yep, exactly.[00:21:51] Alex Volkov: It's really cool. We, when I joined Web2Masters just recently, and one of the things that I try to do is hey I'm not, I'm a software engineer by trade, I don't have an MLE background, But I joined a company that does primarily MLE, and I wanted to learn from the community, Because a lot of the open source community, they use weights and biases, And the benchmark that you said that Pharrell did, remind me of the name, sorry.[00:22:13] Dharma? Dharma, yeah, yeah. So Luigi showed me how Dharma shows inside the dashboard. In Wi and Biases dashboard and so you can actually kinda see the trending run and then you can see per each kind of iteration or, or epoch or you can see the model improving trending so you can on top of everything else.[00:22:29] The wi and biases gives like hyper parameter tracking, which like you, you started with common line and that's really hard to like remember. Also the Dharma data set, like the quick, the mini orca mini, you mini many different things. It's pretty cool to like visualize them as well. And I, I heard that he's working on a new version of, of Dharma, so Dharma 2, et cetera.[00:22:47] So hopefully, hopefully we'll see that soon, but definitely it's hard, right? You start this training around, it said like 40, 50 hours. Sometimes, sometimes it's like your SSHing into this machine. You, you start a process, you send it with God and you just go about your day, collecting data sets, and then you have to return.[00:23:04] And the whole process of instrumentation of this is still a little bit like squeaky but definitely. Tuning performance, or like grabbing performance in the middle of this, like with Dharma and some other tools, is very helpful to know that you're not wasting precious resources going somewhere you shouldn't go.[00:23:21] Yeah.[00:23:22] swyx: Yeah. Very cool. Maybe I'll, I'll, before we go into like sort of more details on fine tuning stuff, I just wanted to round out the rest of the Excel autoverse. There's, there's still Eric Hartford stuff. I don't know if you want to talk about Pygmalion, Disco, anything that you know about[00:23:35] Wing Lian: those, those things.[00:23:36] DiscoLM and Model Stacking[00:23:36] Wing Lian: Yeah, I think like one of the, definitely one of the more interesting ones was like the Disco 120b, right? Yeah, I know nothing about it. Yeah. So, so. Alpen from Pygmalion AI, right, so they, so Pygmalion is a sort of a, it's, it's, they have their own community, a lot of it is based around, roleplay models, those sorts of things, and Alpen, like, put together, merged together Llama270B, so, and Alpen, like, put together, merged together Llama270B, so, I don't remember how he stacked them together, whether he merged the layers in between. 
There's a whole, there's a whole toolkit for that by Charles Goddard, where you can like take a single model and like stack them together or multiple models merge.[00:24:18] That's like a whole other talk and a whole other tool set, but was able to create this 120. Billion parameter model out of a LAMA two 70 B. And then I believe the, yeah, disco is a fine tune of, of the, the, the sort of the base one 20 B is, I believe Goliath one 20 B. So, and, and what are the[00:24:37] swyx: headline results that people should know about[00:24:39] Wing Lian: disco?[00:24:39] I think for the headline results, I, I've, I haven't played with it personally because it's. It's a very large model and there's a lot of GPU, right? But, like, from what I've heard anecdotally, it performs really well. The responses are very good. Even with, like, just, even the base model is a lot better than, Llama70b.[00:24:57] So, and we, I think generally everybody's like, we would all love to fine tune Llama70b, but it's just, it's so much, it's so much memory, so much compute, right?[00:25:07] Datasets and Evals over Models[00:25:07] Wing Lian: I[00:25:07] Alex Volkov: want to touch on this point because the interesting thing That comes up out of being in this ecosphere and being friends with open source folks, tracking week to week state of the art performance on different models.[00:25:19] First of all, a lot of the stuff that the folks do a couple of weeks ago, and then something like Mistral comes out, and a lot of the stuff back then, Doesn't technically make sense anymore. Like the artifacts of that work, the actual artifacts, they don't no longer make sense. They're like lower on the on, on the hug and face leaderboard or lower on LM CS leaderboard.[00:25:36] But some of the techniques that people use, definitely the datasets. The datasets keep traveling, right? So open airmen, for example, is the dataset. The tum cleaned up for only. Open sourceable data that previously was just Hermes. And that, it was previously used to train Lama. And then once Mistral came out, it was used to train Mistral.[00:25:54] And then it became significantly better on the 7b base Mistral. So the data sets keep traveling, keep getting better a little bit here and there. And so the techniques improve as well. It looks like both things are simultaneously true. The artifacts of a month and a half ago. The, the actual models themselves, it's great the hug and face has them, because not every company can keep up with the next weeks', oh, I, I'll install this model instead, sell this model instead.[00:26:19] But the, the techniques and the, the dataset keep improving as we go further, and I think that's really cool. However, the outcome of this is that for a long time. For many, many people, including us, that we do this every week. We literally talk with people who release these models every week. It's really hard to know.[00:26:36] So, there's a few aspects of this. One, I think, like you said, the bigger model, the 70B models, you actually have to have somebody like Perplexity, for example, giving you access to the 70B really fast. Or you have to, like, Actually, find some compute, and it's expensive, especially for the bigger models. For example Falcon 180B came out, like the hugest open source model.[00:26:56] How do you evaluate this if you can't run it? Nobody liked it. 
It's really, so first of all, nobody liked it, but secondly, only the people who were able to find compute enough to run inference on this, they only had like, I can't run this on my laptop, and so that's why it's much easier, something like OpenRMS 7 to be, 7B, it's much easier, because you can run this on your MacBook.[00:27:14] It's much easier to evaluate. It's much easier to figure out the vibes, right? Everybody talks about the vibes as an evaluation check. If you're plugged in enough, if you follow the right people, if they say pretty much the same things all independently, then you run into a problem of whether they're repeating, and their stochastic parents are repeating the same thing, or they actually evaluated themselves.[00:27:31] Yeah, you never know. But, you never know, but like, I think on a large enough scale on Twitter, you start getting the feel. And we all know that like, OpenRMS is one of the top performing models, benchmarks, but also vibes. And I just wanted to highlight this vibes checks thing because you can have the benchmarks, you can have the evaluations, they potentially have contamination in them, potentially they not necessarily tell you the whole story because some models are good on benchmarks, but then you talk to them, they're not super helpful.[00:28:00] And I think it's a combination of the benchmarks, the leaderboards, the chatbot, because LMSys, remember, their ranking is not only based on benchmarks, it's also people playing with their arena stuff. People actually like humans, like, get two answers. I think they completely ignore benchmarks. Yeah, and then They only do ELO.[00:28:18] Oh, they do ELO completely, right? So that, for example, is just like people playing with both models and say, Hey, I prefer this one, I prefer that one. But also there's like some selection bias. The type of people who will go to LMCs to play with the models, they're a little bit specific in terms of like who they are.[00:28:33] It's very interesting. There's so many models. People are doing this in this way, that way. Some people are doing this for academic rigor only to test out new ideas. Some people are actually doing this like the Intel fine tunes of Mistral. Intel wanted to come out and show that their hardware approach is possible, Mistral, etc.[00:28:51] And it's really hard to know, like, what to pick, what to use. And especially on the bigger models, like you said, like the Llama 70B, the Falcon 180B. It's really because, like, who has the compute to validate those? So I would mention that, like, use with caution. 
Like, go and research and see if the biggest model that just released was actually worth the tokens and the money you spend on it.[00:29:12] To try and, if you're a business, to integrate it.[00:29:15] Distilling from GPT4[00:29:15] swyx: Since you said use of caution, I'll bring in one issue that has always been in the back of my mind whenever I look at the entire universe of open source AI models, which is that 95 percent of the data is derived from GPC 4, correct?[00:29:30] Which technically you can't use for commercial licenses,[00:29:34] Wing Lian: right?[00:29:35] swyx: What is the community's stance on this kind of stuff?[00:29:40] Wing Lian: I think from the community stance, like I feel like a lot of us are just experimenting, so for us, it's like, we're not going and building a product that we're trying to sell, right?[00:29:49] We're just building a product because we think it's interesting and we want to use it in our day to day lives, whether or not we try and integrate it. Personal use, yeah. Yeah, personal use, so like, as long as we're not selling it, yeah, it's fine. But[00:30:01] swyx: like, I as a company cannot just take OpenHermes and start serving[00:30:05] Alex Volkov: it and make money on it.[00:30:06] OpenHermes you can. Because the opening of OpenHermes, I think, is a clean up. That did after the regular Hermes, please folks, check your licenses before you listen to podcasts and say, Hey, I will tell you though, you could say the same thing about OpenAI. You could say the same thing kind of makes sense, where OpenAI or StabilityAI trains their diffusion model on a bunch of pictures on the internet, and then the court kind of doesn't strike down Sarah Silverman, I think, or somebody else, who came and said, hey, this has my work in it, because of the way how it processes, and the model eventually builds this knowledge into the model, and then it doesn't actually reproduce one to one what happened in the dataset.[00:30:45] You could claim the same thing for open source. Like, we're using And by we, I mean the, the open source community that I like happily report on uses GPT 4 to rank, for example, which is the better answer you, you, that's how you build one, one type of data set, right? Or DPO or something like this, you, you basically generate data set of like a question and four answers, for example, and then you go to GPT 4 and say, Hey, smartest model in the world right now, up to Gemini Ultra, that we should mention as well.[00:31:11] Which one of those choices is better? But the choices themselves are not necessarily written with GPT 4. Some of them may be, so there's like full syntactic datasets. But there's also, datasets are just ranked with GPT 4. But they're actually generated with a sillier model, or like the less important model.[00:31:25] The lines are very blurry as to what type of stuff is possible or not possible. And again, when you use this model that's up on Hug Face, the license says you can use this. OpenAI is not going to come after you, the user. If anything, OpenAI will try to say, hey, let's prevent this, this type of thing happening, and the brain, but I honestly don't think that they could know even, not that it makes it okay, it's just like, They also kind of do this with the Internet's archive, and also, I think that some of it is for use.[00:31:55] You use models to help you augment tasks, which is what GPT 4 lets you do.[00:32:00] swyx: Yeah, the worst thing that OpenAI can do is just kick you off OpenAI. 
That's because it's only enforced in the terms of service.[00:32:05] Alex Volkov: Sure, but just like to make sure, to clarify who they're going to kick out — they could kick out, like, Nous, for example, if Nous are abusing their service. A user of the open source, fully Apache 2 open source, for example — they won't get kicked out just because they use both.[00:32:22] I don't believe so. I don't think OpenAI has a claim for that.[00:32:25] swyx: Well, we're not lawyers, but I just want to mention it for people to know it's an issue.[00:32:30] Wing Lian: And one of the things, like, I talked to someone recently, and I think that they also are, like, interested in it, but also to the point of, like, right — if I use a model trained on data, using GPT-4 data, but I use that model to then regenerate new data,[00:32:46] is that model, is that data okay? So like you start going down this whole rabbit hole. So yeah. All right.[00:32:53] swyx: Fantastic. Cool. Well, I think that roughly highlights most of the open source universe. You also have your own models. Do you want to shout out any one of them? Yeah.[00:33:01] Wing Lian: I mean, I think like, I think early on, Manticore got a lot of love.[00:33:04] I think it was mostly popular in, like, the roleplay communities. It tended to be pretty truthful. It tended to, like, have relatively good answers, depending on who you ask, right? But, I think for me, it was just — releasing models was a way to try and, like, continue to build out the product, figure out what I needed to put into the product, how do I make it faster, and, if you've got to, like, go and debug your product, you may as well have it do something useful.[00:33:29] Awesome. So, yeah.[00:33:31] Finetuning - LoRA, QLoRA, ReLoRA, GPTQ[00:33:31] swyx: Okay, and then maybe we'll talk about just fine tuning techniques. So this is going to be a little bit more technical than just talking about model names and datasets. So we started off talking about LoRA, QLoRA. I just learned from your readme there's ReLoRA, which I've never heard about.[00:33:45] Could you maybe talk about, like, just parameter efficient fine tuning, that whole, that[00:33:50] Wing Lian: whole journey, like, what people should know. Yeah, so with parameter efficient fine tuning, I think the popular ones — again, let's, we'll start with LoRA, right? So, usually what you do is you freeze all the layers on your base, on the base model, and then you, at the same time, you sort of introduce additional — oh, this is tight.[00:34:08] No.
You introduce another set of layers over it, and then you train those, and it is done in a way that is mathematically possible, particularly with LoRA, that you can — then, when you train the model, you run your inputs through the base model, whose weights are frozen, but then you also run them through the additional weights, and at the end you combine the weights to get your outputs, and when you're done training, you're left with this other set of weights, right, that are completely independent. And then from that, what you can do is — some person smarter than I figured out — well, oh, they've done it in such a way that now I can merge these weights back into the original model without changing the architecture of the model, right?[00:35:03] So, so, that tends to be, like, the go to, and you're training much fewer parameters, so that when you do that, yes, you still need to have all of the original weights, but you have a smaller gradient, you have a smaller optimizer state, and you're just training fewer weights, so you can tend to train those models on, like, much smaller GPUs.[00:35:27] swyx: Yeah. And it's roughly like, what I've seen out there is roughly like 1 percent the number of parameters that you're training. Yeah, that sounds about right. Which is that much cheaper. So Axolotl supports full fine tune, LoRA, QLoRA,[00:35:40] Wing Lian: Q— Yes. So, so QLoRA is very similar to LoRA. The paper was, if I remember correctly — rather, traditionally, most people who did LoRAs were putting the model weights in 8 bit, and then doing parameter efficient fine tuning over the LoRA weights, and then with QLoRA, they were quantizing all of those — they were then quantizing the weights down to 4 bit, right, and then I believe they were also training on all of the linear layers in the model.[00:36:15] And then with ReLoRA, that was an interesting paper, and then, I think, like, it got implemented. Some people in the community tried it out, and it showed that it didn't really have the impact that the paper indicated that it would. And from what I was told recently, they re— I guess they re-released something for ReLoRA, like, a few weeks ago, and that it's possibly better.[00:36:44] I personally haven't had the time. What was the[00:36:46] swyx: main difference,[00:36:47] Wing Lian: apart from quantization? I don't know. Okay. What was the main difference, sorry?[00:36:49] swyx: Apart from quantization, right? Like,[00:36:50] Wing Lian: QLoRA's thing was, like, we'll just drop off some bits. With ReLoRA, what they did was, you would go through, you would define some number of steps that you would train, like, your LoRA with, or your QLoRA.[00:37:01] Like, you could do ReLoRA with QLoRA, if you really wanted to — you would train your LoRA for some number of steps, and then you would merge those weights into your base model, and then you would start over. So then, by starting over, the optimizer has to, sort of, re-optimize again, and find what's the best direction to move in, and then do it all again, and then merge it in, do it all again — and theoretically, according to the paper, doing ReLoRA, you can do parameter efficient fine tuning, but still have sort of, like, the performance gains of doing a full fine tuning, so.[00:37:38] swyx: Yeah, and[00:37:39] Wing Lian: GPTQ?
And GPTQ — so, I think with GPTQ, it's very similar to, more similar to QLoRA, where it's mostly a quantization of the weights down to, like, 4 bit, where GPTQ is a very — is a specific methodology or implementation of quantization, so. Got it.[00:37:57] Alex Volkov: Wing, for folks who use Axolotl, your users, some people who maybe want to try it out —[00:38:03] do they need to know the differences? Do they need to know the implementation details of QLoRA versus ReLoRA? Or is it okay for them to just know that Axolotl is the place that already integrated them? And if that's true, if that's all they need to know, how do they choose which method to use? Yeah,[00:38:22] Wing Lian: so I think like, I think most people aren't going to be using ReLoRA.[00:38:25] I think most people are going to be using either LoRA or QLoRA. And I think they should have an understanding of why they might want to use one over the other. Most people will say that with QLoRA, the quality of the final model is not quite as good as if you were to do a LoRA or a full fine tune, right?[00:38:44] Just because you've quantized these down, so your accuracy is probably a little off, and so by the time you've done the QLoRA, you're not moving the weights how you would on a full fine tune with the full parameter weights.[00:38:56] Interesting.[00:38:57] swyx: Okay, cool. For people who are more interested, obviously, read the papers. I just wanted to give people, like, a high level overview of what these things are. And you've done people a service by making it easy for people to try it out. I'm going to also ask a question which I know to be wrong, but I'm curious because I get asked this all the time.[00:39:15] What is the difference between all these kinds of fine tunes[00:39:17] Wing Lian: and RLHF? Okay, between all of these sorts of fine tunes and RLHF. So all of these sorts of fine tunes are, ideally, taking knowledge that the base model already knows about, and presenting it in a way to the model such that the model uses what it already knows to sort of answer in a particular way — whether you're extracting general knowledge, a particular task, right?[00:39:44] Instruct tune, chat, those sorts of things. And then generally with RLHF — so what is it, let's go back, what is it? Reinforcement Learning with Human Feedback. So if we start with the human feedback part, what you're doing is you generally have, like, a given prompt, and then you maybe have one, maybe you have two — I think, like, if you look at Starling, you have up to, what, seven different possible responses — and you're sort of ranking those responses on some sort of metric, right, whether the metric is how much I might like that answer, versus — I think with Starling it's like how helpful was the answer, how accurate was the answer, how toxic was the answer, those sorts of things, on some sort of scale, right — and then using that to go back and sort of take a model and nudge it in the direction of that feedback, to be able to answer questions based on those preferences.[00:40:42] swyx: Yeah, so you can apply — and is it commutative? Can you apply fine tuning after and onto an RLHF model? Or should the RLHF apply, come in afterwards,[00:40:54] Wing Lian: after the fine tune?
Um, I, yeah, I don't know that there's been enough research one way or another, like, I don't know.[00:41:02] That's a question that's been asked on Discord. Yeah, like, I definitely would say I don't know the answer. Go and try it and report back to me and let me know so I can answer for the next guy.[00:41:10] swyx: It's shocking how much is still unknown about all these things. Well, I mean, that's what research is for, right?[00:41:16] Wing Lian: So actually, I think I saw on the top of a leaderboard, it was a Mistral base model, and they didn't actually fine tune it. They just did RLH— they did like an RLHF fine tune on it using, like — I don't recall which dataset, but it benchmarked really well.[00:41:37] But yeah, you'd have to go and look at it. But, so it is interesting, like going back to that, it's like — traditionally, most people will fine tune the model and then do, like, a DPO, PPO, some sort of reinforcement learning over that, but that particular model, it seemed like they skipped the supervised fine tuning.[00:41:55] Axolotl vs HF Transformers[00:41:55] swyx: Cool. One thing I did also want to comment about is the overall, like, landscape, competitive landscape, I don't know. Hugging Face Transformers, I think, has a PEFT module.[00:42:05] Wing Lian: Yeah, yeah, the PEFT, the Parameter Efficient Fine Tuning, yep. Is that a competitor to you? No, no, so we actually use it. We're just a wrapper over sort of, sort of the Hugging Face stuff.[00:42:15] So, so that is their own sort of module where they have taken the responsibility — or yeah, the responsibility of, like, where you're doing these parameter efficient fine tuning methods — and it is in that particular package, where Transformers is mostly responsible for sort of the modeling code and the trainer, right.[00:42:35] And then sort of, there's an integration between the two, and there's, like, a variety of other fine tuning packages — I think like TRL, TRLX; TRLX, that's the Stability AI one, yeah, Carper, and TRL is a Hugging Face trainer. Even that one's just another wrapper over the Transformers library and the PEFT library, right?[00:43:00] But what we do is we have taken sort of those — yes, we also use that, but we also have more validation, right? So, there are some of us who have done enough fine tunes where, like, oh, this and this just don't go together, right? But most people don't know that, so like — Example?[00:43:19] Like, people want to know when one and one doesn't go together. I don't have an example offhand, but if you turn this knob and this knob, right? You would think, all right, maybe this will work, but you don't know until you try. And then by the time you find out it doesn't work, it's like maybe five minutes later, it's failed.[00:43:34] It's failed in the middle of training or it's failed during the evaluation step. And you're like, ah. So we've added a lot more validation in it, so that when you've created your configuration, you run it through and now you say — the validation code says this is probably not right, or probably not what you want.[00:43:52] So are you like a, you[00:43:53] swyx: do some linting of your YAML file?[00:43:56] Wing Lian: There, I guess you could call it linting, it's sort of like — Is there a set of rules out[00:44:00] swyx: there somewhere? Yeah, there's a set of rules in there.
That's amazing. You should write documentation like: this rule is because this user at this time, like, ran into this bug, and that's what we invested in.[00:44:10] It's like a good collection[00:44:11] Wing Lian: of knowledge. Yeah, it is, and I guess, like, if you really wanted to figure it out, I guess you could, like, git blame everything. But, yeah, so, I think that's always a useful thing, because people want to experiment, but people will get frustrated when you're experimenting and it breaks and you don't know why, or you know why and you've just gone down the rabbit hole, right?[00:44:37] So, so I think that's one of the big features that I find important, because it prevents you from doing things you probably shouldn't have, and sometimes we will let you do those things, but we'll try and warn you that you've done that.[00:44:50] I[00:44:51] Alex Volkov: have a follow up question on this, actually, because yesterday we hung out at this open source event, and I spent time by you a couple times, like when people told you, oh, Axolotl, I use Axolotl, it's super cool, and then the first thing you asked is, like, immediately, like, what can we improve?[00:45:04] And yes, from multiple folks, and I think we talked about this a little bit, where there's — it's a developer tool. It's like a machine learning slash developer tool. Your purpose in this is to help and keep people, as much as possible, like, hey, here's the best set of things that you can use right now. The bare libraries, or the bare trainer, for example — it's a bare trainer.[00:45:28] And also, maybe we should talk about how fast you're implementing these things. So you mentioned the first implementation took a week or so. Now there's a core maintainer group, right? There's like, features are landing, like QLoRA, for example. NEFTune — I don't know if that's one example of something that people potentially said was going to be cool, and then eventually, like, one of those things that didn't really shake out, like, people quickly tested this out.[00:45:48] So, there's a ton of — Wait, NEFTune is cancelled? I don't know if it's fully cancelled, but based on vibes, I heard that it's not that great. So like, but the whole point that I'm trying to make with NEFTune as well is that existing in the community of, like, Axolotl, or, I don't know, even following the GitHub options or following the Discord, it's a fairly good way to, like, learn these kind of gut feelings that you just said, right?[00:46:14] Like where this, maybe this knob, that knob doesn't work. Some of these are not written down. Some of these are like tribal knowledge that passes from place to place. Axolotl is like a great collection of many of them. And so, do you get that back also from the community of folks who just use it? Like, how do you know who uses this?[00:46:30] I think that's still an issue, like, knowing if they trained with Axolotl, or should they add this to things?
Talk about, how do you get feedback, and how else should you get feedback?[00:46:38] Wing Lian: Yeah, I mean, most of the feedback comes from the Discord, so people come in and they can't get a training run going, they run into, like, obscure errors, or errors that — that's a lot of things that maybe, maybe as a product we could catch, but like, there's a lot of things that at some point we need to go and do, and it's just on the list somewhere.[00:46:58] Right, that's why when people come up, I'm like, what, what were your pain points? Because, like, as a developer tool, if you're not happy with it, or you come in and the first time takes you 30 minutes and you're still not happy, you leave the tool. And you might move on, maybe to a better tool, maybe to one with less frustration, but it may not be as good, right?[00:47:17] So I'm trying to, like, figure out, all right, how can I reduce all this frustration? Because, like, for me, I use it every day for the most part, right? And so I am blind to that, right? Mm-hmm. Mm-hmm. I just know, I go do this, this, and this. It pretty much mostly works, right? But, so I don't have sort of that learning curve that other people are seeing, and I don't understand their pain points.[00:47:40] Yeah,[00:47:40] Alex Volkov: you don't have the ability to onboard yourself as a new user, completely new to the whole paradigm, to, like, get into the doors of, like — oh, no, I don't even know how to, like, ask about this problem or error.[00:47:53] swyx: Cool. The last few things I wanted to cover was also just the more advanced stuff that you covered yesterday.[00:48:00] 20x efficiency with StackLlama and Multipack[00:48:00] swyx: So I'll just caution this as, like, yeah, this is more advanced. But you mentioned StackLlama and Multipack. What are they[00:48:06] Wing Lian: and what should people know? Yeah, so, so, StackLlama was — that paper came out, so StackLlama, I think, was, like, two separate concepts that they announced, so the first one was — They being Hugging Face.[00:48:20] Yeah, sorry, yes, they being Hugging Face — so the first one being sort of, like, this idea of packing, like packing some sequences together. So like, if we think about training data, right, your training data is — let's say, to keep the math easy, let's say your training data is 500 — we'll use the terminology 'words'.[00:48:39] Let's say your training data is 500 words long, and let's say your context length — you know, how much data your model can accept, or that you want to feed into your model — let's say it's 4,000, right? So if you're training at 4K context and you're only using 500 of it, you're sitting with the other 3,500 words that you're not using, right?[00:49:05] And typically that's either filled with these PAD tokens — so I think I made the analogy last night that it's like having sort of like a glass here: you fill it up with a shot of liquor, and that's your training data, and then you just fill it up with more water, and those are your PAD tokens, and it just, it doesn't do much, right?[00:49:27] It's still the same thing, but you still have to go through all of that to go through all your training data.
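A quick back-of-the-envelope sketch of the padding overhead described here, using the numbers from the example above (500-word samples in a 4,000-token context); the helper function is purely illustrative:

```python
# Rough illustration of how much of each padded training row is wasted on
# PAD tokens when short samples are padded out to the full context length.

def padding_waste(sample_len: int, context_len: int) -> float:
    """Fraction of a padded row that is PAD tokens rather than real data."""
    return 1.0 - sample_len / context_len

waste = padding_waste(500, 4000)
print(f"{waste:.0%} of every padded row is PAD tokens")  # ~88% wasted

# Packing instead fits context_len // sample_len = 8 samples per row,
# so the same data passes through the model in roughly 8x fewer rows.
```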
And then, so what StackLlama showed was you could just sort of take your training data, append the next row of training data until you filled that entire 4K context — so in this example, right, with 500 words to 4K, that's 8 rows of training data.[00:49:48] But the problem with that is that a lot of these transformer models are very much relying on attention, right? So, like, if you now have this sequence of words, the model has seen all of these other words before, right? And then it sees another set of words, another set of words, but it's learning everything in context of all the words that it's seen before.[00:50:13] We haven't corrected the attention for that. And just real quickly, since I said that that paper was two concepts — the other one was, I believe, like, a reinforcement learning thing, but outside the scope of this. So going from that, I implemented that early on because I was like, oh, wow, this is really great.[00:50:29] And, yes, because it saves you a bunch of time, but the trade off is a little bit of accuracy, ultimately — but it still did pretty well. I think when I did Manticore, I think it used sort of that concept from StackLlama of just sort of appending these sequences together, right? And then sort of the next evolution of that is Multipack, right?[00:50:51] So, there was a separate paper on that — it was, I believe, referenced in the Orca paper — where you could properly mask those out using, like, I think it was a lower block triangular attention mask, and then sort of, so — So, there's that. I did try implementing that, manually recreating that mask, but then one of the guys from OpenChat — he was helping with OpenOrca as well — he had done an implementation of Multipack where he used FlashAttention. So FlashAttention — that was released by Tri Dao, and it was this huge performance gain.[00:51:35] Everybody uses it now, even the Transformers library now — they've taken all of these, like, people are taking all of these models and sort of, like, making them compatible with FlashAttention. But in FlashAttention, there is one particular implementation that lets you say, well, I'm sending you all of these sequences like you would in StackLlama, but let me send you another set of information about: this is where this set of sequences is, this is where the second set of sequences is.[00:52:06] So like, if it was, like, 500 words long, and you stacked them all together, you would just send it a row of information that was like 0, 500, 1000, 1500, etc., etc., out to 4000. And it would know, all right, I need to break this up, and then run the forward pass with it. And then it would be able to — and it was much, much more performant.[00:52:29] And I think you end up seeing like 10x, 20x improvements over sort of — I mean, I think FlashAttention was like a 2x improvement, and then adding that with the Multipack, you start to see, depending on how much data you have, up to like a 20x improvement sometimes. 20x. 20x. Wow. Yeah.[00:52:48] And I only know the 20x because, like, before last night, I re-ran the Alpaca — I looked up the Alpaca paper because I just needed a frame of reference where somebody did it, and I think they used eight A100s for three hours, and they said it cost them $100. I don't, I don't think eight A100s cost — I don't know how much it costs right now.[00:53:14] But I ended up rerunning it. Usually a dollar an hour, right?
Yeah, so eight. The cheapest is like a[00:53:18] Alex Volkov: dollar, a dollar an hour for one.[00:53:20] Wing Lian: Yeah, so that's still like $24, $25. But maybe if you're going on Azure, maybe it's like, maybe it's $100 on Azure. I mean, it used to be more expensive, like, a year ago.[00:53:31] Yeah, and then, so I re-ran it with, sort of like, I turned on all of the optimizations just to see what it would be. And usually Multipack is the biggest optimization, so Multipack with FlashAttention. And I think I spun it up on 8 L40s, and it ran, and I didn't let it run all the way through, I just grabbed the estimated completion time, and it was like 30 minutes, so it would have cost like $4 or $5 to run the entire — like, reproduce the Alpaca paper, right?[00:54:00] Which is crazy. It's crazy. 20x,[00:54:02] Alex Volkov: yeah. I want to ask about, like, you said you turned on all the optimizations. Is that the YAML file with Axolotl — you just go and, like, check off, like, I want this, I want that? Yeah, yeah,[00:54:10] Wing Lian: so there's one particular YAML file in there that's, like — it's under examples, llama2, fft, optimize.[00:54:20] So, I think someone had created one where they just put in all of the optimizations and turned them on. I mean, it actually does run, which is, like, sort of surprising sometimes, because sometimes you optimize this, optimize this, and sometimes they just don't work together, but, yeah.[00:54:36] Just turn the knobs on, and, like, fine tuning should really just be that easy, right? I just want to flip the knob and move on with my life and not figure out how to implement it.[00:54:47] Tri Dao and Mamba[00:54:47] Alex Volkov: Specifically, the guy behind FlashAttention came up with something new. You want to talk about this a little bit? You want to briefly cover Mamba?[00:54:53] Yeah, let's talk about Mamba. Let's talk about Mamba. So, what is Mamba?[00:54:57] Wing Lian: Oh, gosh. I
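A minimal sketch of the Multipack-style packing Wing describes a little earlier: concatenate samples into one packed row, keep the cumulative sequence boundaries (the 0, 500, 1000, … 4000 offsets), and build a block-diagonal causal mask so packed samples never attend to each other. The function names and details here are illustrative assumptions, not Axolotl's or FlashAttention's actual API:

```python
import torch

def pack_samples(samples: list[list[int]], context_len: int = 4096, pad_id: int = 0):
    """Greedily concatenate tokenized samples into one packed row and return
    the cumulative sequence boundaries (cu_seqlens) for that row."""
    packed, cu_seqlens = [], [0]
    for tokens in samples:
        if len(packed) + len(tokens) > context_len:
            break  # a real packer would start filling the next row here
        packed.extend(tokens)
        cu_seqlens.append(len(packed))
    packed.extend([pad_id] * (context_len - len(packed)))  # pad only the tail
    return torch.tensor(packed), torch.tensor(cu_seqlens, dtype=torch.int32)

def block_causal_mask(cu_seqlens: torch.Tensor, context_len: int) -> torch.Tensor:
    """The 'lower block triangular' mask mentioned in the conversation:
    causal attention inside each packed sample, nothing across samples."""
    mask = torch.zeros(context_len, context_len, dtype=torch.bool)
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        mask[start:end, start:end] = torch.tril(
            torch.ones(end - start, end - start, dtype=torch.bool))
    return mask
```

Attention kernels with variable-length support (the FlashAttention interface referred to above) take the boundary list directly instead of materializing a full mask, which is where the large speedup comes from.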

The Full Nerd
Episode 280: Digital Foundry Talks Ray Tracing, VRAM Requirements & More

Play Episode Listen Later Nov 21, 2023 103:52


Join The Full Nerd gang as they talk about the latest PC hardware topics. In this episode the gang is joined by Alex Battaglia of  @DigitalFoundry  to talk about the future of ray tracing/path tracing, VRAM requirements in modern games, unoptimized ports, what it's like to work at DF and more. And of course we answer all of your burning questions, so be sure to join us live! * This episode of The Full Nerd is sponsored by Asus. On December 3rd the company will be hosting PC DIY Day, a celebration of all things PC building and modding. Head to https://www.asus.com/us/site/PCDIY/ to pick up tickets for this event or tune into the live stream! Buy The Full Nerd merch: https://crowdmade.com/collections/pcworld Join the PC related discussions and ask us questions on Discord: https://discord.gg/SGPRSy7 Follow the crew on Twitter: @GordonUng @BradChacos @MorphingBall @AdamPMurray Follow PCWorld for all things PC! ------------------------------­---- SUBSCRIBE: http://www.youtube.com/subscription_center?add_user=PCWorldVideos TWITTER: https://www.twitter.com/pcworld

Double Barrel Gaming
Mike Ybarra Confident With Phil Spencer's Leadership, Playstation Confirms BIG Delays For 1st Party

Play Episode Listen Later Nov 10, 2023 111:58


TIME STAMP INFO: 00:00:00 Intros 00:05:00 Baldur's Gate 3 developers found a 34% VRAM optimization while developing the Xbox Series S port. 00:30:00 PlayStation Confirms BIG Delays For 1st Party GAAS Titles, 1/2 Of Them Have Been Delayed Outside Of 2025 & Beyond! 01:20:00 Mike Ybarra Confident With Phil Spencer's Leadership & Sees A Bright Future For The Company! 01:48:00 Panel Outros and Special Message to the community! --- Support this podcast: https://podcasters.spotify.com/pod/show/craig-ravitch/support

The Nonlinear Library
LW - Memory bandwidth constraints imply economies of scale in AI inference by Ege Erdil

Play Episode Listen Later Sep 17, 2023 6:26


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Memory bandwidth constraints imply economies of scale in AI inference, published by Ege Erdil on September 17, 2023 on LessWrong. Contemporary GPUs often have very imbalanced memory vs arithmetic operation capabilities. For instance, an H100 can do around 3e15 8-bit FLOP/s, but the speed at which information can move between the cores and the GPU memory is only 3 TB/s. As 8 bits = 1 byte, there is a mismatch of three orders of magnitude between the arithmetic operation capabilities of the GPU and its memory bandwidth. This imbalance ends up substantially lowering the utilization rate of ML hardware when batch sizes are small. For instance, suppose we have a model parametrized by 1.6 trillion 8-bit floating point numbers. To just fit the parameters of the model onto the GPUs, we'll need at least 20 H100s, as each H100 has a VRAM of 80 GB. Suppose we split our model into 20 layers and use 20-way tensor parallelism: this means that we slice the parameters of the model "vertically", such that the first GPU holds the first 5% of the parameters in every layer, the second GPU holds the second 5%, et cetera. This sounds good, but now think of what happens when we try to run this model. In this case, roughly speaking, each parameter comes with one addition and one multiplication operation, so we do around 3.2 trillion arithmetic operations in one forward pass. As each H100 does 3e15 8-bit FLOP/s and we have 20 of them running tensor parallel, we can do this in a mere ~ 0.05 milliseconds. However, each parameter also has to be read into memory, and here our total memory bandwidth is only 60 TB/s, meaning for a model of size 1.6 TB we must spend (1.6 TB)/(60 TB/s) ~= 27 ms just because of the memory bottlenecks! This bottlenecks inference and we end up with an abysmal utilization rate of approximately (0.05 ms)/(27 ms) ~= 0.2%. This becomes even worse when we also take in inter-GPU communication costs into account, which would be at around 1 TB/s if the GPUs are using NVLink. Well, this is not very good. Most of our arithmetic operation capability is being wasted because the ALUs spend most of their time idling and waiting for the parameters to be moved to the GPU cores. Can we somehow improve this? A crucial observation is that if getting the parameters to the GPU cores is the bottleneck, we want to somehow amortize this over many calls to the model. For instance, imagine we could move a batch of parameters to the cores and use them a thousand times before moving on to the next batch. This would do much to remedy the imbalance between memory read and compute times. If our model is an LLM, then unfortunately we cannot do this for a single user because text is generated serially: even though each token needs its own LLM call and so the user needs to make many calls to the model to generate text, we can't parallelize these calls because each future token call needs to know all the past tokens. This inherently serial nature of text generation makes it infeasible to improve the memory read and compute time balance if only a single user is being serviced by the model. However, things are different if we get to batch requests from multiple users together. For instance, suppose that our model is being asked to generate tokens by thousands of users at any given time. 
Then, we can parallelize these calls: every time we load some parameters onto the GPU cores, we perform the operations associated with those parameters for all user calls at once. This way, we amortize the reading cost of the parameters over many users, greatly improving our situation. Eventually this hits diminishing returns because we must also read the hidden state of each user's calls into GPU memory, but the hidden states are usually significantly smaller than the whole model, so parallelization still results in huge ...
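The arithmetic in this post can be reproduced in a few lines. The sketch below just plugs in the numbers quoted above (an H100 at ~3e15 8-bit FLOP/s and ~3 TB/s of memory bandwidth, a 1.6-trillion-parameter 8-bit model split across 20 GPUs) and treats batching as a toy model that ignores hidden-state and KV-cache reads:

```python
# Reproduces the memory-bandwidth arithmetic from the post above.
flops_per_gpu = 3e15       # ~8-bit FLOP/s for one H100
mem_bw_per_gpu = 3e12      # ~3 TB/s of memory bandwidth per GPU
n_gpus = 20
param_bytes = 1.6e12       # 1.6T parameters at 1 byte each (8-bit)

flop_per_pass = 2 * 1.6e12                            # ~1 multiply + 1 add per parameter
compute_time = flop_per_pass / (flops_per_gpu * n_gpus)
read_time = param_bytes / (mem_bw_per_gpu * n_gpus)   # every weight read once
print(f"compute ~{compute_time*1e3:.2f} ms, weight reads ~{read_time*1e3:.0f} ms")
print(f"utilization at batch size 1: ~{compute_time/read_time:.1%}")

# Batching amortizes the weight reads: compute grows with the batch while the
# dominant weight-read time stays fixed, so utilization climbs until hidden
# states and communication start to matter (ignored in this toy model).
for batch in (1, 32, 1024):
    util = min(1.0, batch * compute_time / read_time)
    print(f"batch {batch:>4}: utilization ~{util:.0%}")
```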

Les Cast Codeurs Podcast
LCC 299 - Katia est dans la place !

Play Episode Listen Later Sep 11, 2023 79:59


  Dans cet épisode de rentrée, Antonio et Arnaud ont le plaisir d'accueillir Katia Aresti dans l'équipe. Ils passent en revue les dernières nouveautés et sujets chauds de cette rentrée, notamment la sortie de Java 21, les nouvelles versions de Quarkus, Micronaut, Hibernate, NodeJS, Redis, et bien d'autres encore. Ils discutent de sujets plus généraux tels que l'observabilité, la nouvelle tendance “Platform Engineering”, et la productivité des développeurs. Ils abordent aussi les sujets sur la sécurité, tels que les failles sur les CPUs Intel et AMD, ainsi que la vie privée, avec les Tracking APIs de Chrome, Firefox et le projet de loi SREN. Le tout est agrémenté de sa dose d'IA, avec des librairies telles que Semantic Kernel, ainsi que des sujets plus haut niveau tels que Google Gemini, Meta GPT, LLama 2, et les biais et la consommation énergétique de l'IA. Enregistré le 8 septembre 2023 Téléchargement de l'épisode LesCastCodeurs-Episode–299.mp3 News Langages Apache Groovy a 20 ans! https://twitter.com/ApacheGroovy/status/1695388098950217909 L'annonce du lancement du projet par James Strachan https://web.archive.org/web/20030901064404/http://radio.weblogs.com/0112098/2003/08/29.html Le projet a depuis énormément évolué et après plusieurs vies a été adopté par la fondation Apache en 2015 Java 21 arrive le 19 septembre https://www.infoworld.com/article/3689880/jdk–21-the-new-features-in-java–21.html. C'est la nouvelle LTS Pas mal de nouvelles fonctionnalités comme les virtual threads, le pattern matching sur les switch, sequenced collections … Retrouvez le 19 septembre une interview de Jean-Michel Doudoux par Charles Sabourdin pour l'épisode 300 des castcodeurs! Librairies Semantic Kernel pour Java est (en train de) sorti: https://devblogs.microsoft.com/semantic-kernel/introducing-semantic-kernel-for-java/ Framework OSS pour faire de l'IA .Net et Python Java 0.2.7 Alpha est publié Kernel car il est tout petit Se connecte à plusieurs fournisseurs (aujourd'hui OpenAI, Azure AI, Hugging Face), plusieurs DB vectorielles, plusieurs template de prompt (suit la specification de OpenAI) OpenSSL qui committe https://www.openssl.org/blog/blog/2023/07/17/who-writes-openssl/ en majorité des OSS payés puis des gens payés par leur boite et enfi des contributeurs non payés c'est ne passant rapide mais ca montre que depuis heartbleed, ca a changé Micronaut 4.1.0 https://micronaut.io/2023/09/01/micronaut-framework–4–1–0-released/ Bean Mappers pour créer automatiquement une correspondance entre un type et un autre un Introspection Builder l'annotation @Introspected pour générer un builder dynamique si un type ne peut être construit que via un modèle builder améliorations pour les développeurs utilisant Kotlin Symbol Processing (KSP) Quarkus 3.3.1 / 3.3.2 https://quarkus.io/blog/quarkus–3–3–1-released/ https://quarkus.io/blog/quarkus–3–3–2-released/ Pas mal de fixes https://github.com/quarkusio/quarkus/releases/tag/3.3.1 https://github.com/quarkusio/quarkus/releases/tag/3.3.2 Il est important de noter qu'un problème de dégradation des performances et de la mémoire a été introduit dans Quarkus 3.3. Ce problème est corrigé dans Quarkus 3.3.2. 
Hibernate ORM 6.3.0 et 6.2.8 https://hibernate.org/orm/ et Hibernate Reactive 2.0.5 un support initial de la spécification Jakarta Persistence 3.2 Un nouveau guide d'introduction Hibernate 6, un nouveau guide de syntaxe et de fonctionnalités pour le langage de requête Hibernate (Hibernate Query Language) Annotation @Find sur des méthodes -> créer des méthodes de recherche similaires aux méthodes de requête Reactive compatible avec Hibernate ORM 6.2.8.Final, certains changements d'api Infrastructure Une série d'articles sur l'observabilité par Mathieu Corbin Observability: tout ce que vous avez toujours voulu savoir sur les métriques: https://www.mcorbin.fr/posts/2023–07–04-metriques/ Tracing avec Opentelemetry: pourquoi c'est le futur (et pourquoi ça remplacera les logs): https://www.mcorbin.fr/posts/2023–08–20-traces/ L'auteur reprend les bases sur l'observabilité. Qu'est ce qu'une métrique ? Les labels, les cardinalités Les types de métriques (Compteurs, jauges, quantiles et histogrammes) C'est quoi le tracing ? Traces, Spans, Resources, Scopes qu'est ce que c'est? Les Events pour remplacer les logs? Web NodeJS 20.6.0 est disponible et ajoute le support des fichiers .env https://philna.sh/blog/2023/09/05/nodejs-supports-dotenv/ Configurable avec l'option --env-file Le fichier .env peut contenir des variables d'environnement et commentaires # Attention par contre: pas de lignes multiples ni d'extension de variables Vous pouvez par exemple configurer NODE_OPTIONS avec ce système Data Redis 7.2 est sorti ! https://redis.com/blog/introducing-redis–7–2/ Auto-tiering : cette nouvelle fonctionnalité permet de stocker les données sur des supports de stockage différents, en fonction de leur importance et de leur fréquence d'accès. Cela permet d'améliorer les performances et la scalabilité de Redis. RESP3 : cette nouvelle version du protocole RESP permet une communication plus efficace entre Redis et les clients. Improvements to performance : de nombreuses améliorations de performances ont été apportées à Redis 7.2, notamment pour les opérations de lecture et d'écriture. New commands : plusieurs nouvelles commandes ont été ajoutées à Redis 7.2, notamment : CLIENT NO-TOUCH : cette commande permet d'empêcher un client d'être touché par une opération AOF ou RDB. WAITAOF : cette commande permet d'attendre que l'AOF soit écrite avant de poursuivre l'exécution. Dans le podcast sont cités les hot replacement des Redis, comme https://www.dragonflydb.io/ Architecture Article sur Google Gemini et sa capacité a battre ChatGPT https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini Google a raté les premiers pas (ils avient le meilleur LLM public avant ChatGPT 3) ET les chercheurs qui invente le champs des LLMs Google va 5x ChatGPT–4 avant al fin de l'année, mais vont-il les publier les chercheurs se tirent la bourre sur le nombre de GPU (H100) auxquels ils ont accès ; ce sont lers grosses orga comme Meta OpenAI Google et les autres qui lutent avec des GPU qui n'ont pas assez de VRAM et ce qu'ils vont faire c'est de la merde et sans consequence le peuple utilise le modele dense de LLAMA mais pour les environnements contraints ca serait mieux des sparse models et du speculative decoding. 
ils devraient se concentre sur la performance de modele qui utilise plus de compute et memoire en evitant de consommer de la bande passante de memoire, c'est ce que l'edge a besoin les benchmarks public ne mesurent pas des choses utiles meme hugging faces est dans la category des pauvres de GPU Nvidia est entrain de se construire une machine de guerre (service) la chine et les us vont etre en competition mais l'europe qui fait du GPU pauvre ne va pas s'en sortir les startups ne peuvent pas payer les GPU en actiosn, il faut du cash Tout le monde rempli les poches de NVidia, sand Google Gogole grossi exponentiellement ses propres GPUs Meta GPT https://www.infoq.com/news/2023/08/metagpt-agent-collaboration/ IA: les biais et énergie qui consomme par Leslie Miley tech advisor du CTO de Microsoft https://www.infoq.com/presentations/ai-bias-sustainability nouvels infranstructures consommation énergétique et d'eau des data center pour IA est terriblement coûteuse l'impact des infrastructures sur les comunautés (bruit) explique bien son point de vu sur les problèmes d'amplification des biais du IA propose des stratégies pour mitiger l'impact negatif Kubeflow toolkit pour deployer machine learning (ML) workflow en Kubernetes est accepté par la CNCF (Cloud Native Computing Foundation) https://www.infoq.com/news/2023/08/kubeflow-cncf-project Méthodologies Measuring developer productivity? A response to McKinsey by Kent Beck and Gergely Orosz (pragmaticengineer.com) https://tidyfirst.substack.com/p/measuring-developer-productivity McKinsey a sorti un article où ils expliquent la recette miracle recherchée par tous les managers comme le graal: Comment mesurer la productivité des développeurs? (faut bien vendre du conseil) Kent et Gergely partent d'un model mental de description de la création de valeur par le développeur pour ensuite voir quels sont les besoins de mesurer la productivité et comparent cela avec d'autres secteurs (la vente, le support, le recrutement). Ils concluent cette première partie avec les compromis à faire pour que ce type de mesures ait un intérêt sans impacter trop négativement les développeurs un autre article dans la même lignée de Martin Fowler https://martinfowler.com/bliki/CannotMeasureProductivity.html Et si on parlait de Platform Engineering ? DevOps vs. SRE vs. Platform Engineering (humanitec.com) What is platform engineering? (gartner.com) / What is platform engineering? (platformengineering.org) Internal Developer Platform Cognitive load Team topologies Engineering Effectiveness (thoughtworks.com) and Maximize your tech investments with Engineering Effectiveness (thoughtworks.com) Ces différents articles retracent la génèse du concept de Platform Engineering L'activité de Platform Engineering vient en réponse à la charge cognitive rajoutée aux équipes techs dans des transitions DevOps loupées (You build it, you run it … et vous vous débrouillez). Cela conduit à la création de golden paths et d'une Internal Developers Platform qui doit proposer en interne les services nécessaires aux équipes pour livrer leurs produits le lus efficacement possible tout en suivant les critères de qualité, de compliance de l'entreprise. 
Pour en savoir plus, une table ronde à laquelle Arnaud a participé en Juillet : https://youtu.be/N-tN7HUA4No?si=2P0wSqG32MLWUlGq On call Process (Astreinte) , startup TinyBird par VP Engineering Félix López (ex google, ex eventbrite) https://thenewstack.io/keeping-the-lights-on-the-on-call-process-that-works/ Si votre produit est SAAS, on doit avoir des astreintes. Cela impose un lourd fardeau à ceux qui doivent être en astreinte,, surtout en petite entreprise Petites entreprises évitent avoir un processus d'astreinte formel pour éviter le stress. Cela crée dans la pratique plus de stress: Si personne n'est responsable, tout le monde est responsable. Tinybird est la plateforme de données en temps réel pour les développeurs et les équipes de données. Pré création du process formel chez Tinybird: désorganisé, non structuré et stressant Mise en place: Principes fondamentaux d'un processus d'astreinte: L'astreinte n'est pas obligatoire, minimiser le bruit, pas seulement pour les SRE, alert = runbook, avoir des backups pour la personne en astreinte, appeler quelqu'un devrait être la dernière solution, minimiser le temps en astreinte L'article explique comment ils sont passé regarder chaque alerte (comprehensible?, exploitable?), puis avoir un board grafana pour chacune et plan spécifique. Une fois le tri fait, tout migré vers un seul channel de com, et manuel d'astreinte pour chaque alerte. Itérer. Multiples benefices sur le long terme: rapports d'incident ouvert, atténuer les problèmes futurs, renforcement la propriété et les connaissances du code et systèmes au sein de toute l'équipe etc. Sécurité Downfall, une nouvelle faille de sécurité sur les processeurs intel ( https://www.lemondeinformatique.fr/actualites/lire-la-faille-downfall-met-a-mal-des-milliards-de-processeurs-intel–91247.html ) et AMD ne fait pas mieux avec une faille nommée Inception (https://www.lemondeinformatique.fr/actualites/lire-les-puces-amd-vulnerables-a-la-faille-inception–91273.html) Downfall, La vulnérabilité est due à des fonctions d'optimisation de la mémoire dans les processeurs Intel qui révèlent involontairement les registres matériels internes aux logiciels. Cela permet à des logiciels non-fiables d'accéder à des données stockées par d'autres programmes, qui ne devraient normalement pas être accessibles. Tous les PC ou ordinateurs portables équipés de processeurs Intel Core de la 6e génération Skylake jusqu'aux puces Tiger Lake de 11e génération incluses contiennent cette faille. Les derniers processeurs Core 12e et 13e génération d'Intel ne sont pas concernés. Inception, nécessite un accès local au système pour être potentiellement exploité ce qui en limite de fait la portée. Tous les processeurs AMD depuis 2017 sont touchés, incluant les derniers modèles Zen 4 Epyc et Ryzen Comment désactiver le nouveau tracking publicitaire ciblé sur Chrome https://www.blogdumoderateur.com/chrome-comment-desactiver-tracking-publicitaire-cible/ Google a annoncé en juillet le déploiement de sa nouvelle API Topics, permettant « à un navigateur de partager des informations avec des tiers sur les intérêts d'un utilisateur tout en préservant la confidentialité ». C'est cette API, incluse dans la version Chrome 115 de juillet 2023, qui est censée remplacer les cookies tiers. Loi, société et organisation Une nouvelle definition d'open pour Llama 2? 
https://opensourceconnections.com/blog/2023/07/19/is-llama–2-open-source-no-and-perhaps-we-need-a-new-definition-of-open/ c'est relativement “open” mais il y a des restrictions donc pas open source pas plus de 700 M d'utilisateurs par mois pas le droit d'utiliser Llama pour améliorer d'autres modèles autres que dse dérivés de Llama et c'est le modele final qui est ouvert, pas la sauce pour le construire, donc pas de maven build ni le “source code” pour y arriver “from scratch” attention au risuqe de sacrivier open source pour avoir l'IA plus vite, plus facile HashiCorp passe tous ses projets open source en BSL, comme Confluent, Mongo, Redis, Elastic, etc https://thenewstack.io/hashicorp-abandons-open-source-for-business-source-license/ Couverture par InfoQ https://www.infoq.com/news/2023/08/hashicorp-adopts-bsl/ Fork de Terraform : OpenTF, avec pour objectif de rejoindre la CNCF https://opentf.org/announcement Stack overflow annonce Overflow AI https://www.infoq.com/news/2023/09/stackoverflow-overflowai/ l'intégration de l'IA générative dans leur plateforme publique, Stack Overflow for Teams, ainsi que de nouveaux domaines de produits IA/ML aident à générer des balises initiales et à suggérer des paires question-réponse, permettant aux développeurs de se concentrer sur l'amélioration et la précision Amélioration des Capacités de Recherche Les forums de questions-réponses basés sur la communauté sont le cœur battant de Stack Overflow. Selon Prashanth Chandrasekar, PDG de Stack Overflow, l'objectif d'OverflowAI est d'améliorer la communauté de diverses manières plutôt que de la remplacer complètement. Vous avez entendu parler du projet de loi SREN ? http://share.mozilla.org/817319645t Le gouvernement français prépare une loi qui pourrait menacer la liberté sur Internet. Le projet de loi visant à sécuriser et réguler l'espace numérique (SREN) obligerait les navigateurs web, comme Mozilla Firefox, à bloquer des sites web directement au niveau du navigateur. 
Mozilla lance une pétition pour retirer cette n-ieme solution stupide pour censurer Internet Conférences La liste des conférences provenant de Developers Conferences Agenda/List par Aurélie Vache et contributeurs : 8 septembre 2023 : JUG Summer Camp - La Rochelle (France) 14 septembre 2023 : Cloud Sud - Toulouse (France) & Online 18 septembre 2023 : Agile Tour Montpellier - Montpellier (France) 19 septembre 2023 : Salon de la Data Nantes - Nantes (France) & Online 19–20 septembre 2023 : Agile en Seine - Paris (France) 21–22 septembre 2023 : API Platform Conference - Lille (France) & Online 22 septembre 2023 : Agile Tour Sophia Antipolis - Valbonne (France) 25–26 septembre 2023 : BIG DATA & AI PARIS 2023 - Paris (France) 28–30 septembre 2023 : Paris Web - Paris (France) 2–6 octobre 2023 : Devoxx Belgium - Antwerp (Belgium) 6 octobre 2023 : DevFest Perros-Guirec - Perros-Guirec (France) 10 octobre 2023 : ParisTestConf - Paris (France) 11–13 octobre 2023 : Devoxx Morocco - Agadir (Morocco) 12 octobre 2023 : Cloud Nord - Lille (France) 12–13 octobre 2023 : Volcamp 2023 - Clermont-Ferrand (France) 12–13 octobre 2023 : Forum PHP 2023 - Marne-la-Vallée (France) 13–14 octobre 2023 : SecSea 2K23 - La Ciotat (France) 17–20 octobre 2023 : DrupalCon Lille - Lille (France) 19–20 octobre 2023 : DevFest Nantes - Nantes (France) 19–20 octobre 2023 : Agile Tour Rennes - Rennes (France) 26 octobre 2023 : Codeurs en Seine - Rouen (France) 30 septembre 2023 : ScalaIO - Paris (France) 26–27 octobre 2023 : Agile Tour Bordeaux - Bordeaux (France) 26–29 octobre 2023 : SoCraTes-FR - Orange (France) 10 novembre 2023 : BDX I/O - Bordeaux (France) 15 novembre 2023 : DevFest Strasbourg - Strasbourg (France) 16 novembre 2023 : DevFest Toulouse - Toulouse (France) 18–19 novembre 2023 : Capitole du Libre - Toulouse (France) 23 novembre 2023 : DevOps D-Day #8 - Marseille (France) 23 novembre 2023 : Agile Grenoble - Grenoble (France) 30 novembre 2023 : PrestaShop Developer Conference - Paris (France) 30 novembre 2023 : WHO run the Tech - Rennes (France) 6–7 décembre 2023 : Open Source Experience - Paris (France) 7 décembre 2023 : Agile Tour Aix-Marseille - Gardanne (France) 7–8 décembre 2023 : TechRocks Summit - Paris (France) 8 décembre 2023 : DevFest Dijon - Dijon (France) 31 janvier 2024–3 février 2024 : SnowCamp - Grenoble (France) 6–7 mars 2024 : FlowCon 2024 - Paris (France) 19–22 mars 2024 : KubeCon + CloudNativeCon Europe 2024 - Paris (France) 28–29 mars 2024 : SymfonyLive Paris 2024 - Paris (France) 17–19 avril 2024 : Devoxx France - Paris (France) 25–26 avril 2024 : MiXiT - Lyon (France) 25–26 avril 2024 : Android Makers - Paris (France) 6–7 juin 2024 : DevFest Lille - Lille (France) Nous contacter Pour réagir à cet épisode, venez discuter sur le groupe Google https://groups.google.com/group/lescastcodeurs Contactez-nous via twitter https://twitter.com/lescastcodeurs Faire un crowdcast ou une crowdquestion Soutenez Les Cast Codeurs sur Patreon https://www.patreon.com/LesCastCodeurs Tous les épisodes et toutes les infos sur https://lescastcodeurs.com/

Broken Silicon
221. 2024 VRAM, RDNA 4 vs RTX 5000, Zen 5 vs Arrow Lake, PS5 Pro | AAA Developer

Play Episode Listen Later Sep 4, 2023 147:52


A Veteran AAA Dev of Obsidian, Vicarious Visions, & NetEase joins to discuss Intel, Nvidia, & AMD. [SPON: Get 10% off Tasty Vite Ramen with code BROKENSILICON: https://bit.ly/3wKx6v1 ] [SPON: dieshrink = 3% off Everything, brokensilicon = 25% off Windows: https://biitt.ly/shbSk ] 0:00 Who is Taylor? What does a Senior Tools Engineer do? 11:45 Will 2024 crush 12GB GPUs? Why did 8GB issues surprise people? 25:56 PlayStation 5 Pro Performance Targets - Will 8K ever make sense? 37:41 How do devs decide on min requirements? Why does Starfield fit in 8GB? 53:58 How does “Optimization” actually happen? When does it fail? 1:01:02 RDNA 4 v RTX 5000 Strategy – Gamers want more? Or the same for less? 1:22:53 When will games utilizes dozens of cores? 1:41:12 Intel Rentable Units vs AMD X3D & SMT-4 1:47:15 Zen 5 vs Arrow Lake - Does ARL have enough P-Threads? 2:02:06 The Road to Mainstream VR Taylor's TangentNotes Project: https://www.tangentnotes.com/ Previous Episode with Guest: https://youtu.be/DKBHsD4UUOo?si=TtyXdXyz64MdA2qH Taylor's LinkedIn: https://www.linkedin.com/in/taylor-hadden-01a09827/ https://youtu.be/NDEka3tBE1g?si=3rE4odgG73S9DFR_&t=5594 https://youtu.be/b2aD3aIAeMo https://youtu.be/7cEKTr70YBY https://youtu.be/vTNiZhEqaKk https://youtu.be/IPSB_BKd9Dg?si=x_pEsZpIeX0zi9ck

The Hardware Unboxed Podcast
Nvidia Further Destroys AMD in Ray Tracing

Play Episode Listen Later Aug 23, 2023 71:49


Episode 2: It's a news episode! We talk about Nvidia's new DLSS 3.5 technology; our early thoughts and how this impacts AMD and benchmarking. We also revisit some difficulties with VRAM benchmarking, give our thoughts on new PresentMon features, Nvidia's BIOS signature check being bypassed and more.CHAPTERS0:00 - Intro2:56 - Nvidia DLSS 3.5 Ray Reconstruction25:25 - Benchmarking is Getting More Difficult, VRAM Discussion37:49 - DLSS 3 is Coming to Fortnite, Latency Testing42:30 - GPU Busy Metric in PresentMon48:28 - Nvidia BIOS Signature Lock Bypassed58:41 - Insights into Steve's Boring LifeNEWS LINKSNvidia DLSS 3.5: https://www.nvidia.com/en-au/geforce/news/nvidia-dlss-3-5-ray-reconstruction/TechPowerUp Nvidia BIOS Lock Bypass: https://www.techpowerup.com/312631/nvidia-bios-signature-lock-broken-vbios-modding-and-crossflash-enabled-by-groundbreaking-new-toolsSUBSCRIBE TO THE PODCASTAudio: https://shows.acast.com/the-hardware-unboxed-podcastVideo: https://www.youtube.com/channel/UCqT8Vb3jweH6_tj2SarErfwSUPPORT US DIRECTLYPatreon: https://www.patreon.com/hardwareunboxedFloatplane: https://www.floatplane.com/channel/HardwareUnboxedLINKSYouTube: https://www.youtube.com/@Hardwareunboxed/Twitter: https://twitter.com/HardwareUnboxed Hosted on Acast. See acast.com/privacy for more information.

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Invites are going out for AI Engineer Summit! In the meantime, we have just announced our first Actually Open AI event with Brev.dev and Langchain, Aug 26 in our SF HQ (we'll record talks for those remote). See you soon (and join the Discord)! Special thanks to @nearcyan for helping us arrange this with the Eleuther team. This post was on the HN frontpage for 15 hours. As startups and even VCs hoard GPUs to attract talent, the one thing more valuable than GPUs is knowing how to use them (aka, make GPUs go brrrr). There is an incredible amount of tacit knowledge in the NLP community around training, and until Eleuther.ai came along you pretty much had to work at Google or Meta to gain that knowledge. This makes it hard for non-insiders to even do simple estimations around costing out projects - it is well known how to trade $ for GPU hours, but trading “$ for size of model” or “$ for quality of model” is less known and more valuable and full of opaque “it depends”. This is why rules of thumb for training are incredibly useful, because they cut through the noise and give you the simple 20% of knowledge that determines 80% of the outcome derived from hard earned experience. Today's guest, Quentin Anthony from EleutherAI, is one of the top researchers in high-performance deep learning. He's one of the co-authors of Transformers Math 101, which was one of the clearest articulations of training rules of thumb. We can think of no better way to dive into training math than to have Quentin run us through a masterclass on model weights, optimizer states, gradients, activations, and how they all impact memory requirements. The core equation you will need to know is the following: `C = 6PD`, where C is the compute requirements to train a model, P is the number of parameters, and D is the size of the training dataset in tokens. This is also equal to τ, the throughput of your machine measured in FLOPs (Actual FLOPs/GPU * # of GPUs), multiplied by T, the amount of time spent training the model. Taking Chinchilla scaling at face value, you can simplify this equation to be `C = 120(P^2)`. These laws are only true when 1000 GPUs for 1 hour costs the same as 1 GPU for 1000 hours, so it's not always that easy to make these assumptions especially when it comes to communication overhead. There's a lot more math to dive into here between training and inference, which you can listen to in the episode or read in the articles. The other interesting concept we covered is distributed training and strategies such as ZeRO and 3D parallelism. As these models have scaled, it's become impossible to fit everything in a single GPU for training and inference. We leave these advanced concepts to the end, but there's a lot of innovation happening around sharding of params, gradients, and optimizer states that you must know is happening in modern LLM training. If you have questions, you can join the Eleuther AI Discord or follow Quentin on Twitter.
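As a rough, hedged illustration of how that equation gets used (the achieved-throughput and price figures below are placeholders for the sake of the arithmetic, not numbers from the episode):

```python
# Toy walkthrough of C = 6 * P * D = tau * T: pick a model and dataset size,
# assume an achieved cluster throughput, then back out training time and cost.
params = 70e9                    # P: 70B parameters
tokens = 20 * params             # D: ~20 tokens per parameter (post-Chinchilla rule of thumb)
compute = 6 * params * tokens    # C, in FLOPs

n_gpus = 512
achieved_flops = 150e12          # placeholder: achieved (not peak) FLOP/s per GPU
tau = n_gpus * achieved_flops    # cluster throughput

train_seconds = compute / tau    # T = C / tau
gpu_hours = n_gpus * train_seconds / 3600
print(f"~{train_seconds/86400:.0f} days on {n_gpus} GPUs, ~{gpu_hours:,.0f} GPU-hours")
print(f"at a placeholder $2/GPU-hour: ~${2*gpu_hours:,.0f}")
```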
Show Notes
* Transformers Math 101 Article
* Eleuther.ai
* GPT-NeoX 20B
* BLOOM
* Turing NLG
* Mosaic
* Oak Ridge & Frontier Supercomputer
* Summit Supercomputer
* Lawrence Livermore Lab
* RWKV
* Flash Attention
* Stas Bekman

Timestamps
* [00:00:00] Quentin's background and work at Eleuther.ai
* [00:03:14] Motivation behind writing the Transformers Math 101 article
* [00:05:58] Key equation for calculating compute requirements (tau x T = 6 x P x D)
* [00:10:00] Difference between theoretical and actual FLOPs
* [00:12:42] Applying the equation to estimate compute for GPT-3 training
* [00:14:08] Expecting 115+ teraflops/sec per A100 GPU as a baseline
* [00:15:10] Tradeoffs between Nvidia and AMD GPUs for training
* [00:18:50] Model precision (FP32, FP16, BF16 etc.) and impact on memory
* [00:22:00] Benefits of model quantization even with unlimited memory
* [00:23:44] KV cache memory overhead during inference
* [00:26:08] How optimizer memory usage is calculated
* [00:32:03] Components of total training memory (model, optimizer, gradients, activations)
* [00:33:47] Activation recomputation to reduce memory overhead
* [00:38:25] Sharded optimizers like ZeRO to distribute across GPUs
* [00:40:23] Communication operations like scatter and gather in ZeRO
* [00:41:33] Advanced 3D parallelism techniques (data, tensor, pipeline)
* [00:43:55] Combining 3D parallelism and sharded optimizers
* [00:45:43] Challenges with heterogeneous clusters for distribution
* [00:47:58] Lightning Round

Transcription
Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20]Swyx: Hey, today we have a very special guest, Quentin Anthony from Eleuther.ai. The context for this episode is that we've been looking to cover Transformers math for a long time. And then one day in April, there's this blog post that comes out that literally is called Transformers Math 101 from Eleuther. And this is one of the most authoritative posts that I've ever seen. And I think basically on this podcast, we're trying to give people an intuition around what are the rules of thumb that are important in thinking about AI and reasoning by AI. And I don't think there's anyone more credible than the people at Eleuther or the people training actual large language models, especially on limited resources. So welcome, Quentin. [00:00:59]Quentin: Thank you. A little bit about myself is that I'm a PhD student at Ohio State University, starting my fifth year now, almost done. I started with Eleuther during the GPT-NeoX20B model. So they were getting started training that, they were having some problems scaling it. As we'll talk about, I'm sure today a lot, is that communication costs and synchronization and how do you scale up a model to hundreds of GPUs and make sure that things progress quickly is really difficult. That was really similar to my PhD work. So I jumped in and helped them on the 20B, getting that running smoothly. And then ever since then, just as new systems challenges arise, and as they move to high performance computing systems and distributed systems, I just sort of kept finding myself falling into projects and helping out there. So I've been at Eleuther for a little bit now, head engineer there now, and then finishing up my PhD and then, well, who knows where I'll go next. [00:01:48]Alessio: Awesome. What was the inspiration behind writing the article? Was it taking some of those learnings?
Obviously Eleuther is one of the most open research places out there. Is it just part of the DNA there or any fun stories there? [00:02:00]Quentin: For the motivation for writing, you very frequently see in like the DL training space, like these Twitter posts by like, for example, like Stas Bekman at Hugging Face, you'll see like a Twitter post that's like, oh, we just found this magic number and everything is like 20% faster. He's super excited, but doesn't really understand what's going on. And the same thing for us, we very frequently find that a lot of people understand the theory or maybe the fundamentals of why like AI training or inference works, but no one knows like the nitty gritty details of like, how do you get inference to actually run correctly on your machine split across two GPUs or something like that. So we sort of had all of these notes that we had accumulated and we're sort of sharing among engineers within Eleuther and we thought, well, this would really help a lot of other people. It's not really maybe appropriate for like a paper, but for something like a blog post or technical report, this would actually maybe squeeze a lot of performance out of people's hardware they're already running on. So I guess there are a lot of projects in Eleuther that we're sort of trying to share notes with people in a way that typical institutions don't. They sort of live within that institution and then you go to a different institution and they do something very similar, but without the lessons of the previous. And it's because everyone's trying to do their own special sauce with their own stack. Whereas Eleuther, we don't really have that constraint and we can just share everything to everybody. [00:03:14]Swyx: Yeah, this is a level of openness that basically very few people actually embrace. One, it's an extra effort to write things down, of course, but two, it is secret sauce and so that not many people do it. And therefore, oftentimes the only way to learn this stuff is to actually work in one of the large model labs. And so you guys are doing a lot. The only other instance where I can think of where people actually open sourced their process was Facebook's OPT. What else is similar, like sort of trade knowledge, but not formal research knowledge? [00:03:45]Quentin: I would say Bloom. So the Hugging Face Bloom project in big science and all of that, that was very open. I'd say it's the same caliber, if not more detailed than OPT. Other than that, I think there was like a doc from Microsoft on like their Turing NLG. Their paper is pretty relaxed in that it did talk about some of those challenges. Other than like OPT and Bloom and us, I can't think of any. It's a new thing. [00:04:10]Swyx: It matters that you are going for the sort of good enough rules of thumb, because I think a lot of people try to go for precision and being overly precise actually is not helpful. Right. Yes. [00:04:20]Quentin: You'll see some like statements in the blog posts that are just like, we think this is about 1.2 in our experience. And, you know, we don't go any further into detail and it would take maybe an extra month for us to chase down every single little piece of memory. But instead, like getting good enough is still helpful to people. [00:04:36]Alessio: Let's jump into it. The first part of the article, and we'll put this in the show notes so people will be following along with the post. So we don't need to read every single equation and every footnote for it. [00:04:46]Swyx: Okay. 
[00:04:46]Alessio: But the core equation here is that not the cost of compute, but the compute required to train a transformer model is roughly equal to tau times T, where like T is the, where tau is the hardware setup throughput that you have. So number of GPUs times the actual flops per GPU. And then T is the time spent. I think people can visualize that pretty easily. It's basically like how many GPUs do you have and how much do you let them run for? And the things that come to it that people have read before in the Chinchilla paper in a way, and the OpenAI scaling law is that you can then equal this to 6PD, where P is the number of parameters in the model and D is the size of the, of the dataset in tokens. So talk a little bit about how people should think about the two. I think a lot of times the focus is on tokens parameter ratio in the training dataset and people don't think as much about the actual flops per GPU, which you're going to mention later in the blog post too, in terms of how much you can get out. So how should people think about this when they're building a model and where should they go to this equation as they're starting to think about training their own transformer-based [00:05:58]Swyx: model? [00:05:58]Quentin: You touched a little bit on the fact that people usually start with the dataset. So you have some dataset that you want to train a model on. And then from there, from the 6PD, you should see, okay, I should have about six tokens per parameter. So that determines my model size thereabouts for Chinchilla Optimal. Since then we've seen that you need more, something like 20 or more than that, to get a good quality model. But the next question that should be on your mind in terms of a systems perspective is how long is it going to take for this model to train and what kind of budget should I expect? So let's say I want some cloud instance for some amount of time and each of them will have some price attached to it. So that's where the throughput comes in. So now that you have this model, this number of parameters, you should map that to a transformer architecture and you should benchmark what throughput you get on your software stack for that type of model. So now you have your flops per second on a single GPU. And then given whatever parallelism scheme, which I'm sure we'll get into, like data parallelism or tensor parallelism or whatever else, how is that flops number going to scale to whatever number of GPUs? And then from there, you're going to get a time. And if you have a time, you have a cost. Those are like the business answers that you'll be able to get using this formula. That's why we sort of split it into the T and the throughput terms so that you can solve for one of them, which is usually get throughput, need time, and from time you get cost. In a nutshell, that's the answer. [00:07:19]Alessio: One thing that I noticed, you mentioned some of these laws are only true when a thousand GPUs for one hour cost the same as one GPU for a thousand hours, given that we have a shortage of the biggest GPUs out there. Any thoughts there on how people should prioritize this? [00:07:36]Quentin: Yeah, so I would say you should find what the minimum number of GPUs is to just fit your model first. The memory bottleneck is your biggest problem if you have a sizable model. If it's a small model, nobody cares. But most models that people care about will need to be split across multiple GPUs.
So find the minimum number of GPUs to just fit your one instance of your model and then calculate how long that's going to take. If it's a reasonable amount of time, then you're done. If it takes too long, then you need to start worrying about having multiple instances of that model. I always feel like you should go with the minimum number of GPUs because the more number of GPUs that you have, the more likely it is for things to break. So I would say just find out what time is reasonable for you and then fit the number of GPUs to that and no more. Because people get greedy and they say, if I have twice the GPUs, I can get this done in half the time. And then you end up taking three times the time because everything is breaking every day. And that's when I am up at midnight trying to fix your model that's broken. [00:08:34]Swyx: We had a previous guest who has invested a lot in their framework for training these things. Would there not be an equivalent open source framework you guys would have made that would help with scaling up GPUs linearly like that? Or is this an oversimplification? [00:08:50]Quentin: Okay, yeah. So maybe I should step back. Both Mosaic and us have our own sort of software stack recipe that scales well, theoretically. But I'll get to that in a minute. Mosaic is all based off optimizer sharding. So it's based off ZeRO. So you basically perfectly split your model optimizer and your parameters and your gradients across all of the different GPUs. So your aggregate memory is number of parameters divided by number of GPUs. Same thing for optimizer and so on. Whereas we at Eleuther use a Megatron-DeepSpeed based library. And for that, it's a bit more complex. So the efficiency can be a little higher, but it's more prone to failure at the same [00:09:30]Swyx: time. [00:09:30]Quentin: So you kind of have to tune it. In both cases, getting back to like the practical case, you should be able to get linear speed up by adding more GPUs. The problem is that there are hardware failures. You tend to have problems with like maybe loss will overflow if you have too many GPUs or maybe one GPU will hang. You might have software issues. You might have synchronization issues. And that's why I'm saying practically that you should take the minimum number of GPUs that you have because those are the easier cases to debug. That make sense? [00:10:00]Swyx: Yeah. [00:10:00]Quentin: Any more detail on any specific point? [00:10:02]Swyx: Not particularly, just because we haven't actually had to debug those things. But I imagine basically there's a lot of return towards encoding this knowledge into software and not repeating it again. So it makes a ton of sense. I think Alessio had more questions before we move too far into high level, more questions on just the equation itself. I think we want to spend time on essentially, this is the central equation of figuring out compute requirements. Yeah. [00:10:25]Alessio: Another thing in it is that the compute is split between the forward pass and the backward pass, and forward is 2PD, backward is 4PD. Why is that the ratio between the two? Can you explain that? Why is it two and four? [00:10:39]Quentin: Yeah. [00:10:40]Alessio: Why is it twice the amount? [00:10:42]Quentin: Oh, okay. Intuitively for forward pass, you're just moving, you're propagating forward the inputs through the layer. And then in the backward pass, you're doing something a little more complex than that. You're doing back propagation.
And I don't think I can explain it intuitively enough to go into more detail on the exact [00:10:58]Swyx: numbers. Yeah. [00:10:58]Quentin: That's okay. [00:10:59]Swyx: I feel like you want to get out a whiteboard and start drawing like, you know. [00:11:02]Quentin: That's what I would normally do. [00:11:03]Swyx: Tangents and gradients. It's actually surprisingly low to do the back propagation. Honestly, that's one of the fundamental things I love about the math of deep learning so far that as I've explored it, which is, it's surprisingly efficient as compared to other, I guess, numerical methods you might be exposed to and, you know, college calculus. Yeah. [00:11:22]Alessio: And I think the other thing is that things sound simple, you know, when people go on Twitter and say, Oh, 20 is like the optimal ratio. And it's like, then it's like, well, why is that the number? And the answer is usually much, much harder, like what we're seeing right now. So I think it's a, it's a good reminder that the numbers are simple, like all the best and most popular, like math equations are like, so elegant. Obviously the proof behind that is, it's not that easy. That's always a good reminder. [00:11:52]Swyx: I want to put this equation to the test a little bit. We can do this from either GPT-3's perspective or GPT-NeoX, whatever you're more comfortable with. You have this distinction of actual flops versus theoretical flops. And a lot of times when people report the flops it took to train a model, like we just saw one in Llama 2 where the estimate is given as the amount of flops, and that's what we go with. So GPT-3 took 3.14 times 10 to the power 23 flops. That is the theoretical flops. I want to get to a point where I can sort of work out if a number passes the smell test. And I wonder how to do that because I should be able to plug in this equation, right? I know that GPT-3 was trained on 300 billion tokens. I know the parameter size of 175. Is it, is it just like a 6 times 175 times 300? Like I haven't done the math, but what are the nuances here that you might want to call out? [00:12:42]Quentin: Theoretical flops is usually given from, you have a given set of hardware and this is what you expect your hardware to get. The problem is that in practice, full utilization, that's the key word, right? Because in practice, there are a lot of cases where like you're spending time waiting on data movement from like the GPU to CPU. Or for example, you might be waiting to synchronize across the different GPUs. So there's a lot of idle time basically that you're going to be spending during training. [00:13:05]Swyx: Smell tests. [00:13:06]Quentin: I don't know if I have a smell test myself, to be honest, like maybe I'll look at like what sort of flops, what you would expect on like an A100. There's sort of just an expected flops for a given GPU that everyone sort of knows what you should expect. So like for an A100, that number is somewhere between 100 and 180 teraflops, which is what you would expect to see on an A100. For a V100, like an older GPU, it's something more like 30 to 40. So people sort of know, given the kernels that we're running for deep learning, what sort of flops you expect. And then you sort of compare that to the theory, to the theoretical flops that people are reporting and see if that matches your expectations. [00:13:47]Swyx: Yeah.
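To make that smell test concrete, here is a short sketch. The GPT-3 parameter and token counts are the ones mentioned above; the cluster size and wall-clock time are hypothetical, purely to illustrate the check:

```python
# Smell test: check a reported FLOP count against C = 6 * P * D, then back out the
# implied per-GPU throughput and compare it to what an A100 should sustain.
# The cluster size and training duration below are hypothetical assumptions.

reported_flops = 3.14e23                 # GPT-3's reported training compute
estimated_flops = 6 * 175e9 * 300e9      # 6PD with P = 175B params, D = 300B tokens

num_gpus = 1024                          # hypothetical cluster
train_days = 30                          # hypothetical wall-clock time
gpu_seconds = num_gpus * train_days * 86_400
achieved_tflops_per_gpu = reported_flops / gpu_seconds / 1e12

print(f"6PD estimate: {estimated_flops:.2e} vs reported {reported_flops:.2e}")
print(f"Implied throughput: {achieved_tflops_per_gpu:.0f} TFLOPs/GPU "
      f"(expect roughly 100-180 on an A100, 115+ with a tuned stack)")
```

If the implied per-GPU number lands far outside that expected band, either the reported FLOPs, the cluster size, or the training time does not pass the smell test.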
[00:13:47]Alessio: And in the article you mentioned for the A100, like if you're seeing below 115 teraflops a second, there's something wrong with your model or hardware. How did you get to 115? Is it just, you know, production observability and like you've seen over months and months and months that like that's the baseline or how do you come up with the numbers like that? Yeah. [00:14:08]Quentin: For a number like that, we basically, we compared a lot of different frameworks. So like I mentioned before, Mosaic has their own framework and we have our own framework. They all have their own flop counters too, right? And we saw across a bunch of different hardware configurations that if you tune things correctly, you should be getting above 115 in pretty much all cases. So like there are some cases where things are tuned poorly or your system is a little weird, but we've never been able to get a new system and not been able to get above [00:14:35]Swyx: 115. [00:14:35]Quentin: If something is below 115, you have something really wrong in your software. But that's really all it is, is just comparing across software stacks and hardware systems. [00:14:44]Alessio: What about different GPUs? We had George Hotz on the podcast and he talked about AMD cards and how in theory their flops should be much better than some Nvidia cards, but the reality is like the CUDA runtime makes up for it. How should people think about improving that? You know, like do you see, okay, the A100 is like 115 teraflops. I'd rather just stick with this than try and figure out all the kinks of like a better AMD card or any thoughts there? [00:15:10]Swyx: Right. [00:15:10]Quentin: Well, that's sort of touching on developer time, right? And which ends up being more expensive because at the end of the day, the AMD and ROCm software stack has a long way to go. I would say most things run there, not particularly efficiently, but you're going to have weird bugs that no one has encountered before. One of the big pluses of going with the Nvidia and PyTorch stack is that there are thousands of GitHub issues with everyone facing the same problem as you and resolving them quickly and in an open source way is probably the biggest benefit of going with the Nvidia software stack right now. AMD has about the same hardware, software, not so much. And they haven't quite got the momentum in the open source realm, for example, to get close. Like something, for example, like Flash Attention, it's spread to more Nvidia GPU types than it has like to AMD at all. And waiting on those latest and greatest features to reach AMD is something that's prohibitive to a lot of people, but it's getting there. I'm running a lot of experiments on AMD right now because it's sort of reached the government lab supercomputers now. And so a lot of experiments are going there and it will catch up, I'd say within a few [00:16:14]Swyx: years. [00:16:14]Quentin: Awesome. [00:16:15]Swyx: Maybe just talk about what's available from the government labs and I heard the original, the origin of Eleuther started with a grant for TPUs. Is that right? [00:16:24]Quentin: Yes, that was a little before me, but there was a lot of just grabbing a Google Cloud TPU pod or something like that for a lot of the original TPU work on Mesh TensorFlow, which is like now like an ancient distributed deep learning library. [00:16:36]Quentin: Eleuther got a grant, an insight grant with Oak Ridge last year, and we got quite a bit of Summit Compute.
So Summit is a V100 based supercomputer. It's got some weirdness to it. So there's six V100 GPUs per node. And we did a lot of experiments there. It's a challenging system to scale to because your interconnect across nodes is kind of slow in comparison to within a node, which I think we'll get to later. But now Oak Ridge has moved to AMD. So the next grant that we're trying to work towards is on Frontier, which has four AMD GPUs per node and again has a slower interconnect across nodes. So we get all of those new challenges again to try and overlap things. But that's just like you have Oak Ridge, you have Lawrence Livermore. There's a lot of government supercomputers that you can apply for compute towards like open researchers too. It's sort of a new thing. I think we're one of the first like us and like LAION, for example, is another organization that's getting compute from government providers and such. They're all moving to AMD as well. And we look forward to exploring that with them. [00:17:42]Swyx: Yeah. [00:17:43]Alessio: The computing is definitely, it used to be easy to find the GPU. Now, not as much. So you got to find them anywhere. [00:17:49]Swyx: Yes. [00:17:49]Alessio: Let's talk about memory requirements a little bit. So you touched on this a little bit before and just before this, we had Tri Dao on the podcast from FlashAttention and memory speed was one of our main focuses, but this time we're being bound by actually memory size, like the VRAM itself, when it comes to model weights and parameters and optimizer states and all that fun stuff. Let's go through this and Sean, we can, we can take turns. There's a lot to cover here, but maybe we can start from model weights. So one topic we covered a lot in the past is precision and quantization. That's obviously one of the main drivers of memory. You mentioned most of, in the article, most transformers are mixed precision, like FP16 plus FP32 or BF16 FP32, and they can be cast down. And you mentioned up to like INT8 without a lot of performance hit. So let's start there and maybe run people through some of the maths and like the byte per parameter ratio and different precision. [00:18:50]Swyx: Sure. [00:18:51]Quentin: So when I started deep learning, it was all FP32. You have 32 bits, four bytes per parameter. Things were pretty simple. You didn't have to do any loss scaling at all. But the problem was that you didn't get a whole lot of flops once NVIDIA moved to V100s and introduced Tensor cores. So Tensor cores do all of their computation at FP16 precision. So you're kind of throwing all of those away if you're doing things in FP32. So once the hardware moved to V100, the software moved to like mixed precision and APEX and AMP and such. And one counterintuitive part of mixed precision is that you actually require more memory when you're training because you need an FP16 copy of the weights and an FP32 copy of the weights. The FP16 copy is where you're doing like your actual computation on the Tensor cores. So you get maybe it's not uncommon to get double the throughput that you would see before in FP32. And then you at each step update that FP32 copy with the FP16 update. So both need to be stored in memory. The problem with that is that FP16 is very precise but doesn't have a whole lot of range, [00:19:55]Swyx: dynamic range. [00:19:55]Quentin: So you have a really big mantissa if you're thinking in terms of like floating point representations, not a whole lot of exponent.
So BF16 puts more of the bits from the mantissa back to the exponent. So you have a much higher range and a lower precision. And that gets rid of all of this instability problem and loss scaling and such that anyone familiar with debugging knows how unstable it can be, especially for large scale training. And BF16 does away with a lot of that, but it's only supported on A100s. So you see the back and forth between hardware and software. So every time NVIDIA introduces some new Tensor cores or BF16 support or something like that, the software adapts to support it and then training adapts. And then now you mentioned like INT8 and such. Now we're seeing that you have some model that's been trained in FP16, FP32, whatever else. And then now you want to, with minimal loss in accuracy, quantize that model into a smaller representation like INT8 and now like INT4 and things like that and see what you can get away with. And then since deep learning is such like a stochastic problem that a lot of those last bits of precision don't really matter is what we're finding. And I expect that to continue. [00:21:06]Alessio: And so just to put some numbers to it, when you have FP32, you need four bytes per parameter at inference time to load it in memory. If you have an eight-bit model quantized down, you need one byte per parameter. So for example, in an H100, which is 80 gigabytes of memory, you could fit a 70 billion parameter model in INT8, but you cannot fit it in FP32 because you will need like 280 gigabytes of memory. So how much does that play into it? Like you mentioned it was all FP32 when you first started. Is it just like a development complexity thing, like going down to FP16 and then INT8? Or if they could get a GPU with like a terabyte of VRAM, will people just load this memory as like FP32 weights or would they still want to quantize them to make them more efficient? Right. [00:22:00]Quentin: I would say even if you had infinite VRAM, you would still want a quantized model, just a bigger model that's quantized is what I would say. And that's because like I was mentioning there at the end, how like deep learning is very stochastic and a lot, you could have all the precision in the world, but ultimately it's meaningless when you still depend so much like on what the input is. And you depend so much on little variations and maybe a few more samples of training data would matter more. A lot of that precision in a nutshell doesn't really matter in deep learning. All that matters is the big picture. What is that neuron actually saying? And not the tiny details of what it might be thinking. Oh, I also wanted to mention that even if you have an A100, the actual model size is quite a bit smaller that you could load than what you mentioned. That's because of the KV cache. So the KV cache intuitively during inference, it only matters during inference and think intuitively if you're writing a paragraph, you want to remember every single previous word that you've written before you write the next word. So like what is autoregressive language modeling? It's filling in the next word, the next token. So if I say like the dog went to the, and I need to write the next word, I would say park or something. Before I write the next word, my memory is wiped and I have to read the whole thing again. That is life without a KV cache. And a KV cache says, remember everything that I've generated before, as well as all the context before what I've generated.
But the memory overhead for a KV cache commonly is either comparable or larger than the model in some cases, if you have a really long context. And I think the exact equation is something like, oh, it's like two times the number of layers, times the number of heads, times the dimension of each head. And then there's two of those. You have one for K, one for V. But that was just a quick aside. Yeah. [00:23:44]Alessio: I know this is Transformers math, but do you think one of the interesting things about RNNs too, it's like moving away from this, like KV cache that scales with the sequence length and having like a fixed-size state instead. I know those are some of the things that people are working on. [00:24:00]Swyx: Yeah. [00:24:00]Quentin: So there's a paper that I was involved with called RWKV that I would recommend people read. It is answering this exact question. So how do you get Transformers quality without this quadratic attention overhead that Transformers requires? So it is interesting. I don't know if I can really dive too deep into the technical details there. I'd recommend people read the paper. But yeah. [00:24:23]Swyx: Yeah. [00:24:23]Alessio: It's interesting to see if attention is all you need, or maybe attention is all we need, but we need better ways to make it infer in a good way. [00:24:33]Swyx: We've actually done an unreleased episode with one of the RWKV core members and they call it soft attention or light attention. I forget what they call it, but yeah, just ways to approximate it such that it's linear and not quadratic. That's great. Yeah. [00:24:47]Quentin: I didn't know that you were involved. [00:24:48]Swyx: That's great. How did you get involved? Is it just because like everyone just hangs out in Discord and talks about the future of Transformers? Oh yeah. [00:24:55]Quentin: I mean, the RWKV people specifically are in Eleuther all the time. Like they're very close collaboration with us. And my contribution was we have all of these experiments done by all of these people on RNNs and how they relate to Transformers and how do we turn that into a paper and disseminate that digestibly so that people don't have to read through like a Discord log from a year ago to understand what's going on. [00:25:16]Swyx: Oh my God. [00:25:16]Quentin: Just read this paper. So that took some work, but I wasn't a core contributor. So that's why I don't want to go into like the technical details. But yeah, that's how I did. [00:25:24]Swyx: We'll try to get that RWKV episode out. It seems like there's increasing mentions of it and they are doing pretty important work as far as scaling these models are concerned. Okay. So we discussed inference type quantization and memory requirements. And then you also had a section on training with a lot of stuff I think mentioned. I think we probably want to spend the most of our time on optimizer states and the Adam optimizer. Yeah. What are your takes on it and what should people keep in mind when they deal with these optimizers? Okay. [00:25:57]Quentin: I would say the Adam optimizer is good at what it does. It's sort of a broad question. So let me think. You have the copy of the weights and then you have your momentum and your variance that [00:26:08]Swyx: you store. [00:26:08]Quentin: And like, okay, maybe an intuitive explanation for momentum is that like, let's say you have a canyon and you're trying to get to the bottom. And if you're just doing basic SGD, then every step is going to be an equal size.
Whereas if you're using something like Adam with the momentum term, then your steps should be progressively larger because you can see, oh, the general trend is we're heading downwards very quickly. But stepping back from that, since you have all of these extra terms in Adam, you require a lot more memory to store it. Like three times as much memory as SGD. And if you have all of this memory being spent on your optimizer states, then how do you distribute it across GPUs? Because you'll find that what ends up being your bottleneck more than just raw compute, raw flops on a given GPU is your parallelism. And that falls back onto how much model you can fit on a single GPU before you need to split it up across a bunch of GPUs. And then you end up spending time, more time with them talking to each other than actually making progress. So that's why all of this time in the blog post is spent on how do you distribute your model? What do all those different distributed strategies look like? Which ones are more efficient? And given that a lot of your memory is being spent on optimizers, how do you distribute that optimizer specifically? Because a lot of people, when they talk about parallelism, they talk about model parallelism, the parameters themselves. In actuality, when you're training, a good portion of your memory is actually spent on optimizer states. So what specific part of that would you like to go into? Would you like to go into like zero or sharded optimizers? [00:27:36]Swyx: I think the sharded optimizer stuff is really interesting, but I think we're kind of leaving that towards the end, right? Because that's the maybe more advanced distributed sections. Here, I think we're just going for rough intuition for people who've maybe are familiar with the ideas of these optimizers, but haven't actually had to implement them yet. They read your code, but they don't really understand the intuition behind the code. I see. [00:28:00]Alessio: And Quentin, when you say in the blog post, it says, Adam is magic. How much of it is like actual magic, even to like people like you that are pretty close to the metal, so to speak? Do some of these things just come as gospel? It's like, I know this works, like I'm not touching it. I'm just leveraging it. How much of it are you actually thinking about improving on in your day-to-day work? I see. [00:28:22]Quentin: So I'm a systems guy. I'm an engineer. And a lot of these things come to me as magic. Adam comes to me as magic. I see it from the gods. I say, this is how a deep learning model is trained. And this is how the next step is calculated. And then I say, okay, how do I make that fast? I would say I do look at ways to improve upon it using things like second order optimizers. So there's a lot of research on there because they're hard to distribute. But the core contribution for me always comes down to someone else has done like some deep learning optimization and I need to make it run fast. So I can't really speak to the motivation of why Adam came about other than like simple, intuitive things like I mentioned with like the momentum. But what matters to me is that Adam takes more memory than SGD, specifically three times. And all of that memory needs to go somewhere and it needs to be split efficiently. [00:29:14]Swyx: Yeah. [00:29:14]Alessio: So when you add them all up, you got 12 bytes per parameter with vanilla Adam. [00:29:20]Swyx: Yeah. [00:29:20]Alessio: And then you still have the model parameters in memory too.
So as you mentioned, you need to keep a copy of both for like an FP32, FP16 mixed setup, a copy at both precision levels. So it's six bytes per parameter. Right. [00:29:36]Quentin: Taking a step back again, is that like, okay, most people think of your model getting big. So you need to split with model parallelism purely, something like tensor parallelism. But we can see that the model only takes like two bytes per parameter if we're doing FP16. Whereas the optimizer itself requires four bytes per parameter for the model states, four bytes for momentum, four bytes for variance. So what matters more is how do you split your optimizer efficiently and how do you store it efficiently? And something like bits and bytes, where the optimizer, you got like eight bit Adam, where those optimizer states are only one byte per parameter instead of four or something like that. That is going to give you a much better return on your model training and on your memory overhead required than if you were to, for example, quantize your pure like FP16 model weights down to INT8 or something. So for training specifically, your optimizer memory matters a lot. The most in most cases. [00:30:31]Swyx: Well, yeah. [00:30:31]Alessio: And before we dive into zero, just to wrap up the items that you're going to shard later. So you have the parameters, you have the optimizer states, and then you have the gradients. Just maybe touch a little bit on that. And then we can talk about how to efficiently load them in GPUs. [00:30:48]Quentin: So the parameters are the FP32 copies of the parameters. We include them in the optimizer discussion. Some people don't, but just for clarity, it's 12 bytes per param for the optimizer states and four of them are for that FP32 copy of the weights. Four of them are for the momentum. I already went into why it's important to store momentum, but that's also per parameter. You need to store where that parameter is going and where it's been going in the past. You also need to know, okay, we know where it's going, but there's going to be bumps on this canyon that we're going down. So we need to store its variance. How often are those bumps? Should we be focusing more on the momentum? Or is this parameter just kind of jumping around everywhere? Those are all important answers that we need the optimizer to store, and it's per parameter. So that's where all three of those terms come from. And we also include some competing optimizers, for example bitsandbytes and SGD, to show that depending on your optimizer, you may store all or none of these and in different representations. [00:31:50]Alessio: I'm looking at the total training memory. You essentially have model memory, optimizer memory, gradient memory, and activation memory. I think that's one of the last discussed things. So maybe just give people a little bit of a view. [00:32:03]Swyx: Yeah, this is completely new to me. [00:32:05]Alessio: Active, you know, recomputation, checkpointing, and all of that. [00:32:08]Swyx: Right. [00:32:09]Quentin: So, okay. So to summarize before activation checkpointing, which will be complicated, you have your model params, like I mentioned before, they used to be FP32. Now they're probably BF16, maybe FP16 if it's an older GPU. Then you have your optimizer. That's where a lot of the memory is going. And it's your high precision, usually FP32, copy of the weights. So that's four bytes per param.
And then you have, optionally, a couple more terms like we just discussed, like momentum or variance or whatever else, depending on what your optimizer is. Then you have your gradients. So your gradients is what is the gradient update that we get after running the forward pass on the model. And that's going to be whatever your low precision copy of the weights is. So like two bytes per param, if you're using FP16 or BF16. And all of those are sort of set in stone. And that overhead is not going to go away for the duration of training. Your gradients might get cleared after you back propagate them, but your optimizer states and your model states aren't going away. That memory overhead will be there. Activation recomputation and activation memory is dynamic. So some people will come and have this problem where the model loads fine for training. But then when you actually run your first iteration, or you run some future iteration or something like that, you run out of memory, seemingly at random. And it's because of these activations that you're computing on the fly. Good summary, or do you want to get into activation recomputation now, or do you want me to touch on anything else? [00:33:35]Alessio: Yeah, I was going to say, when is the recomputation happening? How does it decide between recomputing versus storing? And talk a bit more about that, maybe. [00:33:47]Quentin: Yeah, okay. So there's a lot of different ways to do this, but I would say there are a few main ones. First is a very simple scheme. You recompute everything. Every single activation that you calculate is just going to be either used or thrown away until the end. So in that case, you care very much about memory. You care very little about compute. Maybe this would be a case where you have to distribute across a lot of different GPUs, for example. And your communication speed is really low. Then that might be a good case for you to just recompute everything. It happens rarely, but it happens. Next up would be something like selective recomputation. So in selective recomputation, which Megatron has a good paper on, and I believe the figure that we have in our blog post is from, in that case, you sort of do a weighted decision for each activation. So for really big activation tensors, you decide, is this going to be more expensive to save in terms of memory or to recompute in terms of compute? So that's sort of the smart scheme that Megatron implements. And there's a lot of different heuristics they use. It's probably not worth mentioning off this super long equation on a pod, but you should go and read that paper if you're interested on selective recomputation. And then a really stupid scheme that most people go with, including NeoX, would be something like, instead of doing all of these heuristics, you just say, if my tensor is bigger than X, I throw it away. And you set X to some static number, and that's it. And that is good enough for a lot of cases. [00:35:18]Swyx: Why is it good enough? [00:35:20]Quentin: You don't want to store more than, you know, X-sized tensor. And some fall above that, some fall below it. And you're not trying to squeeze. You care more about getting something close enough to what the actual heuristic should be without actually computing the heuristic because you don't want to spend the time writing that heuristic code. [00:35:37]Swyx: Cool. I think that does take us on a grand tour of the memory math. Is there any sort of high-level takeaway before we go into the distributed stuff? 
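As a quick recap of that grand tour of the memory math, here is a minimal sketch. It assumes mixed precision training with vanilla Adam, the byte counts follow the rules of thumb discussed above, the 20B-parameter example is purely illustrative, and activation memory is deliberately left out because it depends on batch size, sequence length, and the recomputation scheme you pick:

```python
# Rough training memory estimate (mixed precision, vanilla Adam), before activations.

def training_memory_gb(params, bytes_model=2, bytes_grads=2, bytes_optimizer=12):
    """bytes_model: BF16/FP16 weights; bytes_grads: low-precision gradients;
    bytes_optimizer: FP32 master copy (4) + momentum (4) + variance (4)."""
    total_bytes = params * (bytes_model + bytes_grads + bytes_optimizer)
    return total_bytes / 1e9

# Example: a 20B-parameter model needs roughly 320 GB before activations,
# which is why it has to be sharded across many GPUs rather than fit on one.
print(f"{training_memory_gb(20e9):.0f} GB (model + gradients + Adam states)")
```

Swapping in 8-bit Adam or plain SGD changes the `bytes_optimizer` term, which is exactly why the sharding discussion below spends so much time on optimizer states.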
Zero and all that. Perhaps more detail than most people have ever encountered. And so I'll repeat the equation that Alessio mentioned again, which is total training memory now has all these components that you've mapped out for the first time as far as we're concerned. Model memory, optimizer memory, activation memory, gradient memory. We covered quite a few algorithms as to the choices you can make there. Anything else that you want to mention about just memory math? I don't think so. [00:36:11]Quentin: I think that about covers it. I will say that it's a very different scheme for training and inference. It's common for people to say, oh, BF16 is the best. Done. Whereas a more correct take is that during training, precision matters a bit more. So BF16 will be around longer for training than it will for inference, in which case your model is sort of already baked. And it definitely doesn't need some of those last bits of precision so you can get away much easier with going to int8 for inference rather than training. So everything that you learn for training has to be relearned for inference and vice versa. [00:36:44]Swyx: There's a third category. You're talking about training versus inference. This third category is emerging with regards to fine-tuning and perhaps parameter-efficient methods of fine-tuning. The naive way to implement fine-tuning is just to do more training. But I don't know if you've developed any intuitions over fine-tuning that's worth inserting here. Any intuitions? If you were to write fine-tuning math, what would go in there? That might be an interesting diff to training math. [00:37:10]Quentin: I think there's a lot of questions that are unanswered for fine-tuning. For example, we know scaling laws for training. And some people have done scaling laws for fine-tuning. But how does a model that's already been trained on one domain transfer to another in terms of fine-tuning size? How many tokens per parameter should you have for your fine-tuning dataset? Maybe I'm ignorant, but I feel like a lot of those sort of practical questions on how a model can transfer and how a model can learn or grok some new ability that wasn't in its original training dataset is something that I would definitely put inside a fine-tuning blog post. [00:37:45]Swyx: Something related to perplexity and, I guess, diversity of the tokens that you get. [00:37:49]Quentin: Yeah, sort of dataset transfer is something that I would be curious in. Learning rate transfer is another one. So your model has some decayed learning rate over the course of training. How does that change for fine-tuning? Things like that. [00:38:00]Swyx: All right, cool. Thanks for indulging that stuff. Sure. Yeah. [00:38:03]Alessio: I think after all of this, you can quickly do the math and see that training needs to be distributed to actually work because we just don't have hardware that can easily run this. So let's talk a bit about that. So zero is one of the first things that you mentioned here, which is focused on sharded optimizers. Maybe run people through that and how to think about it. [00:38:25]Swyx: Sure. [00:38:25]Quentin: So zero is centered around two communication operations. And the first is scatter. And people should be looking at the zero figure that I think we have. [00:38:35]Swyx: Yeah. [00:38:36]Quentin: So there's a figure in the paper with parameters, gradients, and optimizer states that people should be looking at when I'm talking about this. Every GPU is going to get its own equal portion of the slice. 
And if we're doing... There are different stages of zero, but let's just start off with assuming that it's an equal slice of the optimizer states, gradients, and parameters. That would be zero three, stage three in that case. And we do that with a scatter. And the scatter takes, say, one over N GPUs, plus this offset of that slice goes to that GPU. Now all of the GPUs have an equal slice that's in its rank order. And then during each training step, that GPU is going to wait for all of the other slices to communicate so that we now have a whole pie on that GPU, that single GPU. Once we have that whole pie, we do the forward pass on it. And then we distribute that forward pass to all of the others using a gather. So it's a scatter, reduce-scatter specifically, and then a gather back to all the others. And you do that each step. So the point of it is that you're sharding these states across GPUs. And with the different stages, you'll see in that figure that the optimizer state is taking the most proportion, which is because of what I mentioned before. We're including the FP32 copy and we're doing Adam. So we need those four bytes per param for momentum and for variance. And then zero stage one, which is the most common one, is just optimizer. Zero stage two is optimizer plus gradients. And zero stage three is optimizer, gradients, and model parameters. But it all comes back to this splitting up and then gathering together back and forth over and over. So you get a lot of communication overhead from zero. But the plus part of that is that you can overlap a lot of that movement with computation. [00:40:23]Alessio: How do you get the optimal number of GPUs to do this on? Is there a way to shard too much as well and put too much overhead? [00:40:31]Quentin: It depends more on what your interconnect is. Taking a step back, there is synchronization that's required, a lot of it, across all of these GPUs. And those tend to be cumulative. So if you go to too many GPUs on an interconnect that's too slow, then you're going to end up spending more time synchronizing. And that magic number where you spend more time synchronizing is going to be different depending on what your fabric is and what your GPU memory is specifically. Just how small of a slice is each GPU getting? I can't give one number, but for example, for Summit, that number comes out to be about 20 billion parameters. Now you have 20 billion parameters, and then your magic number of GPUs for that is going to be something like 100 to 200 scale. Beyond that, you're just going to end up spending more time communicating. And the actual flops dipping below some predetermined number by you is going to be whatever your sweet spot ends up being. [00:41:24]Alessio: And then, so this one was like hard for me to go through, so I'm excited to have you run through it, which is 3D parallelism. [00:41:33]Swyx: It's fancy, it's cutting edge. [00:41:35]Alessio: Yeah, let's talk a bit more about that and some of the work. [00:41:38]Quentin: Okay, 3D parallelism. So what is each dimension? First is the really basic one. That's data parallelism. And data parallelism is you have a copy of the model. Let's say for simplicity, one copy fits on one GPU perfectly. Data parallelism is that now you have two GPUs, so you have one copy on GPU one, one copy on GPU two. Both of them do the forward and backward pass and then synchronize and average the gradients. And then that's a step. Data parallelism for 3D parallelism is actually zero.
So it's, you're sharding the optimizer states across all of your different GPUs. Next up is tensor parallelism. Tensor parallelism is you split your model. Like say, if you have two GPUs, you split your model down the middle and each GPU on its tensor specifically is going to do its forward or backward operation on its tensor. And then only when necessary, it'll synchronize that tensor operation with the other GPU. It's a bit more complex than something like pipeline parallelism, which is the third dimension. In pipeline parallelism, let's say you have four layers in your model. And you have four GPUs. You put one layer on each GPU and then GPU one does the forward pass and then sends the output of its activations to GPU two. It does the forward pass, sends activations to three, and you're just moving down a line. That is a naive scheme in that all of the other GPUs are doing nothing while a single GPU is doing its forward or backward pass. So the reason it's called pipeline parallelism is because you're splitting your mini batch into micro batches. So GPU one will do the forward pass on micro batch one and then send to GPU two. And then while GPU two is running on that first micro batch, GPU one is working on the next micro batch. And so you're sort of pipelining the movement and computation of each micro batch. The problem with that is that you need a really big batch size in order to split it up into both mini batches and micro batches. So combining all three of those together, you get a 3D mesh of where each parameter and optimizer state and so on maps to each GPU. And that's 3D parallelism. So let's start diving into details on what have that made sense, what should I jump into more on? [00:43:55]Alessio: I think the main question is, do you need all of the GPUs to be the same to do this? Or can you have mismatching GPUs as well? [00:44:03]Quentin: Okay, two things matter. If there's a difference in VRAM for the two different kinds of GPUs, then you're going to be bottlenecked by whichever GPU has the lower amount of VRAM because it's going to run out of memory. And then you can't like whatever's left on the larger GPUs is going to be empty. As far as I'm aware, there's no like GPU single GPU aware memory overhead scheme that would account for that. The second problem is that let's say all of your GPUs have the same amount of VRAM, but half of them are really slow. And the problem with that is that those synchronizations that I mentioned earlier are going to kill you. So you're going to move as quickly as your slowest GPU in that case. So in both cases, you end up regressing to your slowest or smallest GPU. So you might as well have the same GPUs for all of them. Otherwise, you're wasting the nicer ones. And that also goes to your CPUs and your interconnect. So going back to the 20 billion parameter model that Eleuther was training, that was on a cluster that was sort of Frankenstein made during COVID when there was all of that shortage of network switches and such like that. So every node had a different network switch. And so you ended up moving at the speed of the slowest switch and getting everything tuned properly so that it's not worse than the slowest switch was challenging and is like a real world problem that sometimes comes up. [00:45:28]Alessio: Is this work widely accepted? Like I hadn't learned about this before studying for this episode. Is this something that people are still trying and researching? Or is everybody just aware of this and running this in production? 
[00:45:43]Quentin: What is this specifically? [00:45:44]Alessio: Like the sharded optimizers plus the 3D parallelism, bringing the two things together and having this kind of mesh strategy. [00:45:51]Quentin: I would say that a lot of major GPT-based models use this scheme. A lot of them now are sort of going with just a pure zero scheme. So just a pure sharded. You just shard everything. And then since that's so easy, everyone gets an equal slice. There's no such thing as a pipeline stage. There's no such thing as what tensor should go on which GPU. Instead, we shard everything equally and treat everything equally. It's a much easier problem to debug, to checkpoint, to run training on than it is with this 3D parallel scheme. I say 3D parallel gives you the most control and also the most ways to go wrong. And depending on whether you have more engineers or whether you have more GPUs, that should decide which of these you go with. [00:46:35]Swyx: It's also not too hard, right? You've basically outlined the five or six different numbers that you need to keep in your head. And it doesn't feel impossible that if you need to achieve that level of control, you've given everybody the main levers to do it with. And that's wonderful. Definitely. [00:46:51]Quentin: The problem that comes up is like, say, like, okay, GPT-4 came out. Now we have VLLMs. [00:46:57]Swyx: Whoa, what are VLLMs? Oh, okay. Virtual LLMs, like the Mixture of Experts things? No, like visual. [00:47:03]Quentin: So now you have like multimodal models and such. How do you distribute that? Do you distribute it in a pipeline stage? And do you just shard it? Do you split the tensor and make a tensor parallel? It's sort of hard to change your model and add new features and such when you have this 3D parallel scheme. That's when I say hard. I mean, it's hard to sort of adapt and modify it to new features. [00:47:26]Alessio: I know we're at the hour mark, and I think we put our listeners through a very intense class today. So this was great, Quentin. And we're going to definitely link the article so that people can read it and follow along. Any other research that you're working on in this space that you want to shout out? I know one of our usual, I mean, wrong question is, what's the most interesting unsolved question in AI? So curious to hear if you think it's still on the training inference, math optimization, or are there more areas that people should pay attention to? [00:47:58]Quentin: I think in my area of research, there are two things that I think people should really care about. And the first is multimodal parallelism and RLHF. We're seeing more and more reinforcement learning coming into the training loop. And so how do you split that, so some GPUs are working on inference and some GPUs are working on training? And like I mentioned before, you have to relearn everything and they have very unique challenges. How do you split up a KV cache during training, for example? Those are challenges that are not well studied, I don't think. And then multimodal, you have like maybe a vision transformer and a text transformer. How do you split those up? Do you split them up equally? Do you put them on separate GPUs or do you just shard everything? And just maybe one GPU will have some vision, some text parameters. And then the second case I would say is that communication is very often a bottleneck. So we talk about 3D parallelism, but a lot of those like, for example, tensor parallelism, you can't go across nodes with.
You'll just get killed in communication. So what I'm getting to is how should you compress your communication before it happens? So on the fly compression, you have some buffer that needs to be communicated. You compress it with a GPU kernel, then you send it across the network and then you decompress it, something like that. Making people spend less money on communication fabrics and more on GPUs as intended is sort of a thing that people need to explore. I think those are my two. [00:49:26]Alessio: Sean, do you want to go over the other half of the lightning round before we wrap it up? [00:49:30]Swyx: That's a good brain dump. Cool. Yeah, I have so many more questions on the multimodal stuff, but that should be for another time. Acceleration, what has already happened in AI that you thought would take much longer? [00:49:42]Quentin: I would say flash attention. Guys, just talk to Tri. And flash attention is just sort of a really great set of kernels that I thought would take a while to get to us. [00:49:51]Alessio: Well, Quentin, thank you very much, man. This was super informative and I think hopefully helps demystify a little bit the blog post. I think people open it and it's like a lot of math on it. And I think you walking them through it was super helpful. So thank you so much for coming on. [00:50:07]Swyx: Of course. [00:50:08]Quentin: And I'm happy to answer any questions that people have offline if they have them. I do read my email. [00:50:13]Swyx: Email and Discord. Of course, yeah. [00:50:15]Quentin: Discord I'm even faster on. [00:50:16]Alessio: Thank you, everyone. [00:50:18]Swyx: Thanks, Quentin. [00:50:19] Get full access to Latent Space at www.latent.space/subscribe

Digital Foundry Direct Weekly
DF Direct Weekly #117: Final Fantasy 16 Review Post Mortem, Nintendo Direct Reaction

Digital Foundry Direct Weekly

Play Episode Listen Later Jun 28, 2023 89:33


This week, John discusses the reaction to the Final Fantasy 16 reviews, the focus on the 720p-1080p performance mode and whether the game was actually targeting 30fps all along. Meanwhile, the team are impressed by the latest Nintendo Direct, Xbox raises prices, while the lack of DLSS in AMD sponsored PC titles once again comes to the fore. Also: to celebrate our 117th edition, the team share their favourite Halo memories.

00:00:00 Introduction
00:00:49 News 01: Final Fantasy 16 review reaction!
00:23:19 News 02: June Nintendo Direct impressions
00:47:37 News 03: Microsoft raises Series X, Game Pass pricing
00:56:26 News 04: AMD blocking DLSS in sponsored PC titles?
01:04:32 News 05: Quake 2 Remastered leaks!
01:08:14 Supporter Q1: Could GPU makers offer cards with user-expandable VRAM?
01:10:27 Supporter Q2: Could a ‘PS5 Pro' come in at a boundary-pushing price, like $1000 USD?
01:15:14 Supporter Q3: Would automatic frame-rate capping for games on VRR displays work?
01:18:10 Supporter Q4: Of the Halo games, which is your favourite? And which impresses most from a technical perspective?
01:25:29 Supporter Q5: How exciting is it that Todd Howard praised DF in a recent podcast appearance?

Learn more about your ad choices. Visit megaphone.fm/adchoices

Digital Foundry Direct Weekly
DF Direct Weekly #116: No Xbox 'Pro' Console? Starfield 30FPS Reaction, PS5 Cloud Streaming

Digital Foundry Direct Weekly

Play Episode Listen Later Jun 19, 2023 91:18


The latest DF Direct Weekly 'drops' with John, Rich and Alex reuniting at the mics after a massive week of not-E3 showcase news and fallout. Phil Spencer appears to be ruling out a mid-gen 'Pro' console update for the Xbox Series machines, Matt Booty says that Xbox One first-party game development is finally over, while Sony reveals that PS5 game streaming is coming to its subscription service. Meanwhile, the Starfield 'storm in a teacup' 30fps discourse continues, while Alex comes face-to-face (in a figurative sense) with his more youthful, optimistic self, posing yet more 'Battaglia time travel' discussion.

00:00:00 Introduction
00:01:05 News 01: No Xbox mid-gen enhanced machine planned?
00:21:17 News 02: Microsoft no longer developing for Xbox One
00:32:08 News 03: TLOU PC patch 1.1 analysed!
00:44:47 News 04: Sony announces PS5 cloud streaming
00:55:41 Supporter Q1: Should Starfield have an uncapped VRR mode on consoles?
01:05:19 Supporter Q2: I didn't like the 30fps option in FF16, but Tears of the Kingdom seems fine - why could this be?
01:07:26 Supporter Q3: Alex was once excited about the possibilities of DX12 several years ago. Does he miss that version of himself?
01:12:05 Supporter Q4: Could systems with unified memory setups be a solution to VRAM woes on PCs?
01:15:32 Supporter Q5: Are FSR 2's image quality concerns an issue for console software going forward?
01:22:20 Supporter Q6: If you could alter one thing about game development, what would it be?

Learn more about your ad choices. Visit megaphone.fm/adchoices

PC Perspective Podcast
Podcast #727 - 4060 Launch Moves Up, AMD AI, GPU Sales Plummet, Thermaltake Ceres 500, Clippy! and MORE!

PC Perspective Podcast

Play Episode Listen Later Jun 17, 2023 103:01


It was another week, and that means more pcper podcast goodness. Josh had his very, very old headset and a bad laptop sound card on the road with him, Kent was back again with a fab case review, and we talked about a lot of other tech and newsy stuff. We even had Clippy. 00:00 Prologue and Intro 02:38 Burger of the Week 04:26 RTX 4060 (non-Ti) release date moves up 09:40 AMD's EPYC Bergamo CPUs 14:07 AMD ups their AI game (and more AI discussion) 24:31 Microsoft offering official Surface parts 27:31 Josh interrupts to talk about AMD EPYC some more 28:57 Are larger cards the key to more VRAM? (satire) 30:15 Moving 12VHPWR to the back of the GPU 32:42 Kent uses a 4090 as wall art 34:00 Just in time, PCI-SIG is working on PCI Express 7.0 37:11 Mandatory Arc coverage 40:32 Desktop GPU sales lowest in decades (and much rambling) 49:27 Clippy shame (and Kent's LGR story) 53:07 Security Corner 1:03:57 Gaming Quick Hits 1:09:39 Fractal Terra corrections 1:12:39 Thermaltake Ceres 500 TG ARGB case review 1:26:35 Picks of the Week 1:42:43 Outro ★ Support this podcast on Patreon ★

The Vorthos Cast
251 - Obscure Lord of the Rings Facts with Don Marshall!

The Vorthos Cast

Play Episode Listen Later Jun 12, 2023 79:11


We knew we had to have Don Marshall (@DonMarshall72 on TikTok and every other platform) on the show as soon as we knew we were going to be talking about Lord of the Rings! Don is an incredible source of Lord of the Rings lore facts on TikTok, and also just a delightful human being. In what turned into one of our longest episodes yet, we talk about Lord of the Rings, Magic: the Gathering, a little Warhammer, an aside about what it means to be positive in a culture so inundated with negativity, and then Loreley gives a full plot summary of Foulmere, the Magic short story that gives us the planeswalker Vram. If you want to follow Don, you should! You can find him on TikTok at www.tiktok.com/@donmarshall72! He's got some really rad pride merch with profits benefiting the Trevor Project and you can buy it by following his linktree: linktr.ee/DonMarshall72 If you want to support our podcast, you can do that on Patreon! Find us at Patreon.com/TheVorthosCast. We appreciate all of you so much. You're all strands of Galadriel's hair to us.

Broken Silicon
209. Nvidia GPU Tiles, Zen 5 Threadripper Laptops, Intel Meteor Lake | Metrology Eng

Broken Silicon

Play Episode Listen Later Jun 9, 2023 114:33


A Megascanning expert joins to discuss Nvidia Laptop VRAM, AMD Strix APUs, and more! [SPON: Get 10% off Tasty Vite Ramen with code BROKENSILICON: https://bit.ly/3wKx6v1 ] [SPON: dieshrink = 3% off Everything, brokensilicon = 25% off Windows: https://biitt.ly/shbSk ] 0:00 What is a "Metrology Engineer"? 7:05 What are challenges in Megascanning? How big is Too Big for a Laptop? 18:49 Laptop VRAM Stagnation, AMD abandoning Threadripper 38:44 How does VRAM truly limit professionals? 48:48 Is Meteor Lake or Strix with 256GB of RAM strong enough? 1:03:30 Threadripper Laptops - An overlooked use for Zen 5 Strix Halo? 1:14:58 Nvidia GPU Tiles in Panther Lake - The only way to save Market Share? 1:33:41 Do Professionals like E-Cores? Could Raptor Lake have been better? Check out Brad's Podcast: https://www.youtube.com/@notyourtypicalitguys8834 https://www.linkedin.com/in/brad-medlin-418a5967/ Leak that OEMs are dropping Nvidia in laptops: https://youtu.be/GhLMbFgVfbY AMD Strix Halo Leak: https://youtu.be/-pQEdpMCrdU https://www.nvidia.com/en-us/studio/compare-gpus/ https://www.dell.com/en-us/shop/cty/pdp/spd/precision-17-7780-laptop/s001p7780usvp?gacd=9684992-1105-5761040-266906002-0&dgc=ST&gad=1&gclid=Cj0KCQjw4NujBhC5ARIsAF4Iv6fdJlz5009CW9DB8ozsj_lbMhP1g6uLrHB-EOJgtfr7fBWFUt-Q248aAhS9EALw_wcB&gclsrc=aw.ds https://www.notebookcheck.net/NVIDIA-Quadro-RTX-5000-Laptop-Graphics-Card.423752.0.html https://www.ebay.com/p/14052546406 https://www.amd.com/en/product/13036 https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i71065g7-processor-8m-cache-up-to-3-90-ghz.html https://news.skhynix.com/sk-hynix-enters-industrys-first-compatibility-validation-process-for-1bnm-ddr5-server-dram/

The WAN Show Podcast
I'm sure you have questions..... - WAN Show May 19, 2023

The WAN Show Podcast

Play Episode Listen Later May 22, 2023 226:00


Go to https://babbel.com/WAN for 55% off your subscription Visit Newegg at https://lmg.gg/newegg Enable your creative side! Check out Moment at https://lmg.gg/ShopMoment Timestamps (Courtesy of NoKi1119) Note: Timing may be off due to sponsor change: 0:00 Chapters 0:48 Intro 1:14 Topic #1 - Linus steps down 2:55 Linus's past suggestions for LMG solutions 5:18 Luke on Teams notifs, Linus workshop on storyboarding 10:56 Discussing writing knowledge-based articles 11:48 Linus on disruptive shooting behavior, mentions SC 14:05 "Chief Vision Officer" Discussing LMG teams 16:53 Linus flabbergasted at MarkBench's progress 21:03 Luke asks about oversaturation 23:51 Ideas for LTT Labs, SC update, RF chamber's cost 27:28 Linus explains "moat building," quoting DMs about FP 29:06 Community response, Linus on difficult administrative 32:39 Topic #2 - Free TV! that spies on you 34:01 Charged $500 if opting out of tracking ft. awkward high five 35:02 Linus on privacy concerns 36:37 Ad-supported V.S. business revenue 40:24 Linus asks "What's next?" ft. Ads revenue, domains 48:30 Linus recalls getting charged for overdraft 51:28 LTTStore's presale LABS #FIRST shirt & hoodie 53:26 LTTStore notebooks back in stock 53:32 Merch Messages #1 54:48 Thoughts on slow drop of 4070 laptop GPUs? 56:49 Would sponsors change? ft. Discussing Terren Tong 1:14:52 LTTStore screwdriver Noctua edition 1:16:48 Showcasing the screwdrivers in person 1:23:47 Luke on how the CEO would handle leaks 1:24:03 Topic #3 - Schools struggle with ChatGPT 1:24:48 Chegg claims ChatGPT hurt their revenue 1:25:36 "Plagiarism," Texas professor fails half of students 1:27:25 Should schools be responding to AI? ft. Comment bots 1:30:26 Linus struggles with calling his contacts 1:33:14 Topic #4 - Imgur's NSFW & old content purge 1:36:57 How should we be retaining internet history? 1:40:07 "Convert the moon into a server! NUKE THE MOON!" 1:42:34 Sponsors 1:45:46 Merch Messages #2 1:46:40 Would it matter if I like or finish a video on FP? 1:49:19 Which was first - CVO Linus or CTO Luke? ft. Linus trolling Dan 1:51:19 Decision of overturning the CEO? ft. Water bottle "ideas" 1:54:38 Is it possible to run an OS on a GPU's VRAM? 1:56:29 Topic #5 - Google's controversial domain extensions 1:57:32 Linus on Google Search, Luke shows file link V.S. domain 1:59:56 Reasons for doing this, what is the point? 2:03:50 Topic #6 - Toyota exposes live location ft. Dad jokes 2:05:43 Topic #7 - Roblox doesn't protect kids from ads 2:06:56 CARU's report, FTC complaint & Roblox's response 2:08:42 Kids "budgeting," Luke on cosmetic costs 2:16:30 Topic #8 - Valve sued by Immersion due to the rumble 2:17:42 OCBASE/OCCT 20th "Stableversary" giveaway 2:18:52 Topic #9 - Overwatch 2 cancels PvE mode 2:29:37 Merch Messages #3 2:30:58 How did you get Terren? Compensation? 2:34:40 Any project Linus is excited to work on? ft. Nuke fab 2:37:48 Most costly things you misplaced or lost? 2:42:28 What worried Linus the most when stepping down? 2:43:19 Thoughts on Kyle ending Bitwit? 2:46:50 Most challenging part of being a CVO? 2:48:04 Is cosplay going to be allowed in LTX? 2:49:02 Favorite FP info that Linus leaked, and its downsides? 2:50:42 Defunct company you would revive to thrive? 2:54:48 Does Linus plan to have time for non-LMG stuff? 2:56:10 Skill sets you'd like to improve on your new position? 2:59:55 Has Linus debated how much of his life should be on the show? 3:04:40 Is putting a camera to monitor your computer too far? 3:12:36 Do you see WAN Show outliving you as hosts?
3:22:46 Has LTT Labs always been the end goal of LTT? 3:23:24 Best memory Linus made with his Taycan? 3:25:26 How would past Linus react to what LMG became today? 3:29:28 How did Linus prepare? 3:31:32 What did Terren teach young Linus that stuck? 3:37:00 Minimum age of lifeguarding has been lowered 3:44:11 Any leadership minds that inspired you? 3:46:10 What if the new CEO stopped you from being who you are? 3:46:28 Outro

PC Perspective Podcast
Podcast #722 - 16GB RTX 4060 SKU, Big Hard Drives Failing Faster, DDR4 Price Drops Again, Hacking & AI + more!

PC Perspective Podcast

Play Episode Listen Later May 14, 2023 61:26


Another podcast has occurred, and we had more to talk about than originally anticipated. Such as that MSI hack which keeps on giving, and if you've got a flickering monitor - check your chair! MS and AMD vs Nvidia, large hard drives die faster, PC game making is actually profitable, and does real gaming need 16GB of VRAM? Topics in the time stamps below. Timestamps: 00:00 Intro 01:24 Food with Josh 03:23 RTX 4060 Series may have a 16GB version (and more VRAM discussion) 15:38 Large hard drives live shorter lives? 20:22 Microsoft lets users pay for beta AI in Office 365 23:31 DDR4 prices may be going down again 26:20 MS and AMD team up to fight NV at AI 30:26 An alarming Google Pixel Pixie story 32:52 Coldplay lyrics found in Kingston SSD firmware 33:35 Gaming chair ESD blues 38:30 Podcast sponsor - Bloomberg Jobs 39:46 Security Corner 47:36 Gaming Quick Hits 50:03 Picks of the Week 57:19 Outro (with recent livestream discussion) ★ Support this podcast on Patreon ★

Broken Silicon
204. RX 7600 8GB Pricing, AMD Zen 5 IPC, Intel 14th Gen, Arrow Lake Kills RTX 4050

Broken Silicon

Play Episode Listen Later May 9, 2023 152:17


We discuss the latest AMD RDNA 3, Intel CPU Architecture, and Nvidia Blackwell news! [SPON: dieshrink = 3% off Everything, brokensilicon = 25% off Windows: https://biitt.ly/shbSk ] [SPON: Get 6% OFF Custom PCs & GPUs at https://www.silverknightpcs.com/ w/ “brokensilicon”] [SPON: Get 10% off Tasty Vite Ramen with code BROKENSILICON: https://bit.ly/3wKx6v1 ] 0:00 Intel Codenames Galore, Content Overload (Intro Banter) 5:23 Nintendo's "Lateral Thinking", HDDs vs Tape Drives (Corrections) 16:21 AMD AM5 CPU Burning Issues Investigated 29:44 AdoredTV Leaks “Ladder L3” for Zen 5, DigiTimes confirms MLID Leaks 42:17 RX 7600 8GB targeting below 300 Leak, Navi 33 Recap 1:00:06 AMD Q1 2023 Earnings 1:09:57 Intel Q1 2023 Earnings 1:17:55 Intel 14th Gen Raptor Lake-R, Meteor Lake Ultra, and Arrow Lake Leaked 1:43:13 Adamantine, Dragon Range-X vs Meteor Lake, AMD Complacency 1:50:16 Nvidia Blackwell Delay Rumors - 3nm Woes, or Ampere Oversupply? 2:01:55 MS-Activision Blocked, Phoenix Z1 Extreme, RX 7950 XTX (Wrap-Up) 2:09:00 RDNA 4 VRAM, Broken Silicon 1, MLID Channel Goals (Final Reader Mail) https://www.techpowerup.com/301867/seagate-mechanical-hdd-with-u-2-nvme-interface-pictured-signals-the-decline-of-sas-12g https://youtu.be/kiTngvvD5dI https://youtu.be/3FDh9C59Z1A https://www.hardwaretimes.com/amd-ryzen-8000-cpus-to-be-based-on-4nm-node-not-3nm-5th-gen-epyc-to-get-3nm-rumor/ https://www.digitimes.com.tw/tech/dt/n/shwnws.asp?CnlID=1&Cat=40&id=0000662749_VR76FCFB51XKH38BW2VPS https://www.youtube.com/live/umJQXe5haa0?feature=share&t=736 https://www.newegg.com/sapphire-radeon-rx-6700-11321-02-20g/p/N82E16814202424?Description=RX%206700&cm_re=RX_6700-_-14-202-424-_-Product&quicklink=true https://twitter.com/mooreslawisdead/status/1651351480094408706 https://twitter.com/mooreslawisdead/status/1651489123528581120 https://community.amd.com/t5/gaming/game-beyond-4gb/ba-p/414776 https://ir.amd.com/news-events/press-releases/detail/1128/amd-reports-first-quarter-2023-financial-results https://www.anandtech.com/show/18845/amd-reports-q1-2023-earnings-back-into-the-red-as-client-sales-crumble https://youtu.be/Qa-ZAyQOviY https://www.anandtech.com/show/18839/intel-reports-q1-2023-earnings-a-record-losing-quarter-goes-better-than-expected https://wccftech.com/intel-announces-layoffs-after-paying-1-5-billion-in-q1-dividends/ https://youtu.be/GhLMbFgVfbY https://www.digitimes.com.tw/tech/dt/n/shwnws.asp?CnlID=1&Cat=40&id=0000662749_VR76FCFB51XKH38BW2VPS https://wccftech.com/nvidia-next-gen-3nm-gpus-not-launching-until-2025-tsmc-report/ https://youtu.be/8PVYOeHx8vA https://www.reuters.com/markets/deals/uk-blocks-microsoft-69-bln-activision-deal-over-cloud-gaming-concerns-2023-04-26/ https://www.techspot.com/news/98578-microsoft-might-partnering-amd-ai-chips.html https://videocardz.com/newz/nvidia-forces-msi-to-unlaunch-geforce-rtx-3060-ti-super-3x-series-after-just-one-week https://twitter.com/Kepler_L2/status/1651341246055497732 https://videocardz.com/press-release/amd-announces-ryzen-z1-zen4-apus-for-handheld-gaming-consoles-with-up-to-12-rdna3-gpu-cores https://videocardz.com/newz/amd-introduces-ryzen-7040u-phoenix-low-power-apus-with-up-to-8-zen4-cores-and-12-rdna3-cus https://videocardz.com/newz/nvidia-geforce-rtx-4060-ti-ad106-350-gpu-has-been-pictured https://www.techspot.com/news/98550-amd-confirms-mainstream-rdna-3-gpus-before-summer.html https://twitter.com/Bernard_P/status/1653022115367645190?s=20&fbclid=IwAR29SiFCOdRzQ-QmbSJWpOALAa8-GrqZREXxzD72LT_Iif_eRsYvACP7_uM

Talk to the Internet
Nobody Wants Jedi Survivor and Redfall - Inside Games

Talk to the Internet

Play Episode Listen Later May 4, 2023 22:54


Thanks to HelloFresh for sponsoring today's video. Go to https://strms.net/InsideGamesHelloFreshMayYT and use code POGINSIDEMAY16 to get 16 Free Meals plus Free Shipping! Support Inside Games! Patreon: https://www.patreon.com/insidegamesYT YouTube Membership: https://www.youtube.com/channel/UCFHQlasvjQ0JMOHoKOz4c0g/join Hosted by: Lawrence: http://twitch.tv/sirlarr | Bruce: http://twitch.tv/brucegreene Edited by: ShooklynTV: https://twitter.com/ShooklynTV Written by: Lawrence Sonntag & Brian Gaar: https://www.twitch.tv/briangaar Sources -- [reddit, r/gaming] https://bit.ly/3p58WJH [Metacritic] Redfall, Xbox Series X - https://bit.ly/3HHkgSR [Metacritic] Redfall, PC - https://bit.ly/3LWnp3x [GameReactor] Redfall - https://bit.ly/41lL1mV [Checkpoint Gaming] Redfall Review – A bloody disappointment - https://bit.ly/3pcszQc [Steam] Redfall Reviews - https://bit.ly/42Iyo6v [Steam, Karan76k] https://bit.ly/3NFQnpU [Reddit, r/gaming] https://bit.ly/3NGqUfY, https://bit.ly/42ij8wY [Xbox Wire] Get Ready for the Xbox Games Showcase and Starfield Direct Double Feature Airing June 11 - https://bit.ly/3AWwAL8 [YouTube, Digital Foundry] Star Wars Jedi Survivor PC Review: The Worst Triple-A PC Port of 2023... So Far - https://bit.ly/3VvbQ6J [Tom's Hardware] Pre-Launch 'Jedi: Survivor' Eats 21GB of VRAM, Struggles on RTX 4090 - https://bit.ly/3LvmyWh [Twitter, EA Star Wars] https://bit.ly/3Lwvs5P [BBC] Cyberpunk 2077: Sony pulls game from PlayStation while Xbox offers refunds - https://bbc.in/44vZxLj [Eurogamer] The Last of Us' many PC glitches are being turned into memes - https://bit.ly/3LY5d9Y [GameRant] Ubisoft Responds to 'Assassin's Creed Unity' Frame Rate Issues - https://bit.ly/3HFaKzw [GameRant] Todd Howard Says Fallout 76 Was a Let Down - https://bit.ly/3LXPtn2 [GameRant] Sony Offering WWE 2K20 Refunds for 'Broken' Game - https://bit.ly/3LWp35a [PCGamer] PC gamers are getting really, really fed up with one sh*tty port after another - https://bit.ly/3VCuFVt

Broken Silicon
201. RX 7800 XT, RTX 4060 Ti 8GB vs RX 7700 XTX 16GB, PS4 EOL | Infinity Ward Dev

Broken Silicon

Play Episode Listen Later Apr 18, 2023 138:46


A VFX Artist from Infinity Ward joins to discuss ongoing VRAM issues & developing AAA games! [SPON: Skip the waitlist & Invest in blue-chip art: https://www.masterworks.art/mooreslawisdead ] [SPON: dieshrink = 3% off Everything, brokensilicon = 25% off Windows: https://biitt.ly/shbSk ] [MASTERWORKS DISCLOSURE: Masterworks works allows you to purchase shares in great masterpieces from artists like Pablo Picasso, Banksy, Andy Warhol, and more. The process is simple: -Create your account with your traditional bank account -Pick major works of art to invest in or our new blue-chip diversified art portfolio -Identify investment amount -Hold shares in works by Picasso or trade them in our secondary marketplace See important Masterworks disclosures: https://www.masterworks.com/about/disclaimer?utm_source=mooreslawisdead&utm_medium=youtube&utm_campaign=4-15-23&utm_term=Moores+Law+is+Dead+Viewer&utm_content=disclaimer ] 0:00 Who is Chris? What is using up more DRAM in recent releases? 15:36 Are 12GB GPUs already in trouble? What's filling up VRAM? 35:26 RX 7800 XT 16GB – Methods of configuring a 4070 killer! 44:02 RTX 4060 Ti 8GB vs RX 7700 XTX 16GB, Future of Used 8GB Cards 50:51 Did Nvidia stop talking to Game Devs about what they need? 56:11 Ending PS4 Support - How will this affect games? 1:10:20 4K120Hz - The "New Standard" for Call of Duty Optimization 1:20:22 Cross-Gen Development, GPU Optimization, Engine Bottlenecks 1:54:49 Direct Storage, MW Install Sizes, 500GB Future Games 2:01:35 Dealing with Online Hackers Previous Time Guest was on: https://youtu.be/239ntwGzg7c https://www.linkedin.com/in/cjreaytechart/ https://www.techspot.com/review/2657-amd-ryzen-7800x3d/ https://youtu.be/Rh7kFgHe21k https://youtu.be/Isn4eLTi8lQ https://www.amd.com/en/products/professional-graphics/amd-radeon-pro-w7800 https://twitter.com/mooreslawisdead/status/1646631144689917954 https://cdn.mos.cms.futurecdn.net/frqtQnBW5427ACgTzxvQJf-1200-80.png.webp https://youtu.be/DKt7fmQaGfQ https://www.techspot.com/review/2663-nvidia-geforce-rtx-4070/

Broken Silicon
199. Death of 8GB GPUs, RX 7600 XT VRAM, AI Taking Jobs, Ray Tracing | UE5 Developer

Broken Silicon

Play Episode Listen Later Apr 4, 2023 137:29


An Unreal Engine 5 dev joins to discuss what's happening with VRAM usage in modern games. [SPON: Get 6% OFF Custom PCs & GPUs at https://www.silverknightpcs.com/ w/ “brokensilicon”] [SPON: dieshrink = 3% off Everything, brokensilicon = 25% off Windows: https://biitt.ly/shbSk ] 0:00 Welcoming our "Half-Anonymous" UE5 Developer 10:05 The State of Modding in Modern Games, SDKs and Gatekeeping 16:51 How AI is Changing Game Development, Evolution of Unreal Engine 29:36 Will AI replace Game Developer Jobs? 44:16 What do Developers need more VRAM for? 47:35 Are 8GB Graphics Cards running out of time? 1:03:20 What do Nvidia Engineers think about 8GB GPUs? What about RTX IO? 1:09:00 What Minimum Specs do Devs wish they could mandate? 1:16:43 What happens when Devs drop PS4? What do they wish Min Specs were? 1:23:40 Should AMD launch the RX 7600 XT with 16GB? 1:30:07 When will Ray Tracing become standard in most games? 1:42:42 Software vs Hardware Accelerated Lumen, UE5 TSR vs DLSS & FSR 1:50:26 E-Cores for gaming, Unreal Engine Monopoly, Godot 1:59:23 What's the next Big Tech to advance Gaming? Sony Santa Monica Dev Broken Silicon: https://youtu.be/239ntwGzg7c Comprehensive MLID Zen 5 Leak: https://youtu.be/MJr4o8qqAqA https://www.techpowerup.com/review/forspoken-benchmark-test-performance-analysis/5.html https://www.techspot.com/review/2627-hogwarts-legacy-benchmark/ https://www.techpowerup.com/review/resident-evil-4-benchmark-test-performance-analysis/4.html https://youtu.be/3q-eED0-vMc https://youtu.be/_lHiGlAWxio https://youtu.be/Hpd7nCzeFRs https://www.techpowerup.com/review/dead-space-benchmark-test-performance-analysis/5.html https://docs.indreams.me/en-US/game-info

The Full Nerd
Episode 251: VRAM Problems, Bad PC Ports, Path Tracing & More

The Full Nerd

Play Episode Listen Later Apr 4, 2023 96:40


Join The Full Nerd gang as they talk about the latest PC hardware topics. In this episode the gang is joined by special guest Brad Shoemaker of Nextlander/Tech Pod fame to chat about how much VRAM is necessary for PC gaming, the onslaught of bad PC ports recently, the future of real-time path tracing, and of course we answer your questions live! Buy The Full Nerd merch: https://crowdmade.com/collections/pcworld Join the PC related discussions and ask us questions on Discord: https://discord.gg/SGPRSy7 Follow the crew on Twitter: @GordonUng @BradChacos @MorphingBall @KeithPlaysPC @AdamPMurray Follow PCWorld for all things PC! ---------------------------------- SUBSCRIBE: http://www.youtube.com/subscription_center?add_user=PCWorldVideos TWITCH: https://www.twitch.tv/PCWorldUS TWITTER: https://www.twitter.com/pcworld

Adafruit Industries

Twentieth Anniversary Macintosh - Limited-edition Apple computer (1997) The Twentieth Anniversary Macintosh (or "TAM") is a limited-edition personal computer released in 1997 to mark Apple's 20th birthday. The TAM was announced almost 20 years to the day after Jobs and Wozniak incorporated the company. The TAM featured a 250 MHz PowerPC 603ev processor and a 12.1" active matrix LCD powered by an ATI 3D Rage II video chipset with 2 MB of VRAM, capable of displaying up to 16-bit color at either 800x600 or 640x480 pixels. It had a vertically mounted 4x SCSI CD-ROM and an Apple floppy SuperDrive, a 2 GB ATA hard drive, a TV/FM tuner, an S-Video input card, and a custom-made Bose sound system including two "Jewel" speakers and a subwoofer built into the externally located power supply "base unit". A thick "umbilical" cable connects the base unit to the head unit, supplying both power and communications for the subwoofer. This one was a lot of fun to photograph and put together - check out the video for some rare footage! https://en.wikipedia.org/wiki/Twentieth_Anniversary_Macintosh https://support.apple.com/kb/SP408?locale=en_US #Marchintosh #apple #history
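As a quick sanity check on that 2 MB VRAM figure, here is my own back-of-the-envelope arithmetic (not from the Adafruit post or Apple's spec sheet): a single uncompressed frame buffer is width × height × bytes per pixel, and at the TAM's listed modes it fits with room to spare.

# Raw frame-buffer sizes for the TAM's listed display modes; bytes = width * height * bits / 8.
# 8-bit depth is included only for comparison; the listed maximum is 16-bit.
MIB = 1024 * 1024


def framebuffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Size of one uncompressed frame buffer at the given resolution and color depth."""
    return width * height * bits_per_pixel // 8


for width, height in [(640, 480), (800, 600)]:
    for depth in (8, 16):
        size = framebuffer_bytes(width, height, depth)
        print(f"{width}x{height} @ {depth}-bit: {size / MIB:.2f} MiB")

# 800x600 at 16-bit works out to roughly 0.92 MiB, so a single frame sits comfortably
# within the Rage II's 2 MB of VRAM at the modes Apple listed.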