We're announcing AIEWF speakers this week! Take the AI Engineering Survey!

Today's guest, Ethan, first joined us for the LS Paper Club as the lead on the NVIDIA Cosmos World Model, but then joined xAI and built Grok Imagine in 3 months.

He comes back on Latent Space with some nuclear hot takes: that video models primarily get their intelligence from LLMs, not from training on video data, and that the next frontier for truly interactive, realtime, long-horizon world models is to work on LLMs (perhaps Interaction Models as well…).

Put it this way: in the near term, the next Sora won't be a better video model, but a video agent.

Generative media may more closely follow the evolution of AI coding, which went from focusing on one-shot output performance and cost to multiturn reasoning and planning models for agents and systems that can plan, edit, test, debug, and submit PRs. At a certain point, coding models got so good that the only significant next step to improve performance was handling the orchestration of these models.

Now, as the performance of video models increases significantly across realism, consistency, and prompt adherence while becoming more cost efficient, the next evolution of video generation may also be systems that can plan, generate, edit, critique, and iterate across an entire creative task.

In this episode, Ethan joins swyx and Vibhu to unpack what it actually takes to build frontier image and video systems: data, VAEs, diffusion transformers, audio-video alignment, inference speedups, and the hidden cost of storing and moving massive video datasets.
From building NVIDIA's Cosmos world model to joining xAI as Grok Imagine was being built from zero to one, Ethan He has been at the center of some of the most important work in video generation, multimodal models, and real-time world models.

We go deep on Grok Imagine, how a small xAI team shipped its first multimodal video model in three months, why iteration speed matters more than almost anything in model development, and why many of the biggest gains come from fixing tiny bugs in data and training pipelines.

Flipbook: The future of Videomaxxing

Video agents are almost a sure bet to be the trend in the coming year. We end with a glance at what's beyond video agents.

Flipbook caused a minor sensation this year when it was released, but most treat it as a fun demo. Ethan takes it very seriously: with the speed and cost of inference coming down every year, the future of custom video JIT UI is closer than you think.

We talked about why videogen models may become the front end of AI, how generative UI could replace traditional HTML/CSS, why world models need to be real-time, interactive, and long-horizon, and why the future of video generation may depend more on language models and agents than on diffusion alone.

We discuss:

* Why fast iteration mattered more than meetings
* Why small training bugs can drive huge model quality gains
* Why coding models may make compute the bottleneck again
* How image and video models are trained with synthetic captions
* The role of VAEs and latent space in frontier video models
* Why image models are the foundation for video models
* The tradeoff between temporal compression and real-time interactivity
* Flipbook, Neural OS, and the future of generative UI
* Why future interfaces may go from user intent to pixels
* The hidden cost of training video models: storage, egress, and GPU hours
* How step distillation and consistency models (like OpenAI sCM) make video inference orders of magnitude faster
* Grok Imagine 0.9 and large-scale audio-video generation
* Why audio-video alignment is harder than text-video alignment
* Ethan's definition of world models
* Reference-to-video, video extension, and long-context video generation
* Why xAI's research communication undersells Grok Imagine
* How xAI culture shaped the speed of development
* AI watermarking, SynthID, and detecting generated media
* Why prompt rewriting matters for video models
* Grok Imagine Agent and the rise of video agents
* Why language models may unlock better video generation
* Robotics, physical AI, and embodied world models
* Why Ethan left xAI and shifted focus toward LLMs
* Self-managed context, memory, and the next frontier for language models

Ethan He

* LinkedIn: https://www.linkedin.com/in/ethanhe42
* X: https://x.com/EthanHe_42

Timestamps

00:00:00 Introduction
00:01:25 From NVIDIA Cosmos to xAI
00:03:24 Building Grok Imagine from Zero to One
00:10:07 How Image and Video Models Are Trained
00:18:53 Video Compression, VAEs, and Real-Time Tradeoffs
00:22:10 Generative UI, Flipbook, and Neural OS
00:32:10 The Cost of Training Large Video Models
00:37:04 Distillation, GANs, and Fast Video Inference
00:41:21 Audio-Video Generation and Grok Imagine 0.9
00:48:34 What Makes a World Model?
00:55:51 Reference Videos, Long Context, and Video Memory
01:00:11 xAI Culture, Research, and First-Principles Building
01:09:45 AI Safety, Watermarking, and Prompt Rewriting
01:13:10 Video Agents and AI-Assisted Creation
01:27:32 Why Language Models Unlock Better Video
01:31:15 Robotics, Physical AI, and Embodied World Models
01:32:38 Why Ethan Left xAI
01:34:16 Self-Managed Context and the Future of LLMs
01:38:43 Ethan's Career Path and Closing Thoughts

Transcript

Introduction: Ethan He, Latent Space, and the Path to xAI

Swyx [00:00:00]: We're here in the studio with Ethan He, most recently of xAI. Welcome.

Ethan [00:00:10]: Thank you. Glad to be here.

Swyx [00:00:11]: We're also here with Vibhu.
You first came to us, or joined the Latent Space world, because you were working on Cosmos at NVIDIA, and you did a paper. We loved it. You presented it as well, so thank you for doing that.

Ethan [00:00:23]: Actually, I also presented on MoEs twice at Latent Space.

Swyx [00:00:29]: How did you actually hear about us? Did we reach out to you? Is that how it worked?

Ethan [00:00:33]: No, actually, it was the community. I realized there is this online community where people talk about AI and learn from each other through papers every week at the Paper Club. It's very nice.

Ethan [00:00:49]: I learned a lot.

Swyx [00:00:49]: I think three years nonstop. We haven't stopped even on Christmas and New Year's. Many weeks I want to stop, but it keeps going.

Vibhu [00:00:58]: No, that was good. I think you had posted that you worked on a paper, and I was like, “Oh, very cool. We have Paper Club. Present then.”

Vibhu [00:01:04]: But I might have reached out to you after.

Swyx [00:01:05]: Because it's an amateur club, right?

Swyx [00:01:08]: So it's very unusual, but we sometimes have paper authors come by and actually explain the paper. Today we just did the Poolside paper, which was apparently very good.

Vibhu [00:01:18]: Came out yesterday.

Vibhu [00:01:19]: Pretty interesting, right? Fully open. They talk about everything, systems. So it's a good one. We'll recommend people read it.

Swyx [00:01:25]: Bring us up to speed on your transition to xAI, ‘cause I actually don't even know when you joined. Just tell the story of the transition.

From NVIDIA Cosmos to xAI: Scaling Video and World Models

Ethan [00:01:34]: Before xAI, I was working on the Cosmos world model at NVIDIA. Cosmos is a giant video foundation model that aims to simulate the world, and it serves as a foundation for all of the roboticists to build on top of.
There, once I had built Cosmos, I realized that this thing also has a scaling law similar to language models, so we need to scale up the video models further. That's why I realized I needed to move somewhere with much more compute resources. That's how I...

Swyx [00:02:13]: Than NVIDIA?

Vibhu [00:02:14]: The GPU rich came themselves.

Vibhu [00:02:19]: And timeline-wise, when was Cosmos? It was pretty early, right? It was an open world model, open paper, everything.

Ethan [00:02:25]: It was the end of 2024.

Vibhu [00:02:28]: End of 2024.

Ethan [00:02:30]: Then in mid-2025 I moved to xAI. I joined about the time when xAI was about to build video models and multimodal models. There was no infra, no data, and no model, and as just a few engineers, we built it in three months and released the first model, Grok Imagine 0.9.

Ethan [00:02:55]: Since then, I kept working on video models and moved more from training to post-training of the video models. For example, reference-to-video, kind of like the cameo feature, and video extension. Before I left, I worked on a world model, leading a small team focused on real-time, long-horizon video generation.

Building Grok Imagine From Scratch in Three Months

Swyx [00:03:24]: Can you give a rough roadmap? Okay, you're on a brand-new team. Grok previously was only text, or they partnered with BFL for their image gen stuff. What are the building blocks, right? You have compute, and data you can procure somewhere. What is the sequence of things that people should think about when setting up a new team?

Vibhu [00:03:43]: Actually, even deeper, not just data you can procure. You guys had to go through getting the data too, right?
So you shipped it pretty fast, but yeah...

Swyx [00:03:51]: Three months is like...

Vibhu [00:03:52]: From everything.

Swyx [00:03:52]: Actually very surprisingly fast.

Ethan [00:03:55]: One thing is thanks to my experience at NVIDIA, ‘cause the first time, when we were building Cosmos together, we built it for about a year. So this is the second time I've done it, and I roughly had an idea of what to do. I'd say the most important thing is the talent. Everyone was very strong and clever, working very closely with each other towards a common goal. That speeds things up a lot. You reduce the communication bandwidth needed among people, and everyone can work towards the same goal. Every day there weren't that many meetings on the calendar, maybe one sync a day, and after that it's all building. It was pretty fun at that time.

Ethan [00:04:47]: Another thing is that xAI has very strong foundations in data infra and model infra, and the support there helps model development a lot. When I look at training models, I think the most important thing is how many iterations you can do per day. The more iterations you can do, the faster you can train the model. So if you have very strong infra and a lot of compute, you can train these models in a very short period of time. That gives you a much larger buffer for errors, and it also gives you the opportunity to spot more bugs.

Iteration Speed, Compute, and Debugging Model Pipelines

Swyx [00:05:46]: What is an iteration? Is it like a few hundred steps, or what are you...

Ethan [00:05:50]: Let's say training the model end to end: acquire new data, maybe design new algorithms, and train a new model, maybe at a smaller scale or...

Swyx [00:06:01]: So cycle time for any hyperparam that you're searching.

Ethan [00:06:04]: Cycle time, and time to eval this model.
Is this model better than my previous iteration?

Ethan [00:06:11]: So...

Swyx [00:06:11]: So before you arrived, someone had already set this up so that you could iterate very quickly.

Ethan [00:06:15]: I think the foundation there is extremely good for developing and researching models.

Ethan [00:06:23]: And often I find, and this is kind of boring, that a lot of the improvements do not come from new algorithms. They come from finding small bugs here and there in the data pipeline and in the model training pipeline. Those give the biggest boost to model quality.

Vibhu [00:06:46]: It's interesting, right? You say it's a small team with less communication bandwidth, but also a lot of quality comes from finding little bugs. It seems counterintuitive, right? With a lot of people, you can iron out more of those, but it's interesting to see the other side.

Swyx [00:07:00]: I also wonder, do you try using LLMs to look for bugs? I don't know.

Ethan [00:07:05]: At that time it was mid-2025, so the coding models weren't quite there yet. I remember by December 2025 they were extremely good. I had been using them at that time. It was helpful, but sometimes it produced code that was kind of difficult to maintain, even though the first time it built something extremely fast. It gave me spaghetti code, thousands of lines that I couldn't maintain, and the LLM itself couldn't figure out what was wrong or how to improve on top of it. But now I find it much better. I want to bring up another point here: now that coding models are much more efficient and can help us implement stuff much faster, compute might become a bottleneck again. Previously, if you wanted to train a new model, say you wanted to generate new synthetic data or write a new algorithm, it might take a few weeks.
And during that period of time, you might not have experiments to run. But now you can build that thing within a few hours, and then you can immediately train a model.

Ethan [00:08:24]: Now you have to have enough compute to try all of the ideas. So compute might be the bottleneck on iteration speed again.

Swyx [00:08:36]: Honestly, I think it's kind of a stressful job, because you're like, “Well, I should be trying everything, and if I'm not, then I'm not doing my job well.”

Vibhu [00:08:48]: There's also the stress that you're eating up thousands of GPUs every hour, which is very expensive, and that compute could go to other researchers.

Swyx [00:08:56]: You got the daddy Elon to...

Vibhu [00:08:57]: You got daddy Elon.

Ethan [00:08:59]: It was...

Vibhu [00:09:00]: But there's still a finite amount of compute. You want to use it, you want to use it well, you want more of it.

Ethan [00:09:06]: That was quite stressful indeed. One thing is that with coding models now, a lot of these jobs can be automated, which is much better. Second, it's a marathon, so you've got to maintain good health and a regular schedule.

Vibhu [00:09:28]: It's hard to hear that when you ship from zero to one in two months.

Swyx [00:09:32]: And I think the culture at xAI is very famously that people work very hard. One thing I did want to dive into: in the notes that you sent ahead of time, you had specific comments about the cost of video gen training. Presumably this is on Colossus-1, right? The two hundred megawatt cluster. Whatever you want to share on that.

Vibhu [00:09:54]: I think there are three things we're talking about, right? There's video gen, and there's also the image gen model that you put out. Do you want to complete the... okay, so zero to one, you have a few months.
Just what are the stages of creating an image gen model?

Swyx [00:10:06]: Oh, yeah, maybe I got distracted.

How Image and Video Models Are Trained: Synthetic Captions, Tokenizers, and VAEs

Vibhu [00:10:07]: Sorry. And then from there there's video gen, there's audio gen. Would love to get into those next. But what are those first few months like? Small team, a lot of bugs, iterations, but what does it look like? Do we take something off the shelf? Do we just get data and compute? How do you get to a state-of-the-art image gen model? How do you even start?

Ethan [00:10:28]: I cannot comment specifically on how xAI did it, but it's quite a standard process. I can draw some examples from Cosmos. Mainly, to build a video model, you actually need to build an image model first. And to build these two models, the data you need is one hundred percent synthetic pairs of language and image, or language and video. Because on the internet, videos don't naturally come associated with text. You could say, oh, on YouTube you have the title and the description and the comments...

Swyx [00:11:11]: Title.

Ethan [00:11:11]: ...of a video, but usually they're not relevant to the video itself. Say the video is a natural scene of mountains or something, and the title is “I'm so happy today.”

Ethan [00:11:26]: So they have no correlation at all. So the first step is that you have to generate synthetic pairs of language with the videos. You gather videos from the internet, and you use a VLM to caption the videos. For that part, here's a question: how do you get a VLM to begin with? So if there's no...

Swyx [00:11:55]: So you use the model, right? Like...

Ethan [00:11:57]: Say no VLM exists. How do you generate the text in the beginning, right?
It's impossible.

Swyx [00:12:04]: I see.

Ethan [00:12:05]: In the beginning, you ask humans to describe the video in as much detail as possible. For example, you ask them to describe everything: all objects, all characters, and all interactions and dialogue in the videos. That's in the Cosmos labeling protocol. The objective we gave to the labelers was that you have to describe the video in as much detail as possible, such that a blind person who hears the text can reconstruct what the video is like in their head.

Swyx [00:12:43]: Video or image? You're talking about images.

Ethan [00:12:44]: Video or image, either one of them.

Vibhu [00:12:47]: This was pretty common when we went from CLIP and DALL-E, right?

Vibhu [00:12:51]: It's all training on really detailed captioning of images. So the same is applied to video, but instead...

Ethan [00:12:57]: Same applied.

Vibhu [00:12:57]: ...of using a multimodal model to pass in video and images and write rich descriptions, you can also...

Swyx [00:13:04]: I think there's this traditional perspective of a supervised, very highly human-curated thing. I feel like there's an unlock with unsupervised, right? Where you have enough to bootstrap that you can just throw a common corpus at it, or whatever. Unsupervised vision and language pairing, where you just have interspersed image and text and it just learns. To me, that is the VLM breakthrough that is different from CLIP, different from the LM era.

Ethan [00:13:36]: It's interesting to see that you kind of need both kinds of data.

Ethan [00:13:41]: For example, for the...

Swyx [00:13:41]: You need it to bootstrap it up. Yeah.

Ethan [00:13:43]: ...for the generative model training, there's also usually a small percentage of unlabeled data. So the model is instructed to generate a video without any text instruction. That can also help the model generalize.
After this stage of generating synthetic pairs, one important common step is to train a compressor, or tokenizer, for the images or videos. Theoretically, you can train image or video models on pure pixels, but the problem is that it's a lot of tokens. One image at a thousand by a thousand is one million pixels, so one million tokens. It's impossible to train a transformer on that. So you need to train a tokenizer, which can go from image to latent space and from latent space back to image.

Swyx [00:14:45]: That's why we named the podcast.

Swyx [00:14:48]: But basically, you're talking about vocabulary size.

Ethan [00:14:50]: So vocab.

Swyx [00:14:51]: And so, what is... like, a million is impossible?

Ethan [00:14:54]: In generative models, the vocab is continuous. It's a continuous space. You can think of it as mapping an image to a vector. It's a fixed-length vector, sixteen or forty-eight, something like that. And then you map that vector back to the image space. The mapping is patch-based. So say you have...

Ethan [00:15:22]: ...a sixteen by sixteen patch, and you map that patch of pixels into this latent space.

Swyx [00:15:29]: We've covered this...

Vibhu [00:15:30]: This is like the vision transformers...

Swyx [00:15:32]: VAEs.

Ethan [00:15:33]: VAEs.

Vibhu [00:15:34]: You basically compress your input, you do your generation, all of that generation in a smaller dimension, and then you project back out.

Swyx [00:15:43]: VAE is a form of compression, but for me, the patching thing is from ViT, right?

Ethan [00:15:48]: You can make those.

Swyx [00:15:49]: Literally, yeah, the paper is titled like “sixteen by sixteen is all you need,” something like that.
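As a rough illustration of the arithmetic in this exchange, the sketch below compares treating every pixel of a 1000 by 1000 image as a token with tiling it into 16 by 16 patches. The image size and patch size come from the conversation; the function name `patch_token_count` and the choice to round partial border patches up (as if the image were padded) are assumptions made for illustration, not a description of any particular tokenizer.

```python
import math


def patch_token_count(height: int, width: int, patch: int = 16) -> int:
    """Number of tokens when an image is tiled into patch x patch squares.

    Partial patches at the border count as full patches (assumes padding).
    """
    return math.ceil(height / patch) * math.ceil(width / patch)


pixels = 1000 * 1000                              # one token per pixel
tokens = patch_token_count(1000, 1000, patch=16)  # one token per 16x16 patch

print(pixels)                  # 1000000
print(tokens)                  # 3969
print(round(pixels / tokens))  # 252
```

Under these assumptions the sequence shrinks by roughly 250 times, which is the kind of reduction that makes transformer training on images tractable.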
And then I think people also make a lot of comparisons between this kind of patching and convolutions.

Swyx [00:16:02]: Which is, you're kind of reconstructing the old paradigm with the new.

Ethan [00:16:05]: Actually, in VAEs, there are both convolutional networks and transformers. You can actually do both.

Ethan [00:16:14]: After this VAE, what you've got is latent space tokens and language tokens. Now comes the training of the diffusion transformer; generative models usually use diffusion transformers. It is actually quite standard. It's very similar to how you train a language transformer model. There's not that much difference. It's just visual tokens in, visual tokens out. The only difference is that there's a denoising process. You add random noise to the visual tokens, and then you train the model to remove that noise to generate the clean tokens. At inference, the model can iteratively remove noise starting from one hundred percent noise.

Swyx [00:17:12]: And then, to speed things along on the tech tree of diffusion, there's CFG, and there's also latent diffusion; there's a lot in there. Somewhere along the line, obviously, Stability and all these other guys pioneered a lot of this architecture. I don't know if you want to get into that or just do the video side. Up to you.

Bootstrapping Video from Image Models and Temporal Compression

Ethan [00:17:37]: After you train such an image model, the reason it's a foundation for video models is that image models are cheaper to train, and they have a much denser connection between language and images. For example, you train on a billion images, and there's a mapping from the text to the image.
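The denoising recipe described in this segment (add random noise to the clean tokens, train the model to recover them, then at inference start from pure noise and remove it step by step) can be sketched in a few lines. This is a toy under stated assumptions, not how any production model is implemented: the linear blend between clean tokens and Gaussian noise is one common choice that the conversation does not specify, the "tokens" are plain floats, and `predict_clean` is a hypothetical stand-in for the trained diffusion transformer.

```python
import random


def add_noise(clean, noise, t):
    """Blend clean tokens with noise; t=0 is fully clean, t=1 is pure noise."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


def training_loss(predict_clean, clean, rng):
    """One training example: corrupt the tokens, predict, score with MSE."""
    t = rng.random()                                # random noise level
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = add_noise(clean, noise, t)
    predicted = predict_clean(noisy, t)
    return sum((p - c) ** 2 for p, c in zip(predicted, clean)) / len(clean)


def sample(predict_clean, length, steps, rng):
    """Start from 100% noise and lower the noise level a little each step."""
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for i in range(steps):
        t = 1.0 - i / steps             # current noise level, never zero here
        t_next = 1.0 - (i + 1) / steps  # slightly lower noise level
        x0_hat = predict_clean(x, t)    # model's guess at the clean tokens
        # Noise implied by that guess, from x = (1 - t) * x0 + t * eps.
        eps_hat = [(xi - (1.0 - t) * c) / t for xi, c in zip(x, x0_hat)]
        x = add_noise(x0_hat, eps_hat, t_next)
    return x


# Hypothetical "perfect" model that always returns the right answer; a real
# system would use a trained network conditioned on the text prompt.
target = [0.5, -1.0, 2.0]
oracle = lambda noisy, t: list(target)

rng = random.Random(0)
print(training_loss(oracle, target, rng))
print(sample(oracle, length=3, steps=10, rng=rng))
```

With the oracle in place of a trained network, the loss is zero and sampling lands on the target tokens, which only checks that the loop is wired correctly. The interesting part in practice is the learned predictor, which this sketch deliberately leaves out.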
And the cost to train the same thing, a billion texts to a billion videos, is much more expensive, because videos naturally have more tokens than images. For diffusion models, their understanding of language comes purely from this mapping. So if you don't have enough mapping, if you only train on ten million videos or something, you might not see enough language tokens in your training, and your model does not understand human intention well enough. That's why you first train the image diffusion model, and then you bootstrap the video model from there.

Swyx [00:18:53]: One thing I did want to ask, because I think you're the first video model person I've ever talked to. We've talked to Luma and all those folks. There are all these tricks in video compression where, frame by frame, there's not that much difference, so you don't actually have to regenerate or save the whole frame, right? I think that's MP4 compression or something like that.

Swyx [00:19:16]: Is it tempting to use that? As far as I can tell, everyone just treats it as, “No, we will just generate every frame.” Is that roughly the state of the art?

Ethan [00:19:27]: There are a few different approaches. Let's say, first, you want to directly use MP4 compression and use that as the tokens for the transformer to train on, right? People have actually tried that, but the main challenge is that the latent space of the MP4 tokens is not very comprehensible for the models. It's extremely hard to train on that. And there's a...

Ethan [00:20:01]: So that's why they created VAEs, which create a more continuous latent space, so the models can understand that latent space and learn from it much more easily. Even within VAEs, latent spaces differ in how difficult they are to learn from.
You can imagine the simplest, most naive VAE: you have an image, and you just shuffle all of the image's pixels into a vector. You don't need to train any VAE, right? But that latent space is extremely hard for models to train on top of. That's why there is some debate about how to compress the tokens. You mentioned you can compress frame by frame. You can also compress the temporal dimension.

Ethan [00:20:52]: The difference is that if you compress the temporal dimension, you get a much higher compression rate, because there's temporal redundancy between frames. This frame and the last frame are likely mostly similar, so there's only some small difference. For example, I think in the Wan 2.1 VAE, they have an eight by eight by four compression rate, so four temporal tokens are compressed into one token. That can save a lot of context length. If you do it frame by frame, you have to do maybe eight by eight by one, and your context length will be four times larger. That being said, the benefit of per-frame compression, and we might come back to this later, is real-timeness and interactivity. ‘Cause if you stream the output of the model frame by frame, the model can respond to any user request immediately. If you have a four-times temporal compression, then...

Swyx [00:22:06]: It might be laggy.

Ethan [00:22:07]: ...there's a lag there by nature.

Swyx [00:22:10]: So you're very pilled on this. Let's just go ahead and bring it up, ‘cause we have the visual prepared anyway. There are some frontier applications of real-time video gen. Flipbook is one of the examples that went viral recently, right? What is Flipbook?

Real-Time Generative UI: Flipbook, Neural OS, and Diffusion Front Ends

Ethan [00:22:23]: Flipbook is kind of like a web browser. You can see it has the web browser UI on top.
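To put numbers on the temporal compression trade-off discussed just before the Flipbook segment, here is a small sketch comparing an 8 by 8 by 4 compression with a per-frame 8 by 8 by 1 compression. The two compression rates come from the conversation; the clip size (96 frames of 512 by 512 video at 24 fps) is hypothetical, and the sketch ignores details real VAEs add, such as special handling of the first frame or further patchification inside the transformer.

```python
import math


def latent_positions(frames, height, width, spatial=8, temporal=4):
    """Count latent positions after spatial x spatial x temporal compression.

    Partial blocks at the edges are rounded up (assumes padding).
    """
    return (math.ceil(frames / temporal)
            * math.ceil(height / spatial)
            * math.ceil(width / spatial))


def response_granularity_ms(fps, temporal):
    """Milliseconds of video covered by one latent time step."""
    return 1000.0 * temporal / fps


# Hypothetical clip: 96 frames of 512x512 video at 24 fps.
joint = latent_positions(96, 512, 512, spatial=8, temporal=4)      # 8x8x4
per_frame = latent_positions(96, 512, 512, spatial=8, temporal=1)  # 8x8x1

print(joint, per_frame, per_frame // joint)
print(response_granularity_ms(24, 4), response_granularity_ms(24, 1))
```

Under these assumptions the per-frame variant needs four times the context length, while the temporally compressed variant can only react to input in four-frame chunks, which is the lag mentioned in the exchange.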
The difference is that all of the UIs are generated by a generative image model in real time, and everything here is fake. But you can explore inside this imaginary world. Here we have “Engineering the Great Pyramid.” The model generates this for us to understand how it works, and if we want to navigate around and understand further, we can click on some of the descriptions here, and the model will generate a new subpage describing the details we want to know about.

Swyx [00:23:14]: So it's basically like we're playing a video, but it's pausing for our next interaction, and then it just plays the next thing based on our interaction.

Swyx [00:23:23]: Which is kind of cool.

Vibhu [00:23:25]: And you kind of decide your story. So this was, how do you make a pyramid? The levering technique seemed interesting, right? It shows how do you take... Okay, I want to know what is this...

Swyx [00:23:35]: The demo tweet had more animation between frames.

Vibhu [00:23:38]: I think it's just skipping.

Swyx [00:23:39]: Oh, it's just skipping a lot of frames.

Ethan [00:23:40]: They also have a video mode...

Vibhu [00:23:42]: It takes a lot. There's a lot of people...

Ethan [00:23:42]: ...but a lot of people are using it.

Ethan [00:23:45]: So it's not available.

Vibhu [00:23:46]: There's a live video stream. We can try.

Swyx [00:23:50]: So this is an example of the kind of future that you see at the extreme. We're obviously not in it today.

Swyx [00:23:56]: But in a world where inference is completely free, this is better than generating code and text?

Ethan [00:24:02]: This is the end state for world models, I think. Imagine the internet doesn't exist, and then you type in google.com. What should a model show you? The model can imagine something, and this is what the model imagines. And these web pages, they completely do not exist.
So I think as inference costs come down, we are going to have generative UI for everything. Think about how coding models work: they write code for a web page, the code might be converted into binary, and the binary renders the pixels on the screen. In machine learning, every time we have some breakthrough, it's obviously more end-to-end. So why don't we go from user instruction to the pixels directly? Generative UI will be user intention to pixels directly. Take email: let's say everyone has the same interface, but I want it slightly different. I want email shown to me like TikTok, so I can swipe left and right through the emails. Or maybe you want something else. We can have completely different things. Or I'm looking at Instagram stories, and I don't like the Like button. I always misclick it. Generative UI resolves that. So it's going to be a revolutionary replacement of the interface. In the future, we might have much more powerful...

Ethan [00:25:50]: ...LLMs and coding models running behind the scenes. And in the front end, the diffusion model will actually be the front end that shows stuff to you. That's how I imagine it.

Swyx [00:26:02]: Diffusion front end, deterministic back end.

Swyx [00:26:04]: Something like that. I find that very expensive, but...

Vibhu [00:26:08]: I find it interesting you called LLMs writing code on the back end deterministic, but okay.

Swyx [00:26:14]: You write it once...

Vibhu [00:26:15]: Compared to...

Swyx [00:26:16]: And then you execute.

Ethan [00:26:17]: If you think about the cost, let's say an H100 costs $1 per hour, and you use this eight hours a day for thirty days, so every month you're paying $240. You'd actually not want to pay for that. That's even more expensive than Claude Code Max.
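The back-of-the-envelope cost here is easy to reproduce. The sketch below uses the figures from the conversation ($1 per H100 hour, eight hours a day, thirty days) and then projects how long it would take to reach a target price under a fixed yearly decline. The 2x-per-year decline is the rate Ethan goes on to suggest, and the 100x figure is the one Swyx counters with; the $20 per month target is purely hypothetical, as are the function names.

```python
def monthly_cost(rate_per_hour, hours_per_day, days):
    """Cost of keeping one accelerator busy for a month of daily use."""
    return rate_per_hour * hours_per_day * days


def years_until_affordable(cost_now, budget, annual_decline=2.0):
    """Whole years until cost <= budget if cost divides by annual_decline yearly."""
    years = 0
    cost = cost_now
    while cost > budget:
        cost /= annual_decline
        years += 1
    return years


cost = monthly_cost(rate_per_hour=1.0, hours_per_day=8, days=30)
print(cost)  # 240.0

# Hypothetical target: a $20/month subscription-style price point.
print(years_until_affordable(cost, budget=20.0, annual_decline=2.0))    # 4
print(years_until_affordable(cost, budget=20.0, annual_decline=100.0))  # 1
```

The gap between four years and one year shows how sensitive the timeline is to which cost-decline rate one believes.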
But if you consider that compute costs come down something like two times every year, I think the future will likely arrive within a few years.

Vibhu [00:26:49]: It's everything, right? Compute cost comes down, compute gets faster, models get smarter...

Ethan [00:26:54]: More efficient.

Vibhu [00:26:54]: ...models get smaller.

Swyx [00:26:55]: I don't know why you say two times, ‘cause I think it's like 100 times. In language models, it is roughly one hundred to a thousand times every twelve to eighteen months, for the same given level of LMSYS Elo.

Vibhu [00:27:08]: That's a net of everything, right? That's model performance alongside compute. So it's different from just compute costs coming down. But a very interesting future.

Swyx [00:27:19]: So the web designers will have to shout out that accessibility is an issue, right? How do you deal with screen readers or whatever? But yes, this is higher-bandwidth storytelling than anything you can possibly generate with code, right? So I think that's the rough idea.

Ethan [00:27:34]: And I'd like to add a little bit: humans naturally have the maximum input bandwidth when we are looking at things, looking at videos, and we have the maximum output bandwidth when we are talking. So in the future, it might be something like: we talk to AI models, and the AI model responds with a generative UI. That would be the maximum input and output bandwidth for interacting with AI models before Neuralink happens.

Vibhu [00:28:06]: And it's also very custom, right? Some people are very visual, some people are not as visual. They prefer text. But the best thing about generative UI is that it can also be text.

Swyx [00:28:17]: There's another project that we wanted to highlight, which is Neural OS. Kinda similar idea, but here you're literally simulating an operating system with a video model.

Swyx [00:28:27]: And you can play Doom, you can do Firefox.
I find this like mildly less impressive, obviously, because it's an OS that I can run.Swyx [00:28:37]: But here everything is imagined.Vibhu [00:28:40]: I was, used to the Command+W to close the Firefox tab. It didn't crash. That's why I saidSwyx [00:28:45]: It's too immersive.Vibhu [00:28:46]: It's, it's too immersive for me.Swyx [00:28:47]: Too immersive.Vibhu [00:28:48]: I wanted to close the tab.Vibhu [00:28:49]: But yes, I can play generated diffusion.Swyx [00:28:51]: this is shockingly fast.Swyx [00:28:54]: Because I remember there was a demo about like maybe one to two years ago. Someone tried to do the first-person shooter with a image model. There was no consistency. It was very slow. But here it looks like realistically it's-- this is Doom.Vibhu [00:29:07]: I think there's two sides to that, right? There's okay, what is running a game? The heavy part of it is actually the game engine, all the lighting, all that stuff, the graphics. This is just kind of video, right? Like we've solved consistency. This is still, it looks like a few years old image generation. There's some temporal consistency, but it's, it's kind of just images stitched together as frame video. But it's a good visual representation to pi- to picture the future you wanna see, right? that's, that's what I see in these more so.Ethan [00:29:38]: This reminds me of how the video models gets better and better. So Neural OS is kinda if you just look at it feels like it's just a crappy version of the, like the Windows we could have, right? And, but the difference is, so the model, this model is overfitted on the existing operating systems. It can generate nothing different than that. But it's actually also similar to video models. So when we are training these video model, image model, we train them on internet. There's no imaginary supernatural stuff on the internet. But once we train this model, you can prompt the model to generate something supernatural that have never existed in the data set. 
So if you train your Neural OS or neural computer on the standard screen recordings on the entire internet. The model can imagine completely new interface to interact with the computer.Swyx [00:30:43]: This is one of those things that is magical to me. usually generalizing out of distribution is bad, but somehow we have learned some kind of internal world model that you say, this plus, but it looks like rainbows and butterflies, it'll do it and it will kind of make sense.Swyx [00:31:03]: So yeah, that's kind of cool. Yeah, I don't know if there's any comment more on there. I do, I do wanted to, I did wanted to touch a little bit more on the model architecture stuff, which I think you were getting. It's, really fascinating. We don't get a chance to talk about this enough. So one of the papers that we covered, we've covered every annual, segment anything release. and I don't know if you follow-- you're a computer vision guy, so youEthan [00:31:26]: I knowSwyx [00:31:27]: . So they did memory attention, which is kind of interesting. And I always think, anything where you can, across the temporal dimension, keep some consistency, I think it's, very fascinating, and I don't know if Basically, does that-- the CV side bleeding into video gen side, I think is underexplored, right? we talk about it for labeling, but actually you can borrow the architecture itself.Ethan [00:31:50]: There's, there's also complete different approaches, right? you brought up the term world model, so we went from video model to world model. There is diffusion, but there's also other approaches that people are doing. So maybe we get into those after as well,?Swyx [00:32:03]: He has a whole definition of world models and stuff. I feel like we threw a lot at you. 
Whatever you want to comment on.

Why Video Models Are Expensive: Storage, I/O, and Training Scale

Ethan [00:32:10]: I think one thing we should come back to: we were talking about the steps to go from training an image gen model to a video model. One thing we don't see as much of is, you brought up the delta in training data, right? So

Ethan [00:32:24]: you won't have as much data, and a video model might not generalize, but what is the cost of training a large video model? We know it roughly for LLMs. Even the Poolside thing that came out today, right? It's a Gemma-level model trained on roughly forty trillion tokens on this many H200s over this much time. You can see the exact cost of that: how many GPU hours times how much an H200 costs. So how do we do the same back-of-the-envelope math for video models and image models? How do you break that down?

I can share some back-of-the-envelope calculations. Surprisingly, the cost of video models is comparable to language models. Obviously the largest scale is still language models, so maybe comparable to a medium-scale language model. Just storing the videos alone costs a lot. You can look it up on AWS or something.

Ethan [00:33:20]: Say you have a billion videos, and let's say each video is five megabytes. Then you need five petabytes just to store those videos. Also, remember we talked about using a VAE to compress the videos. Typically you also need to store those continuous features in your storage, and they're comparable in size to the videos themselves. So just storing the videos and the features is tens of petabytes alone. And

Swyx [00:33:58]: I just looked up the calculation.
Five petabytes on S3 Standard is one hundred K per month.Ethan [00:34:05]: AndSwyx [00:34:05]: It's comparableEthan [00:34:05]: and you needSwyx [00:34:06]: AndEthan [00:34:06]: And then like tens of petabytes, two hundred K. And even more expensive is you have the ingress and egress.Swyx [00:34:13]: Oh, yeah.Ethan [00:34:14]: Like you-- through the internet. You have to just to download those videos, I believe it's, it's more expensive on AWS than just storing those videos.Swyx [00:34:25]: Storing, yeah.Ethan [00:34:25]: And each training runs, you probably need to pull them once. If you train multiple times, it's, it's even more than that. So it's like just storing the network, those costs is just, it would be a few, a few millions per month to just storing everything, not to mention the GPU cost.Ethan [00:34:45]: AndSwyx [00:34:45]: my side tangent, the compute rental, like GPU rental is very efficient. There's one side, okay, you can be XAI and build your data center. Should we not just build our, storage compute as well? LikeEthan [00:34:57]: Of courseSwyx [00:34:57]: cloud cost compared to just,Ethan [00:34:59]: You save so muchSwyx [00:35:00]: store. Yeah, exactly.Swyx [00:35:01]: Especially with like egress and stuff. So.Ethan [00:35:04]: That's a good idea, but it also comes to-- there are some of its own challenges.Swyx [00:35:09]: Of course, of course.Ethan [00:35:10]: like people who build the GPU data centers, they might not expect this much, storage. And yeah, people build storage, typically they just build it somewhere with just CPUs.Swyx [00:35:23]: I just looked it up. Five-- AWS only charges for egress, not ingress. Tier five for five petabytes is two hundred and thirty K.Ethan [00:35:32]: Even more expensive than the storage.Swyx [00:35:34]: But storing is per month, right? You check in, then you cannot check out. so it's so cool. It's okay. 
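The storage and egress numbers quoted above can be reproduced with a short sketch. The per-petabyte rates below are not taken from any price list; they are simply implied by the figures quoted in the conversation ($100K per month and $230K of egress for five petabytes), and the function names are hypothetical.

```python
MB_PER_PB = 1_000_000_000  # decimal units: 1 PB = 10**9 MB

def dataset_petabytes(num_videos, mb_per_video):
    """Raw dataset size in petabytes."""
    return num_videos * mb_per_video / MB_PER_PB

# Rates implied by the figures quoted in the conversation (assumptions).
STORAGE_USD_PER_PB_MONTH = 100_000 / 5
EGRESS_USD_PER_PB = 230_000 / 5

def monthly_storage_usd(petabytes):
    return petabytes * STORAGE_USD_PER_PB_MONTH

def egress_usd(petabytes, pulls=1):
    """Cost of downloading the data `pulls` times, e.g. once per training run."""
    return petabytes * EGRESS_USD_PER_PB * pulls

raw_pb = dataset_petabytes(1_000_000_000, 5)   # one billion videos at 5 MB each
with_features_pb = 2 * raw_pb                   # VAE features assumed similar in size

print(f"raw videos: {raw_pb:.0f} PB")
print(f"storage incl. features: ${monthly_storage_usd(with_features_pb):,.0f}/month")
print(f"egress for 3 pulls of raw video: ${egress_usd(raw_pb, pulls=3):,.0f}")
```

Because egress is charged per pull while storage is charged per month, the number of training runs that re-download the data quickly dominates the bill.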
So there's that side.

Ethan [00:35:41]: So the TL;DR, my back-of-the-envelope math

Swyx [00:35:42]: Data is larger than you think. Yes.

Ethan [00:35:44]: my back-of-the-envelope math of GPU hours times GPU cost is also very much missing storage.

Swyx [00:35:49]: You're also basically more I/O bound than normal training.

Swyx [00:35:55]: Yes. Because of data loading, caching everything becomes super important.

Ethan [00:36:00]: In Cosmos, we did a lot of optimizations to make it not I/O bound. Speaking of actually training the model, the GPU cost: if you look at how big the open-source video models are, I think LTX has nineteen B parameters. That's a dense model. People are also exploring MoEs, so it might be twenty B active and hundreds of B total. So that's a similar size to medium-sized LLMs. And if you look at the number of tokens, we disclosed that in Cosmos: it's also tens of trillions of visual tokens. Putting this together, the cost of training these video models is actually comparable with LLMs. Not to mention, the infra is slightly different from LLMs, so it might be less efficient to train these models.

Inference Speedups: Step Distillation, Consistency Models, and GANs

Swyx [00:37:04]: Do you get the benefits of traditional diffusion speed-ups? For images, there's LCM, LoRAs for fine-tuning. There's a lot of stuff that's been

Ethan [00:37:15]: Flow matching.

Swyx [00:37:16]: There's flow matching. There's a lot of stuff that's been done. Is there some overlap that applies to diffusion on the inference side?

Ethan [00:37:23]: The inference side is a completely different story.

Ethan [00:37:28]: On the training side, it might be a little bit hard to reduce that cost. On the inference side, the biggest gain is from the distillation of these models.
It's called step distillation, which is slightly different from knowledge distillation in LLMs. Typically, flow matching models need something like 100 steps. A diffusion model needs even more, like 1,000 steps, to generate a good image or video. Step distillation tries to learn to generate in fewer steps from the model itself. You use the full model to generate in 100 steps, and then you take a model that only generates in 10 steps and let it learn from the full 100-step model.

Ethan [00:38:25]: Why this works

Swyx [00:38:27]: Strong to weak, seemingly.

Ethan [00:38:28]: It is. It's kind of

Swyx [00:38:29]: Distillation

Ethan [00:38:29]: kind of like strong to weak. From the modeling perspective, the strong model, the teacher, is trying to model the images and videos of the internet, and that distribution is extremely complex. But the step-distilled model is just trying to learn from the teacher. The teacher is a model of fixed size, so its distribution is much simpler than the whole internet. That's my intuition for why step distillation can work. Usually the models served in production only run a few steps. In Cosmos, I believe we have four-step and eight-step versions. For simpler tasks like image-to-image translation, it can run in even fewer steps, like one step in Cosmos Transfer.

Swyx [00:39:22]: I think this is the same intuition that guides a lot of the consistency model work. I sent you a link for sCM. I don't know if you covered that. To me, that was actually one of the most impressive papers I've ever seen from OpenAI.

Swyx [00:39:34]: This is the unifying grand concept of consistency models. I don't know if you have any comments on this.

Ethan [00:39:41]: So there are a few different approaches,

Swyx [00:39:46]: Oh, yeah. Here it is.

Swyx [00:39:47]: Two steps versus twenty or 100 steps, whatever.
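To make the 100-steps-to-10-steps idea concrete, here is a toy sketch in the spirit of progressive step distillation. It is not the method used in Cosmos or Grok Imagine: the "teacher" is a hypothetical one-line velocity field standing in for a trained network, the student has a single scalar weight, and it is fit in closed form by least squares rather than by gradient descent.

```python
import random

DT = 0.01   # teacher step size: 100 Euler steps cover t in [0, 1]
K = 10      # number of teacher steps folded into one student step

def teacher_velocity(x):
    # Hypothetical stand-in for a trained flow-matching network.
    return -x

def teacher_rollout(x, steps):
    """Run `steps` small Euler steps of the teacher."""
    for _ in range(steps):
        x = x + DT * teacher_velocity(x)
    return x

def fit_student(samples):
    """Least-squares fit of w so that one big step x + (K*DT)*w*x
    matches K small teacher steps on the sample points."""
    h = K * DT
    num = sum(x * (teacher_rollout(x, K) - x) for x in samples)
    den = h * sum(x * x for x in samples)
    return num / den

def student_rollout(x, w, steps):
    h = K * DT
    for _ in range(steps):
        x = x + h * w * x
    return x

rng = random.Random(0)
samples = [rng.uniform(-1.0, 1.0) for _ in range(256)]
w = fit_student(samples)

x0 = 1.0
teacher_out = teacher_rollout(x0, 100)       # full 100-step sampler
naive_out = x0
for _ in range(10):                          # 10 big steps, no distillation
    naive_out = naive_out + K * DT * teacher_velocity(naive_out)
distilled_out = student_rollout(x0, w, 10)   # 10 distilled steps

print(f"teacher (100 steps):  {teacher_out:.5f}")
print(f"naive (10 steps):     {naive_out:.5f}")
print(f"distilled (10 steps): {distilled_out:.5f}")
```

The student never sees "real data" here, only the teacher's outputs, which mirrors Ethan's intuition: matching a fixed model is an easier target than matching the original distribution.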
It's already done.

Ethan [00:39:52]: So there are a few different approaches, for example, consistency models. And actually, we shouldn't forget GANs. The GAN was the OG of

Swyx [00:40:05]: OG

Ethan [00:40:05]: step distillation, because it was trained for just one step to begin with. For example, there's distribution matching distillation, which uses a GAN as one of the losses for distillation. A GAN just tells you, “Hey, generate an image,” and then

Ethan [00:40:31]: it has a discriminator to tell whether this image is real or not. So the model just needs to learn part of the distribution, not the full distribution. In regular training, the model is asked to reconstruct the ground-truth image from the internet, which is extremely hard. When you're training a GAN, it's just, “Hey, you generate an image. Does this image look as real as the images from the internet?” That is a much simpler task. People typically combine a lot of these approaches, like consistency models, distribution matching, and GANs, and that's how we get these few-step models.

Audio-Video Generation and Time Alignment

Swyx [00:41:21]: Then there's one step I wanted to add, which is audio and video.

Ethan [00:41:26]: Grok Imagine 0.9, I believe, is the first audio-video joint model deployed at a large scale. So

Swyx [00:41:39]: And that was your first model?

Ethan [00:41:40]: That was Grok Imagine's first model. It's audio-video joint generation. I think the hard part is the modality alignment. Before this joint model, we had text-to-video alignment, this correspondence between text and video. Most VLMs understand images; video understanding is rarer, and they mostly don't understand audio.
And if you look at the audio generation on the LLM side, you can talk to them perfectly fine, but if you ask them to sing a song or something, it typically is not very good. Also, they don't have, they don't have music either. The hard part is thatUh, actually audio has two component. It has like a discrete component, a continuous component. The discrete component is like the language.Ethan [00:42:44]: So when we speak, it's just, someSwyx [00:42:47]: It's an ASR issue, yeah.Ethan [00:42:49]: It's, it's text token with some characteristics, I would say.Ethan [00:42:54]: But musicSwyx [00:42:56]: I think the speech guys would disagree with this.Swyx [00:42:57]: Like disfluencies and then,Vibhu [00:43:00]: There's tones you can get angry.Ethan [00:43:01]: Well, I say largely.Ethan [00:43:03]: the mu- but the music is completely different. It's, it's very continuous, and you cannot model them like discrete tokens in language models. this is like the hard part for models is, not to mention we have to align text, video, and audio together.Ethan [00:43:26]: SoVibhu [00:43:26]: How?Ethan [00:43:28]: So significant-- some significant challenges are like-- So first, like we talk about as the VLMs, they cannot understand most of them cannot understand audio.Ethan [00:43:39]: So you have to have some way to do the synthetic data generation for audio. You have to caption the model, and that involve, that involve synthetic data and human data effort a lot. And not just surprisingly, most of the LLMs are very bad at recognizing, like the beat, tone, and the details of the of music. They can, they can give some general prediction of which song is this, but it's very hard to describe the details of the music. like we mentioned in image generation, like you have to describe image as detailed as possible so that someone blind can reconstruct that. 
So here it's like someone

Vibhu [00:44:32]: Deaf

Ethan [00:44:32]: someone deaf can reconstruct how the music sounds without actually listening to it. Maybe you can think of it as needing to have what they call the script.

Vibhu [00:44:49]: Subtitles, yeah.

Ethan [00:44:49]: You've got to have all the details of the music and the dialogue.

Vibhu [00:44:55]: So is the challenge there typically stuff like music and audio? Is there a baseline, where there's enough data that we can understand narration and conversation, but the nuances in audio are where you hit all the data issues? Or do you just do it all from stage zero?

Ethan [00:45:15]: One important thing is the alignment. The model has to have a time-based alignment between video and audio: at which time step the video tokens and the audio tokens correspond to each other. We actually don't have this kind of alignment for most of the other modalities. If you think about text and image, or text and video, they are loosely aligned. You can have a description of what's going on in the video, but you typically don't have an exact description of what happened at time step one second.

Vibhu [00:46:02]: It's very

Ethan [00:46:03]: Or what happened at time step two seconds.

Vibhu [00:46:03]: coarse. Yeah.

Swyx [00:46:05]: So what was the ideal time step? You have to ablate it, and then it's like four seconds or something.

Ethan [00:46:09]: That comes down to how you design the model to be aware of time as a modality, so the model is time-aware. And that's something pretty unique if you think about LLMs. If you ask an LLM to complete a task, it will say, “Oh, this task will probably take twelve hours to complete,” and then it comes back in one hour.
Say “I've already spent two days on this and I've exhausted everything.”Ethan [00:46:47]: So the LLMs them-themselves, they don't have a sense of time there.Vibhu [00:46:53]: I actually don't think that's just them not having a sense of time. I think it's somewhat based, right?Vibhu [00:46:58]: Like you tell someone, “Okay, go work on this feature. Go implement this,” there's a general understanding you would have of how long that would take without LLMs working at LLM speed, right? So you think back like two years ago, if I tell you to like build me like a new front end for latent space, have a search bar, have all this, you'll estimate that it'll take a few days, right?Vibhu [00:47:19]: So you tell an LLM, “Go build this.” It'll take me a few days. But I think it's somewhat grounded as opposed to them not having the best-- Not saying that they have a great understanding, but I think that example is like you can see where it comes from, right? You're trained on all over the text.Swyx [00:47:35]: They're, they're trying to estimate what a human would say.Vibhu [00:47:37]: because that's what the, that's what the data kind of represents. It's not themEthan [00:47:41]: It came from the corpus on the internet. People have a estimate of how much time.Vibhu [00:47:45]: And not even just in direct like training samples, right? Just your world understanding of tokens of how long stuff takes, right? Go read a book. It'll take you a while, right?Vibhu [00:47:56]: Even if you do nothing but read a book, it takes a few days. So yeah, LLM, I read it took me a few hours.Vibhu [00:48:01]: It'll take me a few hours to go through this research. But this is a tangent.Swyx [00:48:05]: Somewhat, yeah.Swyx [00:48:06]: This is a train of thought I haven't really expressed until now is, which is basically like a full world model must also be recursive, meaning that the participant in the world model must also be aware that they have a world model. 
Which is this whole recursive thing down the line. And also that the world model can be wrong, and that they need to update it, and so on. We've argued this in the newsletter as well, that there need to be recursive or adversarial world models.

World Models: Real-Time, Long-Horizon, Interactive Video

Vibhu [00:48:34]: Just to ask, how do you define world model?

Swyx [00:48:38]: Oh, yeah, let's go there.

Ethan [00:48:40]: So

Vibhu [00:48:40]: Just for context, we talked about video generation. If you say there's a distinction between that and world models, what's your definition? How do you see the two?

Ethan [00:48:53]: Disclaimer: I'm not going to debate what a world model is. There are many definitions, so I'll just talk about mine. Since I come from the multimodal domain, I'm mainly talking from the video side. A world model is real-time, interactive, long-horizon video. So there are three parts. Let's talk about them one by one. First, interaction. We just looked at Flipbook and the neural computer. A world model allows you to interact with it through keyboard, mouse, and maybe also voice. These are all modalities. You can interact with the model, and the model should respond reasonably. The second part is real time. Say the world model generates a game and you move your mouse: how fast can the game respond? If you're a professional CS:GO player, you might say you have to respond within sub-ten milliseconds, or even less. (Crosstalk: “He's a beginner.” “No, sixty FPS. Let's go.” “Oh, three hundred FPS. Oh, five hundred FPS.” “I didn't do the math, but okay.”) Three hundred FPS is about three milliseconds per frame.

Ethan [00:50:29]: So you have to respond within a millisecond. Most video models cannot do that.
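The frame-rate arithmetic from the exchange above, as a small sketch. The per-step latency used in the second function is a made-up number for illustration, not a measurement of any model.

```python
def frame_budget_ms(fps):
    """Time available to produce one frame at a given frame rate."""
    return 1000.0 / fps

def max_denoise_steps(fps, step_latency_ms):
    """How many sampling steps fit in one frame's time budget,
    ignoring every other cost (VAE decode, networking, display)."""
    return int(frame_budget_ms(fps) // step_latency_ms)

for fps in (60, 300, 500):
    print(f"{fps} FPS -> {frame_budget_ms(fps):.2f} ms per frame")

# Hypothetical 4 ms per denoising step:
print(max_denoise_steps(60, 4.0))    # steps that fit at 60 FPS
print(max_denoise_steps(300, 4.0))   # steps that fit at 300 FPS
```

Under that hypothetical latency, a few-step distilled model just fits a 60 FPS budget, while at 300 FPS not even a single step fits, which is the sense in which most video models cannot meet competitive-gaming latencies.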
Yeah. But if you have a video model that is, say, a digital human, the response time can be more generous. Typically, for real-time voice interaction, it's around two hundred milliseconds. That's much more generous. But even two hundred milliseconds is pretty tricky, because remember we mentioned

Ethan [00:51:01]: you have this temporal compression coming from the VAE. If you don't compress the temporal dimension, your sequence length is going to explode. So if you want real-timeness in your model, you have to deal with this long-context problem. The third part is long horizon. You're not going to play a video game for just a few seconds, but most video models only generate a few seconds. We're going to play for minutes or hours. The model has to be able to generate long-form content.

Ethan [00:51:42]: So putting these three together, it's real-time, long-horizon, interactive video. I think the final state will be, for example, a video version of Flipbook, where you can interact with a neural computer. You move your mouse, you click on the generative interface, and it replies to you through pixels generated in real time. But it's a very long way to get there. One of the first steps at Grok Imagine, where I led a small world model team, was to build video extension. Video extension is the first step. (Crosstalk: “It's the first step of interactivity.” “You have it here, video editing, yeah.”) It's the first step because it unlocks long-horizon videos. Typically, for most video generation models, you give it a prompt or an image as an initial frame, you generate a video, and that's it. One time, done. Some creators would try to use the last frame as the first frame for the second video.
Sometimes it works, but if you do it a few times, the quality decreases. (Crosstalk: “It doesn't have that context over the full video.” “Exactly, because you only gave it the last frame, of course.” “It's actually a pretty fun hack.”) For example, I remember Veo 3 uses about one second of the last video as context. That is slightly better than using the last frame, but it has a similar problem: the quality decreases. If you extend a few times to one minute, the video quality looks much worse than the first video. Another problem is that the model doesn't have long-range knowledge of what happened before. Say it generates some dialogue with two people speaking: their voices might change over time, especially if the one-second conditioning doesn't cover the previous context. So these are the core challenges. The Grok Imagine video extension has the historical context of all of the previously generated videos. It has the context of who is speaking, what objects have appeared, and everything, and it uses that to generate the next video. If we do this naively, just putting all of the previous video tokens into the context, the context length will easily explode. Especially for video models, that can be a few million tokens of context, I would imagine.

Swyx [00:54:58]: Let's run with that.

Ethan [00:54:59]: For example, in Cosmos, I think just five seconds of video is fifty K or sixty K tokens. If you do fifty seconds, that's five hundred K tokens. If you go longer than that, it easily explodes. This long-horizon problem was the first step we tried to solve toward world models. It turns out people love video extension.
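A rough sketch of how the naive context grows, using the lower figure Ethan quotes (about fifty thousand tokens per five seconds of video). The quadratic cost model assumes plain full self-attention with no sparsity or compression, which is a simplification; the function names are hypothetical.

```python
TOKENS_PER_5S = 50_000  # lower end of the Cosmos figure quoted above

def video_tokens(seconds, tokens_per_5s=TOKENS_PER_5S):
    """Tokens needed to hold `seconds` of video in context, uncompressed."""
    return int(seconds * tokens_per_5s / 5)

def relative_attention_cost(seconds, baseline_seconds=5):
    """Full self-attention scales with the square of sequence length."""
    ratio = video_tokens(seconds) / video_tokens(baseline_seconds)
    return ratio ** 2

for label, secs in [("5 s clip", 5), ("50 s", 50), ("10 min", 600), ("2 h film", 7200)]:
    print(f"{label:>8}: {video_tokens(secs):>12,} tokens, "
          f"{relative_attention_cost(secs):,.0f}x attention cost")
```

Token count grows linearly with duration, but the attention cost grows with its square, so a tenfold longer video is on the order of a hundredfold more attention compute under this assumption.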
Like a lot, a lot of the creators love using video extension to create longer form videos. This is the part I liked that you have a, you have an intermediate step toward the final goal instead of just a straight shot to the final version very much.Swyx [00:55:48]: But I can see you have a strong vision of where we want to end up.Long Context, Redundancy, and Efficient Interactive VideoVibhu [00:55:51]: Does it seem like it's an efficiency issue? okay, we're at a few million tokens context,. If you draw the parallel to language models, we had very short context, two thousand, eight thousand, then, you scale it up one million, ten million. sure, there's effective context, but at the end of the day, it's just what's it worth? sure, there's a whole training data side. In video, it might be slightly easier ‘cause we have a hundred million token video, right? Just take a movie with the full context there. Like is this efficiency from an inference standpoint that like it's expensive, but we know how to solve it? Or like why is this not the approach? So like my broader point was on your second point of world models, you say it needs to be interactive and live, right? You should be able to play a game and see the interaction live. So one thing I see with research is a lot of what you actually serve is different than what you build, right? So we talked about distillation. You train big model, you distill it, you do quantization, speculative decoding. We do all this stuff to serve it efficiently. Should we not just have a solution, like a world model that can interact well, do inference optimization, serve it, distill it secondary, so make it real time after you solve it? So like a-- another parallel is say, continual learning, right? What we need is someone to solve it and show it works inefficiently. Give it a few years, people will make it efficient. Same thing with regular attention, right? It worked. 
Over a few years, people came up with different forms of attention, and we've scaled it to be efficient at long context. So there are kind of two things there, right? One is that it seems like it works. You've scaled it. Can we not just scale it a lot more efficiently over time? Do we need a separate approach if this works? And the same thing with interaction, right? If we can find some way that it works, we can solve making it more efficient from an inference standpoint later.

Ethan [00:57:53]: That's actually a very good point. In videos, there's actually a lot of redundancy. We remove a lot of the pixel redundancy with the VAE, but there's more redundancy in long-range, long-horizon videos. Say a character appears in the first clip, then disappears, and only reappears at the end of the video. You probably don't need that context in the middle of the generation. You only need that character where you need it. That's why I helped build another feature, reference video.

Vibhu [00:58:36]: Is it here?

Swyx [00:58:36]: Is it the same model release or a different one?

Ethan [00:58:39]: It's a different one.

Ethan [00:58:41]: You probably need to search on

Swyx [00:58:43]: I'll find it

Ethan [00:58:43]: X for reference to video.

Ethan [00:58:46]: Reference video allows you to upload up to seven images as conditions and generate the video. They can be characters, objects, or even scenes. Say I want to condition on Sean's selfie and holding a blade

Swyx [00:59:07]: We have a dog

Ethan [00:59:08]: or whatever.

Swyx [00:59:08]: We put the dog in the thing.

Ethan [00:59:09]: You can put them there, and the video model will generate the video from them and copy the context over. That can solve a lot of the problems there, like the long-context problem. It doesn't need to have a very long context. But I feel like it's an intermediate solution.
The model

Swyx [00:59:29]: It's cheating.

Ethan [00:59:30]: the model should be able to selectively know where it should draw the references from. Say I want to generate a movie, and I generate it autoregressively, ten seconds at a time or something. When this character appears, I can look back to where it first appeared and bring that back. Yeah, this one, I put the references. That's Optimus, Einstein, myself, Annie.

Vibhu [01:00:02]: Oddly enough, I used Grok Search to find it, and it pulled your LinkedIn post. But yeah, we found it.

Ethan [01:00:08]: Interesting.

Vibhu [01:00:10]: But

xAI's Underrated Work, Culture, and Watermarking

Swyx [01:00:11]: This is a problem. This is not your fault, but xAI doesn't communicate all this work that you do very well, because they just have the model release and then that's it. But actually, these details are very good.

Swyx [01:00:22]: As far as I understand, everything you just described is state of the art. No one else has done it.

Vibhu [01:00:30]: A lot of, yeah, I have a lot more

Swyx [01:00:32]: And then you just put out this blog post with the cookies. I'm like, this is not enough.

Swyx [01:00:37]: Obviously these are the high-level numbers that people want to know. But no, okay, so

Vibhu [01:00:42]: And I wonder, part of that is also that some labs don't share research into what happens. And if

Swyx [01:00:50]: No, but this is literally bragging about how good they are, right?

Swyx [01:00:54]: Like, why would you not say that you are capable of extending with full context? This is not a secret sauce. This is like, we did the work. Yeah, I don't know.

Ethan [01:01:02]: Different labs have slightly different communication styles.

Swyx [01:01:07]: Anyway, if anyone from xAI is listening, we are always happy to help you tell your story. Okay, so you did references, and I think the point you're making is that it is sort of a kludge, right?
You can do seven, but what about 100?

Swyx [01:01:23]: Right? Then you need a completely different thing.

Ethan [01:01:26]: So I think this is a mechanism to select the context from the history, and you might not put the entire history into the context. For example, there's a paper called FramePack, which has

Ethan [01:01:41]: a heuristic: for the latest history, the last one second, you keep it in full, and the history before that, you compress it and make the video smaller. They follow this overall pattern so that the maximum sequence length is fixed. The further you are from the current frame, the smaller the image. This is just a heuristic. I think it can be more automatic, where the model is aware of which part of the history can be selected. This part of the research is actually being actively worked on by a lot of people. It's also quite interesting. I feel this part of long context is a little bit ahead of the LLM side.

Ethan [01:02:31]: For example, in LLMs, the context keeps growing. Say you call a tool and the tool call history is extremely long. That's still in context, and it keeps growing. Even if you switch the topic to something else, the whole context is still there. There are some agentic harnesses that help you prune the tool results. For example, when you query a file, they only show the top 200 lines or something. Those are very heuristic-driven.

Swyx [01:03:08]: For listeners, we did a write-up on the Claude Code leak, where there are eight different kinds of pruning, including pruning the tool results and all that.
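The fixed-budget idea behind the FramePack heuristic Ethan describes can be illustrated with a geometric schedule. The halving rule below is an assumption made for this sketch, not necessarily the schedule the paper uses; the ten-thousand-tokens-per-second figure comes from the Cosmos numbers quoted earlier (about fifty thousand tokens per five seconds).

```python
FULL_TOKENS_PER_SECOND = 10_000  # roughly 50K tokens per 5 s of video

def chunk_budget(age_seconds, full=FULL_TOKENS_PER_SECOND):
    """Tokens kept for the one-second chunk that is `age_seconds` old.
    Age 0 is the most recent second and is kept at full resolution;
    each older second gets half the tokens of the one after it (assumed)."""
    return full // (2 ** age_seconds)

def history_tokens(history_seconds):
    """Total context used by `history_seconds` of compressed history."""
    return sum(chunk_budget(age) for age in range(history_seconds))

for secs in (1, 5, 60, 3600):
    print(f"{secs:>5} s of history -> {history_tokens(secs):,} tokens")

# Uncompressed, one hour of history would need this many tokens instead:
print(f"uncompressed 3600 s -> {3600 * FULL_TOKENS_PER_SECOND:,} tokens")
```

Because the per-chunk budgets form a geometric series, the total stays below twice the budget of the newest chunk no matter how long the history gets, which is what keeps the maximum sequence length fixed. The trade-off is that very old content is eventually dropped entirely, which is why Ethan suggests a learned, selective mechanism as the next step.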
So you can read up on that kind of thing.
Ethan [01:03:17]: I think one breakthrough in continual learning might be a way for the model to automatically manage its own context.
Swyx [01:03:27]: These are all heuristics, and they will be replaced by machine learning.
Ethan [01:03:30]: Interestingly
Vibhu [01:03:32]: The
Ethan [01:03:32]: the same thing is being researched in both LLMs and video models.
Vibhu [01:03:36]: The interesting thing is also that in the paper you showed, it's actually happening at the model level, right? Compared to language models, where, sure, we have base attention, but we'll do our own compression, we'll do our own pruning, which is separate from the model layer.
Vibhu [01:03:49]: Eventually, it all just boils in, hopefully.
Swyx [01:03:52]: I think this is a form of attention, but also, you know, sort of reasoning attention. I feel like that's different from normal attention.
Swyx [01:04:03]: Does that make sense?
Ethan [01:04:04]: It's different in the sense that attention, setting sparse attention aside,
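To make the FramePack-style heuristic Ethan describes more concrete (recent history kept at full size, older history compressed so that the total sequence length stays bounded), here is a minimal Python sketch. It is not the paper's actual schedule: the halving rule, the function name `history_token_budget`, and the `full_tokens` value are assumptions chosen only to illustrate why the total stays fixed.

```python
def history_token_budget(num_history_frames: int, full_tokens: int = 1024) -> list[int]:
    """Return the token budget for each past frame, newest first.

    Hypothetical schedule: the newest frame keeps `full_tokens`, and each
    step further into the past halves the budget (integer division).
    Frames whose budget reaches 0 are effectively dropped from context.
    """
    budget = []
    tokens = full_tokens
    for _ in range(num_history_frames):
        budget.append(tokens)
        tokens //= 2  # older frames get a smaller (more compressed) image
    return budget


# Example: the five most recent frames, newest first.
print(history_token_budget(5))  # [1024, 512, 256, 128, 64]
```

Because each step back halves the budget, the total never reaches twice the budget of the newest frame, however long the history grows. A learned selection mechanism, as Ethan suggests, would replace this fixed schedule with one the model chooses itself.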
In this episode, Ben Felix and Braden Warwick unpack the surprisingly complex world of expected return modeling and why it matters so much for retirement projections, portfolio construction, and financial advice. They explain how PWL Capital currently estimates expected returns across asset classes, why traditional Monte Carlo methods relying on Gaussian distributions may miss important market behaviors, and how new research could improve the realism of long-term financial planning simulations. The conversation also explores a fascinating collaboration between PWL and Columbia Engineering student John Yang, who worked with Professor Michael Robbins on a project to build more realistic synthetic return data for financial planning. John explains how his team used empirical distributions, t-copulas, and Extreme Value Theory to better capture market crashes, fat tails, and asset co-movements during periods of stress. Ben and Braden then analyze how these improved simulation methods affect financial planning outcomes, sustainable spending estimates, and projections for long-term wealth accumulation. Key Points From This Episode: (0:00:00) Introduction to expected return modeling and why it matters for financial planning. (0:00:25) The importance of volatility, correlations, distribution shape, and time-series behavior in portfolio projections. (0:01:26) How Scott Cederburg's research on block bootstrapping influenced PWL's thinking on simulations. (0:02:03) Introduction to Columbia Engineering student John Yang and the industry research collaboration. (0:03:30) How Conquest Planning allows PWL to upload custom return simulations. (0:04:05) A new PWL client's detailed reasoning for moving from DIY investing to working with an advisor. (0:06:22) Why financial planning and Monte Carlo simulations were central to the client's decision. (0:07:22) Cross-border financial complexity and the value of professional advice. 
(0:08:03) Estate planning, cognitive decline, and the role of trusted financial relationships. (0:10:02) Research on cognitive decline and its impact on financial decision-making. (0:12:00) Delegation, accountability, and reducing mental overhead through advisory relationships. (0:13:47) Why the client chose PWL specifically and the appeal of evidence-based investing. (0:15:25) Ben and Braden discuss the perceived disconnect between online discourse and demand for AUM advisors. (0:16:12) Overview of PWL's methodology for estimating expected returns across asset classes. (0:17:05) How PWL combines historical returns with market-implied expected returns. (0:18:07) The use of factor premiums and expected return composition in taxable projections. (0:18:48) Why PWL previously relied on Gaussian multivariate normal distributions for simulations. (0:19:41) Arithmetic vs. geometric mean returns and why the distinction matters. (0:21:01) A simple example illustrating volatility drag. (0:23:29) Why diversification benefits must be incorporated into expected portfolio returns. (0:25:15) How correcting portfolio math improved expected return estimates by 20–30 basis points. (0:27:12) Transition to John Yang's interview and introduction to synthetic data generation. (0:30:07) John explains the limitations of Gaussian return assumptions. (0:31:04) Why realistic sequences of returns matter for retirement planning. (0:32:16) Empirical evidence that returns are not truly random. (0:33:25) The three modeling challenges: unique asset behavior, realistic co-movement, and tail risk. (0:37:49) Separating marginal distributions from dependency structures in the modeling process. (0:38:48) Using a t-copula to better model asset co-movement during market stress. (0:39:39) Why historical data alone struggles to capture rare crisis events. (0:40:06) Applying Extreme Value Theory and Generalized Pareto Distributions to model tail risk. 
(0:42:15) How Monte Carlo simulations generate many realistic future return paths. (0:43:00) Imposing forward-looking expected returns and volatility assumptions onto the simulations. (0:44:56) How the new framework better preserves skewness and kurtosis. (0:46:38) Evaluating the new model using marginal shape, tail behavior, and co-movement scores. (0:48:10) Why the new model significantly improved tail realism without sacrificing correlations. (0:49:05) Future extensions including dynamic correlations and volatility clustering. (0:50:28) Potential future use of GANs and machine learning for synthetic financial data. (0:52:02) Key takeaway: financial planning requires realistic return paths, not just summary statistics. (0:53:41) Braden analyzes how the new simulation framework affects financial advice. (0:55:04) Why monthly index data produced fatter tails than long-term annual DMS data. (0:58:47) The new model improved Monte Carlo success rates by roughly 2–3%. (1:00:25) Sustainable spending estimates changed only modestly under the new simulations. (1:02:27) Why the improved methodology matters more for alternative asset classes. (1:04:25) The surprising finding that median wealth outcomes increased while mean outcomes decreased. (1:05:47) Why Gaussian simulations can create unrealistic runaway wealth scenarios. (1:07:20) The practical implications for estate planning and multi-generational wealth projections. (1:08:30) Why better simulation methods are especially important for concentrated and alternative investments. Links From Today's Episode: Meet with PWL Capital: https://calendly.com/d/3vm-t2j-h3p Rational Reminder on iTunes — https://itunes.apple.com/ca/podcast/the-rational-reminder-podcast/id1426530582. 
Rational Reminder on Instagram — https://www.instagram.com/rationalreminder/ Rational Reminder on YouTube — https://www.youtube.com/channel/ Benjamin Felix — https://pwlcapital.com/our-team/ Benjamin on X — https://x.com/benjaminwfelix Benjamin on LinkedIn — https://www.linkedin.com/in/benjaminwfelix/ Editing and post-production work for this episode was provided by The Podcast Consultant (https://thepodcastconsultant.com)
In this episode, we talk about the tragic and extraordinary incident in the USA in which a person on the runway ended up in the engine of an aircraft taking off, a topic that leaves us stunned and raises many questions. But what actually happens inside such an engine, and what damage do foreign objects cause when they get in? We dive deep into the world of bird strikes, discussing how they are tested, what damage they can cause, and what that means for our work as pilots. We share experiences from air rescue and police aviation and give insights into how we guard against bird strikes, from technical details about engines to curious statistics and real experiences in the sky. Tune in if you want to know how we deal with these dangers, what really happens behind the scenes, and why we love our job despite all the risks. Enjoy Abgehoben - der Hubschrauber Podcast
In episode 210 we talk about everything from the pile-on against the latest The Odyssey trailer, to music journalists on film in Mile End Kicks, to murder and midlife crisis in DTF St. Louis. We have played the racing anime Screamer, done some Lovecraftian puzzling in Call of the Elder Gods, and tried what we think will be the year's big surprise when the gaming year is summed up. We round off the episode with an exhaustive discussion of Christophe Gans's butchering of one of gaming history's masterpieces in Return to Silent Hill… or is it all going according to Gans's plan to fool us?
Books: Skolplattformen, Strange Pictures
Games: Directive 8020, Screamer, Call of the Elder Gods, Wardrum, The Adventures of Elliot Demo
Series: DTF St. Louis
Film: Mile End Kicks, Return to Silent Hill
Chanan Gans was a Jewish teen touring America in a rock band, immersed in the world of music, freedom, and self-discovery. At the very same time, Aharona was traveling through Thailand and India, living among monks and searching for spirituality, truth, and deeper meaning. Thousands of miles apart, both were on their own journey to find purpose and God. Then came a mysterious dream, an unexpected coincidence in Israel, and a meeting that changed everything. Today, Chanan and Aharona are Chassidish, raising a family in Baltimore and sharing the incredible story of how two wandering Jewish souls found faith, Judaism, and each other. This is their wild, emotional, and unforgettable journey of destiny, spirituality, and divine providence.
✬ SPONSORS OF THE EPISODE ✬
► BitBean: Smart Custom Software Built for You
Yaakov here. Just make the call. They can help you.
Reach Out Here → https://bitbean.link/MeEBlY
► Ohr Chadash Program
A life changing program in Ohr Somayach that can be right for you.
Reach Rabbi Eli Jaffa here...
WEBSITE: Ohr.edu/ohrchadash
PHONE: 058-535-8060
EMAIL: r.jaffa@ohr.edu
WHATSAPP: https://bit.ly/4dnV3vx
► Wheels To Lease: #1 Car Company
For over 35 years, Wheels To Lease has offered stress-free car buying with upfront pricing, no hidden fees, and door-to-door delivery.
→ CALL/TEXT: 718-871-8715
→ EMAIL: inspire@wheelstolease.com
→ WEB: https://bit.ly/41lnzYU
► NEW BOOK: To Live or To Cry
Struggling with stress, anxiety, overthinking, or addiction? To Live or To Cry reveals where stress actually comes from, helping people quiet their minds, break unhealthy mental patterns, and find genuine peace through surprisingly simple but life.
GET HERE → https://a.co/d/016ovVOX
✬ IN MEMORY OF ✬
This episode is in memory of:
• Miriam Sarah bas Yaakov Moshe
• Shimon Dovid ben Yaakov Shloima
#iftnLchaim.
Brienne and her very own child of misfortune arrive in Maidenpool and meet old acquaintances there whom Brienne would rather have forgotten. But there is no way around it, because her only lead for finally finding Sansa points to the "Stinking Goose" and to a shady fellow who is anything but trustworthy.
Tonight on America at Night, Dan Mandis fills in for McGraw Milhaven. Jared Gans, reporter for The Hill, joins the show to discuss the latest developments in California politics as Xavier Becerra emerges as a front-runner in the state's upcoming governor's race, and what it could mean for the national political landscape. Matthew Hurtt, Director of Professional Services at the Leadership Institute and an internationally recognized fundraiser and political organizer, joins the program to discuss redistricting efforts taking place across the country ahead of the next election cycle and how those changes could impact future congressional and state races. Later, Lt. Gen. Richard Newton, NewsNation Senior National Security Contributor, provides the latest analysis on Iran, breaking down the current geopolitical tensions and what recent developments could mean for U.S. national security and the broader Middle East.
Why listen to this episode?
In this exclusive content, our two guests share life paths in which ethics takes precedence over comfort and institutions.
Emilie Leblanc's "vegan matrescence": She recounts how she left a comfortable position as a digital project manager at Saint-Gobain, with a salary of more than €3,000, to devote herself to defending animals. Now a part-time employee of the Association végétarienne de France, she earns €800 a month but lives a life fully consistent with her values.
Astrid Prévost's legal battle: A nutritionist-dietitian, Astrid is pursuing legal proceedings to change the national BTS Diététique exam. She denounces the requirement that vegan candidates cook animals during the cooking tests, a practice she considers discriminatory and contrary to personal philosophical convictions.
The European ambition: Ready to go all the way to the European Court of Human Rights (ECHR), Astrid hopes her action will set a precedent for the 46 countries of the Council of Europe to ban discrimination against vegan people in exams.
Join Papatriarcat+
By subscribing to Papatriarcat+, you support an independent project while enjoying many benefits:
Early, ad-free episodes.
Access to more than 150 exclusive bonus episodes.
Privileged listening to explore topics of parenting and children's rights in greater depth.
Did you know? Emilie managed to shift her whole family to veganism, including her partner, thanks to campaigns like Veganuary.
Greetings in siblinghood and solidarity ✊
Andrew Gans produced a documentary about his dad, Vegas legend Danny Gans. Andrew joins First Look with Andy Morris to tell us about the movie, showing at the SLO Filmfest.
Sleep, fitness, or calorie burn: Timon is a big fan of gadgets he can use to track himself. But what do these values really tell us? And don't we end up making our sense of our own bodies too dependent on such numbers?
**********
You'll hear:
Interviewee: Timon, has tried out various wearable trackers himself
Interviewee: Can Dincer, Professor of Sensors and Wearables for Healthcare at the Technical University of Munich
Interviewee: Vivien Suchert, psychologist at the Institut für Therapie- und Gesundheitsforschung Kiel, has written a book about self-optimization through measuring the body
Author and host: Przemek Żuk
Editorial team: Ivy Nortey, Anna Maibaum, Friederike Seeger
Production: Jan Morgenstern
**********
Sources:
Sazonov, E. [Ed.] (2019). Wearable Sensors. Fundamentals, Implementation and Applications. Elsevier.
Ates, H.C., Brunauer, A., von Stetten, F. et al. (2021). Integrated Devices for Non-Invasive Diagnostics. Advanced Functional Materials, 31.
de Gans, C.J., Burger, P., van den Ende, E.S. et al. (2024). Sleep assessment using EEG-based wearables – A systematic review. Sleep Medicine Reviews, 76.
Ferguson, T., Olds, T., Curtis, R. et al. (2022). Effectiveness of wearable activity trackers to increase physical activity and improve health: a systematic review of systematic reviews and meta-analyses. The Lancet Digital Health, 4(8), pp. 615-626.
**********
Recommendations from this episode:
Suchert, V. (2019). Das vermessene Ich: Von Selbstkontrolle, Optimierungswahn und digitalen Doppelgängern. ecoWing. ISBN 978-3711002426.
**********
More on this topic at Deutschlandfunk Nova:
Body image: What does being fit look like?
Fitness: How do we really stick with it?
Self-optimization: Why self-tracking fascinates us so much
**********
You can find the article for this piece here.
**********
You can also follow us on these channels: TikTok and Instagram.
**********
Get in touch!
You can reach the Facts & Feelings team via WhatsApp. We want to know: What's on your mind? Do you have a topic we absolutely should discuss on the show and in the podcast? Send us a voice message or write to us at 0160-91360852 or at factsundfeelings@deutschlandradio.de.
Important: If you save this number and send us a message, you accept our data protection rules and, on WhatsApp, WhatsApp's privacy policy.
What separates the top 1% of AI professionals from everyone else? It isn't just coding; it's the ability to leverage the cutting-edge tools that drive innovation and high-paying careers. In this InfosecTrain masterclass, we pull back the curtain on the next generation of AI media creation, focusing on OpenAI's Sora and the latest image generation breakthroughs with Nano Banana.
The course, titled AI Media Creation Masterclass, dives into the fascinating world of image and video generation, specifically exploring the front-end development of visual assets. We break down the high-level mechanics of Diffusion Models and Generative Adversarial Networks (GANs), providing a roadmap for content creators and marketers to move from raw prompts to professional-grade media production.
The guys talk with Danny Gans's son about a new documentary, and we turn to AI for everything now!
Pastor Ilārs Plūme's sermon on the Gospel of John (10:11)
In Part 2 of this conversation, the focus shifts from understanding tinnitus to how it's actually managed.
Dr. Jennifer Gans returns to speak with Shari Eberts about tinnitus management strategies. Building on their previous conversation, she outlines a practical framework for evaluating treatments, centered on three core elements: reducing anxiety, providing accurate education, and supporting nervous system regulation. Rather than focusing on specific products or claims, the discussion emphasizes how individuals can make informed decisions in a crowded and often confusing landscape.
Dr. Gans also explores mindfulness-based approaches, sound therapy, hearing aids, and common misconceptions around supplements and “quick fixes.” The conversation reinforces a key idea: tinnitus is less about eliminating the sound and more about changing the brain's response, offering a grounded, evidence-based perspective for clinicians, researchers, and individuals seeking to reduce tinnitus distress.
Check out Dr. Gans' weekly column at: https://hearinghealthmatters.org/tinnitus-education-corner
Learn more about Dr. Gans and her work at: https://mindfultinnitusrelief.com/
Be sure to subscribe to our channel for the latest episodes each week and follow This Week in Hearing on LinkedIn, Instagram and X.
- https://x.com/WeekinHearing
- https://www.instagram.com/thisweekinhearing/
- https://www.linkedin.com/company/this-week-in-hearing
Visit us at: https://hearinghealthmatters.org/thisweek/
We've been on a bit of a mini World Models series over the last quarter: from introducing the topic with Yi Tay, to exploring Marble with World Labs' Fei-Fei Li and Justin Johnson, to previewing World Models learned from massive gaming datasets with General Intuition's Pim de Witte (who has now written down their approach to World Models with Not Boring), to discussing the Cosmos World Model with Andrew White of Edison Scientific on our new Science pod, to writing up our own theses on Adversarial World Models. Meanwhile Nvidia, Waymo and Tesla have published their own approaches, Google has released Genie 3, and Yann LeCun has raised $1B for AMI and published LeWorldModel.

Today's guests have a radically different approach to World Modeling from every player we just mentioned. While Genie 3 is impressive, its many flaws demonstrate the issues with that approach: terrain clipping, noninteractivity (single player, no physics, no objects other than the player move), and a maximum of 60 seconds of immersion. Moonlake AI (inspired by the Dreamworks logo) is the diametric opposite: immediately multiplayer, incredibly interactive, indefinite lifetime, capable of MANY different kinds of world models by simulating environments, predicting outcomes, and planning over long horizons. This is enabled by bootstrapping from game engines and training custom agents.

In Towards Efficient World Models, Chris Manning and Ian Goodfellow join Fan-Yun in explaining why their approach to efficiency with structure and causality instead of just blind scaling is sorely needed:

SOTA models still show physical or spatial understanding glitches, such as solid objects floating in mid-air or moving “inside” other solid objects.

If the goal is to plan for the next action, how often is a high-resolution pixel view necessary for modeling the world? Our bet is that there is a disproportionately large share of economically valuable tasks where such detail is not required.
After all, humans with a wide variety of sensory limitations have little difficulty doing almost everything in the world. Furthermore, for a large number of purposes, describing a scene or a situation in a few words of language (“the car's tires squealed as it cornered sharply”) is sufficient for understanding and planning.

Experiments also show that humans only partially process visual input in a top-down, task-directed way, often making use of abstracted object-level modeling. In almost all cases, partial representations combined with semantic understanding are sufficient.

…

If the goal is to facilitate the understanding of causality in multimodal environments, then the world model, whether it is used in the virtual world or the physical world, must prioritize properties such as spatial and physical state consistency maintained over long time periods, and an ability to evolve the world that accurately reflects the consequences of actions. That's what Moonlake is building.

Game engines are the right starting abstraction for efficiently extracting causal relationships, and Moonlake is building the interfaces and community (including their new $30,000 Creator Cup) to kickstart the flywheel of actions-to-observations.

We were fortunate enough to attend their sessions at GDC 2026 (the Mecca of Game Devs), and were impressed by the huge variety and flexibility of the worlds people were building with Moonlake's tools already!
Live videos on the pod.

Full Video Pod on YouTube!

Timestamps
00:00 Benchmarking Gets Hard
00:47 Meet Moonlake Founders
01:26 Why Build World Models
03:12 Structure Not Just Scale
05:37 Defining Action Conditioned Worlds
07:32 Abstraction Versus Bitter Lesson
14:39 Language Versus JEPA Debate
20:27 Reasoning Traces And Rendering Layer
37:00 Gameplay Over Graphics
38:02 Fiction Rules And World Tweaks
39:15 Code Engines Beat Learned Priors
41:10 Diffusion Scaling Limits
43:23 Symbolic Versus Diffusion Boundary
46:14 Platform Vision Beyond Games
50:24 Spatial Audio And Multimodal Latents
54:23 NLP Roots Hiring And Moon Lake Name

Transcript

[00:00:00] Cold Open
[00:00:00] Chris Manning: I think this whole space is extremely difficult as things are emerging now. And I mean, it's not only for world models, I think it's for everything including text-based models, right? 'Cause in the early days it seemed very easy to have good benchmarks, 'cause we could do things like question answering benchmarks. [00:00:20] But these days so much of what people want to do is nothing like that, right? You want to get some recommendations about which backpack would be best for your trip in Europe next month. It's not so easy to come up with a benchmark, and it's the same problem with these world models.
[00:00:41] Meet the Founders
[00:00:41] swyx: Okay. We're back in the studio with Moonlake's two leads. I guess there are other founders as well, but: Fan-yun Sun and Chris Manning. Welcome to the studio.
[00:00:54] Fan-yun Sun: Thanks. Thanks, Chris. Thanks for having us.
[00:00:56] swyx: You guys have burst onto the scene with a really refreshing [00:01:00] new take on world models. [00:01:01] I just want to, I guess, ask how the two of you came together. Chris, you're a legend in NLP and just AI in general.
You're his grad student, I guess.
[00:01:10] Fan-yun Sun: Actually my co-founder.
[00:01:11] swyx: Oh, yeah.
[00:01:12] Fan-yun Sun: I should give a lot of credit to my co-founder, Sharon. Yeah. She was actually working with Professor Fe Androgyn and then she ended up working with, Ron and Chris Manning here. [00:01:22] And so I got connected to Chris initially through my co-founder.
[00:01:26] What is Moon Lake?
[00:01:26] swyx: What is Moonlake? Actually, I'm also very curious about the name, but why go into world models?
[00:01:33] Fan-yun Sun: So I was working a lot with Nvidia Research during my PhD years on essentially generating interactive worlds to train reinforcement learning agents or embodied AI agents. [00:01:44] And then there were two observations, one in academia and one in industry. In industry, folks at Nvidia are actually paying a lot of dollars to purchase these types of interactive worlds, whether for the sake of evaluation or for training robots, policies, or models. And [00:02:00] in academia, the same thing is happening. [00:02:02] More specifically, when I was working with Nvidia on the synthetic data foundation model training project, we were generating a lot of synthetic data and showing that it is actually as useful as real-world data when it comes to multimodal pre-training. [00:02:16] But then, like I said, there's a lot of dollars being paid out to external vendors or other folks to manually curate these types of data.
It was very clear to us that, on our way to, let's call it, embodied general intelligence, models need to learn the consequences behind their actions, which means that they need interactive data, and the demand for those types of data is growing exponentially. [00:02:38] But everybody's sort of thinking about it from a pure, say, video generation perspective or something else. We feel like the true opportunity is building reasoning models that can do these things the way humans do them today. So that's a little bit on the genesis of Moonlake, and I think the reason I got into world models was partly [00:02:59] a philosophical [00:03:00] take on the world, where I believe the simulation theory and stuff like that. But on the other hand, it's really just that there's an opportunity there that I feel like nobody's doing the way I think it should be done.
[00:03:10] Structure, Not Scale: The Vision
[00:03:10] Chris Manning: I can say a little bit about that. [00:03:12] Yeah. So the overall goal is the pursuit of artificial intelligence, and most of my career has been doing that in the language space, and that's been just extremely productive. As we all know, it's the story of the last few years; I don't have to tell you how much we've achieved with large language models. But [00:03:31] although they have been extremely effective for ramping language and general intelligence, it's clearly not the whole world. There's this multimodal world of vision, sound, taste that you'd like to be dealing with, more than just language. And then the question is how to do it. And despite a huge investment in the computer vision space, right, as a research field computer [00:04:00] vision has been for decades far, far larger than the language space, actually. [00:04:05] I think it's fair to say that vision understanding sort of stalled out, right?
You got to object recognition and then progress just wasn't being made, right? If you look at any of these vision language models, it's the language that's doing 90% of the work and the vision barely works. And so there's really an interesting research question as to why that is, and at heart, the ideas behind Moonlake are an attempt to answer that, believing that there can be a really rich connection with a more symbolic layer of abstracted understanding of visual domains, which isn't in the mainstream vision models, which are still trying to operate on the surface level of pixels.
[00:04:50] swyx: I think in one of your blog posts, you put it as structure, not scale. Is that a general thesis?
[00:04:57] Chris Manning: Yeah. Well, scale is good too.
[00:04:58] swyx: Yeah. Scale is good too.
[00:04:59] Chris Manning: [00:05:00] Lots of data is good as well, and scale, but nevertheless, you want the structure, yeah, to be able to learn much more efficiently.
[00:05:07] swyx: Yeah. The other thing I really liked is that you put out an example of what your kind of reasoning traces look like. [00:05:12] Right. Which you would, distill is the word that comes to mind. I don't even think that's a good description, but it would involve, for example, geometry, physics, affordances, symbolic logic, perceptual mappings, and what have you. But that is the kind of example that involves, let's call it spatial reasoning, world model reasoning, as compared to normal LLM reasoning. [00:05:35] Yeah.
[00:05:36] Defining World Models vs Video Generation
[00:05:36] Vibhu: But also, taking it a step back: how do you guys define world models? A lot of people see, okay, you can do diffusion, you can do video generation. But you guys put out quite a few blog posts. You put out an essay recently, we can even pull it up, about efficient world models.
You have a pretty structural definition here, but for the general audience that doesn't closely follow the space, right, [00:05:55] what's the difference between what we see from a video generation model and [00:06:00] a world-gen simulator? How do you kind of paint that last
[00:06:02] Chris Manning: year? Yeah, so I think this is actually a little bit subtle, because people look at these amazing generative AI video models, Sora, Veo 3, Genie, one of these things, and they think, oh, this is amazing. [00:06:17] We've solved understanding the world because you can produce these generative AI videos. But the reality is that although the visuals do look fantastic, those visuals aren't actually accompanied by an understanding of the 3D world, of how objects can move and what the consequences of different actions are, and that's what's really needed for spatial intelligence. [00:06:49] So a term we sometimes use is that you need action-conditioned world models: you only actually have a world model if you can predict, [00:07:00] given that some action is taken, what is going to change in the world because of it. In particular, that becomes hard over longer time scales. If you're simply trying to [00:07:12] predict the next video frame, that's not so difficult. But what you actually want to do is understand the likely consequences of actions minutes into the future. And to do that, you actually need much more of an abstracted semantic model of the world.
[00:07:32] The Bitter Lesson & Data Abstraction
[00:07:32] swyx: Yeah, the question comes where you want to have more structure than is available in just predicting the next token. [00:07:41] And typically, well, let's call it the experience of the last five years has been that that is just washed away by scale, right? So what is the right middle ground here, where you don't ignore the bitter lesson, but also you
can be more efficient than what we're doing today?
[00:07:57] Chris Manning: One possibility [00:08:00] is, look, if we just collect masses and masses and masses and masses of video data, this problem will be solved. [00:08:11] Under certain assumptions that could be true, but there are multiple ways in which it could turn out not to be true. The first is that what's really essential is understanding the consequences of actions, producing an action-conditioned world model. And if you are simply collecting observational video data, which is the easy stuff to collect when you're mining online videos, you don't actually [00:08:41] know the actions that are being taken to see how the video is changing. So if you are never collecting actions directly and you have to try to infer them from what happened in the observed video, that's not impossible, but it's very [00:09:00] hard, and it's not really established that you can get that to work at any scale yet. [00:09:05] And so there's a lot of premium on collecting action-conditioned video data, which is part of why there's been a lot of interest in using simulation, so that you can collect data where you do know the actions, which is in quite limited supply. But there's also this: in the limit of as much data as you could possibly have, [00:09:28] maybe the problem is eventually solvable, but even though we collect huge amounts of text data, it is always at a great level of abstraction, right? Language is a human-designed, abstracted representation where there's meaning in each token and it's representing an abstraction of the world, right? [00:09:51] As soon as you are describing someone as a professor, and as soon as you are saying that they're condescending, right? These are very
It's not what you're observing at the pixel level, and to get to that kind of degree of abstraction starting from pixels takes orders of magnitude of extra data and processing.[00:10:14] And so, although we absolutely want to get as much data as possible and use the bitter lesson, nevertheless, if there are ways in which you can work with five orders of magnitude less data than people working purely from pixels, you're gonna be able to make a lot more progress, a lot more quickly.[00:10:34] And that's the bet here. And so you could just say that's only wanting to be able to do it more efficiently, more quickly, more cheaply. But I think it's actually more than that. I think one should be making the analogy to how human beings work. At one level, yes, we have these high [00:11:00] resolution eyes and we can look and see a scene like a video, but all of the evidence from neuroscience and psychology is that most of what comes into people's eyes is never processed.[00:11:13] Right. You are doing fairly fine-grained processing of exactly what you're focusing on. But as soon as it's away from that, it's, yeah, there's another guy over there; you're only processing, top down, this very abstracted semantic description of the world around you. And so, that's what human beings are doing.[00:11:33] They're working with semantic abstractions. And so I think it is just the right representation, 'cause we also have other goals. We want to be able to do real-time worlds, so that means there's a limit to how much processing you can do, and we want to do long-term planning and consistency. And again, that favors abstraction.[00:11:55] I mean, I guess there was actually a recent blog post that [00:12:00] came out from our friends at Physical Intelligence, and they were sort of heading in the same direction. They were saying[00:12:06] swyx: Oh, the pi model.[00:12:07] Chris Manning: Yeah. Yeah. 
To maintain a long-term memory of what's happening in the world, so we can do longer-term things, we're actually storing text of what has been happening in the world.[00:12:19] Right. It is not such a successful strategy to try to keep it all at a pixel level.[00:12:24] Vibhu: And yeah, I mean, you can see it in video models, like that temporal consistency. We're at a scale of training on all the video data we have, and we have it for maybe 30 seconds, a few minutes. That's not the same as a game state played for half an hour.[00:12:37] Right. I thought you guys break it down pretty well. You have a blog post about building multimodal worlds with an agent. I dunno if you guys wanna talk about this. This is one of the things I read, I thought.[00:12:48] swyx: Yeah, it's the thing I talked about with the reasoning chain. Yeah.[00:12:51] Vibhu: So there's different phases to this.[00:12:53] It seems like it's more of an agent, a scaffold, a very different approach than just typing in a prompt, where you don't have the same consistency. [00:13:00] Also, for people that are listening, I would highly recommend reading it. It breaks down the problem in a different light, right?[00:13:06] So, what do you need to consider when you're talking about video, like world or game models, right? What are the factors? What are the elements? What's the state? So I don't know if you guys have stuff to talk about for this one.[00:13:19] Fan-yun Sun: Yeah. Actually, I wanted to add on a little bit,[00:13:22] on our previous point, since we changed topics so quickly. I do feel like sometimes people confuse it, like, oh, we're taking a method with abstraction, so that means they don't believe in the bitter lesson. Like, that's just false, right? We are believers in the bitter lesson. 
But then I feel like the question that we always discuss is, what is the right abstraction level today?[00:13:42] The analogy I like to make is, let's just say we can encode and decode, represent all of images, videos, and audio in bytes. Then the most bitter-lesson approach is to train a next-byte prediction model as opposed to a next-token prediction model, where it's just like, okay, it's natively multimodal. But it's like, yeah, [00:14:00] to Chris's point, it's the scale and compute you need to achieve that.[00:14:03] So that's why we always come back to, okay, what is the most efficient way to do it? And reasoning models, to the point of this blog post, are a showcase of, hey, we're actually just reasoning about the world and reasoning about the aspects of the world that matter for me to learn what I want to learn from this world model.[00:14:21] swyx: Yeah, it's like you're improving the encoder of whatever you're trying to model. And a better representation would just represent the important things in less space. Yeah. Which would just be more efficient.[00:14:33] Fan-yun Sun: Yeah.[00:14:34] swyx: So yeah, I fully agree that it is not antagonistic to the bitter lesson.[00:14:38] I do wanna mention one more thing. Is there any philosophical difference with the JEPA stuff that Yann is working on? I gotta go there. You're imagining some latent abstraction. I'm like, okay, fine, let's talk about it, right? Like, it's an elephant in the room.[00:14:52] Chris Manning: Yeah.[00:14:53] JEPA & Philosophical Differences with LeCun[00:14:53] Chris Manning: There are philosophical differences. Yann LeCun is a dear friend of mine, but [00:15:00] he has never appreciated the power of language in particular, or symbolic representations in general. Yann is a very visual thinker. 
He always wants to claim that he thinks visually and there are no words, symbols, or math in his head.[00:15:21] Maybe that's true of Yann. It's certainly not the way I think. But at any rate, the world according to Yann is that the basic stuff of the world and of intelligence is visual, and language is just this low-bit-rate communication mechanism between humans, and it doesn't have much other utility and it's far inferior to the high-bit-rate video that comes into your eyes.[00:15:53] And I think he's fundamentally missing a number of important things [00:16:00] there. Think of this evolutionary argument looking at animals, right? The closest analogy is with chimps, right? So chimpanzees have fairly similar brains to human beings. They have great vision systems, they have great memory systems.[00:16:18] They've got better short-term memory than we do. They can plan, they can build primitive tools. But humans are massively ahead in what we understand about the world, what we can plan, what we can build. And essentially what took off for us was that humans managed to develop language, and that gave a symbolic knowledge representation and reasoning level, which just allowed this sort of vaulting of what could be done with the intelligence in brains.[00:16:59] So the [00:17:00] philosopher Dan Dennett refers to language as a cognitive tool and argues that humans, unique among the creatures in the world, have managed to build their own cognitive tools, and language is the famous first example. But other things like mathematics and programming languages are also cognitive tools.[00:17:21] They give you an ability to think in abstractions, in extended causal reasoning chains. And that allows you to do much more. And we use that for spatial representation and intelligence and planning and gameplay as well. 
So we believe, and this is underlying the specific technologies that Moon Lake is making, that symbolic representations are powerful.[00:17:50] And you want to use that in your understanding of the visual world when you want a causal understanding, when you want to maintain long-term [00:18:00] consistency and prediction. And as I understand it, that's just not in Yann LeCun's worldview. So I think that's the fundamental philosophical difference. Then there's the specific model.[00:18:11] He's been advancing JEPA, and that's a reasonable research bet as a direction to head for building out a model of the visual world. To my mind, it's one reasonable research bet. It's not really established that it's the best one that everyone should be following,[00:18:32] swyx: at least as developed at scale, at Meta.[00:18:34] But it's not just vision, right? I mean, JEPA is just joint embedding prediction; it can be applied to anything really. And people have done it. The argument is that if there is a latent representation that is probably more suited to the task, then why not let machines find it for us instead of predefining it at all?[00:18:50] And isn't something like a JEPA-shaped thing the right answer? And if not, why not?[00:18:55] Chris Manning: So I think there's a part of JEPA that's right, which is [00:19:00] you do want to have a joint embedding that gives you a consistent model of the world. And Yann's argument is you can never get that from autoregressive language models, 'cause they're sort of left to right, churning out one token at a time.[00:19:22] I guess this is where we're in the research arguments of the field. I'm not actually convinced that's right. 'Cause although the token production is this autoregressive process that's heading left to right, I guess it doesn't have to be left to right. 
But anyway, it's a sequence of tokens; we could have right-to-left Arabic.[00:19:40] But although that's true, all of the weights of the model that are internal to the transformer are a joint model of the model's understanding of the world. And so I think you can think of the weights of the model as a form of joint representation, [00:20:00] and therefore it is plausible to think that could be the basis of a world model, which avoids Yann's objections.[00:20:10] swyx: I think I follow, and obviously that would touch on what Moon Lake eventually ends up doing as well. Right. Which is hard to tell, because you put out the end results, but we don't know the inputs that go into it. So that's something that we have to figure out over time.[00:20:25] Vibhu: Yeah. I mean, I guess this kind of breaks down some of the outputs. Do you wanna walk us through it?[00:20:31] Reasoning Traces & Interactive Worlds[00:20:31] Fan-yun Sun: Yeah. So this really just walks us through the reasoning traces. Let's just say we wanna build a world; in this context, it's really just a game demo that shows the variety of interactions that this world model can build.[00:20:45] And yeah, it's really just the reasoning traces of, okay, it was prompted to create a bowling game. How did it achieve what you saw, that level of causality, interaction, and consistency, right? So yeah, this is almost just like an example of [00:21:00] a reasoning trace.[00:21:01] swyx: Very detailed.[00:21:01] Fan-yun Sun: Yeah.[00:21:01] Vibhu: Very, very detailed.[00:21:02] You don't even realize it, right? Like when a video is generated, what happens when a ball strikes a pin, right? So first, there's audio in that: audio triggers happen, the score increments, the world changes. Pins have to start dropping. There's a timer that goes on. 
It's very similar to how we're now used to reasoning for language models.[00:21:20] There's a whole state of what happens. So geometry, physics, all this stuff. And then, yeah, there's kind of that single prompt. So assets, all this stuff. It's a nice view to see what's going on.[00:21:32] swyx: I think Sun is also too polite to point out that both Google's Genie demos and World Labs' Marble do not have interactive worlds.[00:21:41] Fan-yun Sun: That's the benefit of having a reasoning model, right? Because you can say, oh, maybe in this particular context, I want to learn how to bowl. And then you can say, okay, then what is important when it comes to learning how to bowl? Okay, maybe it's that I need to understand the basics of the physics and I want to throw it over [00:22:00] them.[00:22:00] I wanna know that when it resets it's a new game. So yeah, basically, you know to pick up the ball, you know that the ball's gonna cause the pins to fall down. You know that what's important to this particular bowling game is to score, and you know that the score corresponds to the number of pins that fell down.[00:22:19] So if it's a model that sort of knows what a bowling game looks like, but doesn't actually allow you to practice over and over again and to understand what it takes to actually get a high score, then it doesn't actually allow you to learn what you set out to learn within the world model.[00:22:38] And I think this is really just one example showing the advantages of the approach that we're taking over, let's call it, the zeitgeist today when people talk about world models.[00:22:51] Chris Manning: Right. 
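
The bowling rules described in this exchange (a throw knocks pins down, the score tracks the pins that fell, a strike fires an audio trigger, and a reset starts a new game) can be read as an action-conditioned transition over explicit state. The sketch below is only an illustration of that idea, not Moon Lake's implementation: `BowlingState`, `step`, the event strings, and the `pins_hit` field (a stand-in for what a physics engine would compute) are all hypothetical names, and the scoring is the simplified "score equals pins down" rule from the conversation, not real bowling scoring.

```python
from dataclasses import dataclass

TOTAL_PINS = 10  # assumed lane setup for this toy example


@dataclass(frozen=True)
class BowlingState:
    """Explicit, symbolic world state (hypothetical structure)."""
    pins_standing: int = TOTAL_PINS
    score: int = 0
    events: tuple = ()  # side effects fired by the last action, e.g. audio cues


def step(state: BowlingState, action: dict) -> BowlingState:
    """Action-conditioned transition: (state, action) -> next state."""
    if action["kind"] == "throw":
        # `pins_hit` stands in for the outcome a physics engine would produce.
        knocked = max(0, min(action["pins_hit"], state.pins_standing))
        events = ("audio:pin_strike", f"score:+{knocked}") if knocked else ()
        return BowlingState(state.pins_standing - knocked,
                            state.score + knocked,
                            events)
    if action["kind"] == "reset":
        # Per the conversation: a reset means a new game.
        return BowlingState(events=("new_game",))
    raise ValueError(f"unknown action: {action['kind']}")


state = BowlingState()
state = step(state, {"kind": "throw", "pins_hit": 7})
print(state)
```

Because the state is explicit, the same question can be asked of any candidate action before taking it, which is the "I can infer what would happen if I do any of them" property discussed next.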
So it sort of seems like the question to ask when there's a world model is:[00:22:58] can I not [00:23:00] only just wander around the world and look at the beautiful graphics, but can I interact with the objects in the world and see the right consequences of actions?[00:23:11] Vibhu: And you also understand what the consequences would be if you do something, right? So it's not just like, okay, there's one thing; if I pick it up, something will happen.[00:23:19] But there's 50 options and I can infer what would happen if I do any of them. Right. So it's very different when you can actually see it and play around with it.[00:23:28] Beyond Unity: Cognitive Tools for World Building[00:23:31] swyx: There's two cheeky elements of that. I mean, I guess the less ambitious one is, let's really establish for listeners, why is this fundamentally different than writing Unity code, right?[00:23:40] Like just creating a model to translate a prompt into Unity code.[00:23:44] Fan-yun Sun: So there is an underlying physics engine. Yeah. In that sense, there are some overlapping things with Unity, but the way we think about it is that physics engines, tools, or code are cognitive tools, borrowing Chris's term, right? Tools [00:24:00] that the model can employ as means to an end.[00:24:04] So today maybe you say, okay, in this particular context we care about physics, we care about the long-term causal consequences. Then yes, we employ a physics engine. And then maybe tomorrow we say, okay, we're training, let's just say, drones, where we only really care about fluid dynamics and the visual aspect of the world.[00:24:25] Then yeah, maybe the model actually doesn't have to use a physics engine. Or maybe it employs other types of representation or physics engine to achieve the task. 
So yes, writing code for Unity is sort of similar to a tool that our model can employ, but our goal is for a model to take a representation-conditioned reasoning[00:24:46] approach or process.[00:24:47] swyx: Yeah,[00:24:47] Fan-yun Sun: internally.[00:24:48] swyx: Yeah. Using these things as just general tool calls. Right. Which I think is very interesting. The other, more ambitious one is some kind of recursive element where it becomes multiplayer, right? Like here, there's a single-player element; you're not [00:25:00] modeling any other people involved.[00:25:01] And that is a whole other thing.[00:25:04] Fan-yun Sun: But in fact, we can really do multiplayers. Oh yeah, okay. I haven't seen any double situations. You actually just prompt our model to say, hey, configure to multiplayer, and it'll do that. It'll be able to configure a multiplayer[00:25:16] swyx: Great.[00:25:17] Fan-yun Sun: persistency database for you.[00:25:18] Easy. Yeah.[00:25:19] Vibhu: So what are some of the current limitations in where we're at? So there's one approach of, okay, scale up video predictors. Obviously there's data issues. With approaches like this, is it data constraints? What are the next steps? Is it real time? So there's one side of, write an agent to write Unity code, but okay, I want to be streaming a game in real time.[00:25:38] I want to have characters also being agents. Where do we kinda see this scaling up? Right?[00:25:44] Fan-yun Sun: Yeah, there's definitely a data constraint. The more data, the better. This reasoning model can almost basically act as a human to operate a variety of tools and software to build whatever's necessary.[00:25:57] And then there's a sort [00:26:00] of fidelity constraint, which we're actually solving with another model, which we can talk about later. But it's not as easy to get to photorealism with the approach that we're taking. 
But we think there are better solutions to that, which we can dive into later.[00:26:15] Vibhu: The one thing you note here is it's a diffusion model, right? So there's a few approaches: diffusion, Gaussian splatting. Yeah, so the Rie diffusion model, you guys wanna[00:26:25] Fan-yun Sun: Yeah.[00:26:25] Vibhu: introduce it?[00:26:26] Fan-yun Sun: Yeah, totally.[00:26:26] Rie: Neural Rendering & Skins for Worlds[00:26:26] Fan-yun Sun: So within our world modeling framework, there are two models that we train, right?[00:26:31] There's the multimodal reasoning model that we just talked about, which essentially handles mainly the causality, the persistency, and the logic and determinism of the world. And then Rie is our bet on saying, okay, while that model can take care of all these things that we just talked about, its limitation compared to existing, say, video models is that it doesn't have as high a pixel [00:27:00] fidelity right off the gate, right?[00:27:02] And Rie is to say, hey, we can actually take whatever persistent representation that we generate with our multimodal reasoning model and learn to restyle it into photorealistic styles or arbitrary styles you want. So this model is almost to say, hey, I'm going to respect the persistency and interactivity of the world that you created, but my only job is to make sure that its pixel distribution is close to what we want.[00:27:29] Vibhu: Yeah.[00:27:30] swyx: Great example right there. You kept the KL divergence.[00:27:33] Fan-yun Sun: Oh. Where?[00:27:34] swyx: No, no. I mean, this is a classic, like, how you don't stray too far from the source material; you kept the KL, which is, oh yeah, kind of cool. 
Yeah.[00:27:43] Fan-yun Sun: Yeah.[00:27:44] swyx: I mean, and the[00:27:44] Chris Manning: difference is, and I mean Sun was pointing at this, where we're sort of saying it's in one way a more difficult path, but a better path. Typically the diffusion models are producing the whole scene and it looks lovely, [00:28:00] but there isn't spatial understanding behind it, which is what allows for the real-time graphics gameplay, the spatial intelligence, understanding the consequences in worlds, whereas this is taking a path where it is assuming an abstracted semantic model of the world's state.[00:28:20] And then the diffusion model is being used on top of that to produce the high-quality graphics.[00:28:27] swyx: Is there an intended practical or business use for this, or is it like a demonstration of capabilities?[00:28:34] Fan-yun Sun: We actually believe that this is gonna be the next paradigm of rendering. So it's gonna replace rasterizers, it's gonna replace DLSS today, because it has these pixel priors learned from the world, such that you can literally play any game in photorealistic styles, which is a lot of people's desire when they mod GTA, right?[00:28:51] Vibhu: All the mods, all the people adding perfect lighting and all this.[00:28:54] swyx: So[00:28:54] Fan-yun Sun: skins[00:28:55] swyx: for worlds, let's call it.[00:28:56] Fan-yun Sun: Skins, let's call it skins for worlds.[00:28:58] Vibhu: It's also like, you can call it skins, you can call it [00:29:00] customization. You can play it how you want, right?[00:29:01] Fan-yun Sun: Yeah, exactly. And I think another thing that we pointed out specifically in this blog is the programmability of it, right?[00:29:09] So what this means is that, historically, the render is always a derivative of the game state, right? You're saying, oh, here's the game state, I'm rendering out a frame. 
But here I'm saying this render can actually be part of the gameplay loop. I can say something along the lines of, upon getting 10[00:29:26] apples, my weapon of choice, my bullets, are gonna turn into apples. And that's possible because we can basically dynamically have certain game state trigger the preconditions to the render, such that the rendering is now part of the game loop too. One thing is to just say, okay, it's the appearance.[00:29:47] But the second thing is also to say there are these novel interactions that are possible because this render now actually has priors of the world.[00:29:57] swyx: It is up to the artist to figure out what to do with it.[00:29:59] Fan-yun Sun: It [00:30:00] is up to the creators. Yes.[00:30:01] swyx: Yeah.[00:30:01] Fan-yun Sun: And I also think that's actually another big argument that we're making, and the reason that we're taking the bet we're making, is that a lot of the time, whether it's for embodied AI or gaming, you want a layer where humans can inject their intentions.[00:30:15] So, for example, let's just say in the context of gaming, it's obviously my creative intent, but maybe in the context of embodied AI, it's like, oh, I take this foundational policy and I want to actually fine-tune it to deploy in my house. So you almost want to have a layer where a human can say, oh, here's the distribution of things I want to create to achieve my goal.[00:30:35] And I think 3D graphics as it is today is basically the layer for people to say, hey, what do I care about in this world? And it allows human intent to be expressed in these worlds much more explicitly and distributionally, as opposed to just saying, hey, I'm gonna generate something arbitrary,[00:30:54] and it's just prompts.[00:30:55] swyx: It's one of those things where, I think, you're going to build up a series of models, right? 
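
The "10 apples turn the bullets into apples" example from earlier in this exchange can be sketched as game state triggering preconditions that are handed to the renderer each tick. This is a minimal sketch under stated assumptions, not how Moon Lake's renderer works: `RULES`, `render_conditions`, `tick`, and the hint strings are hypothetical, and a real system would pass the hints to a neural renderer as conditioning, which is not modeled here.

```python
from typing import Callable, Dict, List, Tuple

GameState = Dict[str, int]
# Each rule pairs a precondition on game state with a render hint (both hypothetical).
RenderRule = Tuple[Callable[[GameState], bool], str]

RULES: List[RenderRule] = [
    (lambda s: s.get("apples", 0) >= 10, "bullets look like apples"),
]


def render_conditions(state: GameState, rules: List[RenderRule] = RULES) -> List[str]:
    """Return the render hints whose preconditions hold for this game state."""
    return [hint for precondition, hint in rules if precondition(state)]


def tick(state: GameState, collected_apples: int) -> Tuple[GameState, List[str]]:
    """One game-loop step: update state, then derive what the renderer is told."""
    new_state = dict(state)
    new_state["apples"] = state.get("apples", 0) + collected_apples
    return new_state, render_conditions(new_state)


state: GameState = {"apples": 9}
state, hints = tick(state, collected_apples=1)
print(state, hints)
```

The design point is that the rule table lives with the game logic, so creators can add or change what the render reacts to without retraining anything.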
[00:31:00] This is just one of them; this is probably the highest-utility or highest-frequency one, I dunno what to call this. Where, yeah, you can immediately drop this in on any game and you don't need anything else[00:31:10] that you guys do. But I could see that. I think the human intent is something that people are not even used to, because we're so used to static worlds, or worlds that just don't react, or, I don't know. You're kind of blowing my mind right now. I wonder if you've talked to people at GDC.[00:31:27] And what are they gonna do with it?[00:31:30] Fan-yun Sun: Yeah. Now, the stance that we take on this front is, we're not gonna be more creative than our users.[00:31:35] swyx: Ship it out.[00:31:35] Fan-yun Sun: Yeah. But we wanna make sure that we're building things in a way that really allows them to express their intent.[00:31:41] swyx: The thing that you said about, here's the distribution that I want:[00:31:45] I think text may be too low-bandwidth to really demonstrate that, because I'm probably just gonna want to drop in a bunch of reference assets and then you can figure it out from there.[00:31:58] Vibhu: But you probably wanna do a mixture of [00:32:00] both, right? Like, you throw in a few images: I want this style.[00:32:02] Yeah. I want it to look like this. So it's a mixture, right?[00:32:05] Chris Manning: I think it's a mixture. I mean, yeah, there's clearly a visual component of this, and it's not that everything can be text, 'cause of course you want to give a visual look. But there's also a massive amount of giving the overall picture of the look of the world and the behavior of things that you can express in a few words of text,[00:32:32] and it would be very time-consuming and difficult to do via visual means. 
So I think, yeah, you want a combination of both.[00:32:40] Evaluating World Models[00:32:40] Vibhu: So one question I kind of have is, how do we go about evaluating world models? There's many axes, right? One is, okay, I have preferences; how well do we adhere to prompts? One is the simulation.[00:32:50] One is, is there core logic that's broken? So we know how to evaluate diffusion; there's fidelity, there's [00:33:00] stuff like that. But what are some of the challenges that most people probably aren't thinking about?[00:33:04] Fan-yun Sun: Yeah, I think this is a great question and probably one of the hardest questions in world models, because I think it always comes back to, what are you building this world model for?[00:33:13] And depending on your end goal and purpose, the evaluation should differ. So in the context of games, the most direct way of measuring is how much time people are actually spending in this world that you create. And if your goal is, for example, in the context that we just talked about, deploying an embodied agent, then your end[00:33:33] metric is, okay, after training in these worlds that you generate, how robust is it when you actually deploy to the target environment? But then, it's hard to measure these end metrics. So today people have what I call proxy metrics, which basically try to measure what we really care about, which is the end metrics, but then frankly it's different for every use case.[00:33:57] Vibhu: Yeah, which seems like quite a challenge, right? Like in [00:34:00] language models or video models or image models, your benchmarks are proxies, right? People aren't actually asking instruction-following or tool-use questions. They're proxies of how well it will do downstream. 
But for this, should teams, should companies have their own individual benchmarks outside of games?[00:34:16] If you think of stuff like, okay, video production, movies, stuff like that, that also want to use world models, should they sort of internalize their own proxy? Is this something you guys do? Where does that connect?[00:34:28] Chris Manning: Yeah, I think this whole space is extremely difficult as things are emerging now.[00:34:35] And I mean, it's not only for world models, I think it's for everything, including text-based models, right? 'Cause in the early days it seemed very easy to have good benchmarks, 'cause we could do things like question-answering benchmarks, could you answer the question based on these documents, and various other kinds of pieces of logical reasoning or math.[00:34:58] But again, these are sort of [00:35:00] small component tasks, and there were visual equivalents of things like object recognition, right? These days so much of what people are wanting to do, also with language models, is nothing like that, right? You're wanting to have an interaction with the language model and get some recommendations about which backpack would be best for your trip in Europe next month.[00:35:25] And it's not the same kind of thing, right? And it's not so easy to come up with a benchmark as to, does this large language model give you an effective interaction for guiding you in a good way for shopping, right? And it's the same problem with these world models. So if we take the game design case, well, success is that a game designer can[00:35:57] produce what they are [00:36:00] imagining in a reasonable amount of time. And that's really the kind of macro task. That's a very hard thing to turn into a benchmark, and I think a lot of this is actually going to turn into people voting with their feet. Right? 
I mean, I guess that's what's happening at the large language model level, right?[00:36:23] When people are choosing to use GPT-5 or Gemini or Claude, individuals are trying out these different models and deciding, oh, I like the kind of answers that GPT-5 gives me, or no, I feel like I get more accurate detail from Claude, right?[00:36:43] Vibhu: It's a lot of[00:36:43] Chris Manning: vibe checks, a lot of people just using it.[00:36:45] It's vibe checking. I realize that, but it's actually whether people feel it's giving them utility in what they want. Right.[00:36:52] Vibhu: And the interesting thing there is that a lot of people prefer the visual, right? This looks pretty, which is not the objective of what this is [00:37:00] for, right? If a game designer is working on something, they care about the game engine, right?[00:37:04] The state. It can look like whatever; you can fix that up later. Or you can have a really good game state and you can quickly edit it to 20 different versions, like, keep the state,[00:37:14] Chris Manning: right?[00:37:14] Vibhu: So[00:37:14] Chris Manning: that's a really important distinction, and it speaks to Moon Lake's strength, right? So, yeah, great visuals are lovely to look at for a few seconds, but games are really all about the concept, the gameplay.[00:37:33] And a lot of the time that doesn't actually even require great visuals. I mean, there are just lots of very successful games which have relatively primitive visuals, and there are other games where people have spent millions producing photorealistic visuals, and the game sucks, right? So keeping those two axes apart is really important in thinking about what's important in a [00:38:00] world model for different uses.[00:38:02] swyx: This conversation is reminding me of some game review and fiction discussions I've had in my sort of non-AI-related life. 
Some people might know Brandon Sanderson, who's a very famous fiction author and is a big game reviewer. And he's a big fan of video games where you change one thing about what you might normally assume about the world.[00:38:22] For example, Baba Is You, I don't know if you might have come across that, where the rules change as you play the game. And also games where you can do things like reverse time selectively or change gravity selectively. And I think this also reminds me of other kinds of world models that are created by authors,[00:38:38] where Ted Chiang is my typical example: he'll take the world that you know today but change one thing about it, and then create a consistent world based on that. Which is a long-winded way for me to ask, is it easy to create alternative worlds that don't exist, where you change one thing, and then let's run a whole bunch of people through it to see if it works?[00:38:58] Chris Manning: My first answer will [00:39:00] be that that seems a lot easier and more conceivable to do using technology like Moon Lake's than with some of the other world models out there. Whether Sun can actually make it happen, I'll let him give a second answer.[00:39:15] swyx: I guess for you, you're constrained by the game engine tool, right?[00:39:18] Like at the end of the day, that's the thought partner that you have. If I ask for something where it is never allowed to reverse time, or where gravity only ever works one way, then well, that's it. 
But sometimes gravity might change.[00:39:33] Fan-yun Sun: But it's a lot easier to change with code as opposed to a model that is learned primarily on data of[00:39:42] real worlds and virtual worlds. I guess, for example, Genie, which is actually trained on a lot of real-world data and a lot of virtual gaming data. And it's hard; maybe it's easier to say, okay, I wanna change the visuals or the time period of the world. But you can't change gravity, for [00:40:00] example.[00:40:00] Vibhu: I feel like you can, within bounds, right? Everything comes down to, code is a better way to execute it, but the models aren't that diverse and creative, right? You can say, okay, make gravity slower. It can do that, but it's limited to your representation of how you text it out, right? They're only gonna do a few iterations, whereas programmatically, if there's a game engine under the hood, you can kind of go wild, right?[00:40:22] So, I dunno, one of the limitations of most models is that they're very overtrained to one style. Right. And extracting diversity is pretty difficult. At least that's something we've seen.[00:40:35] Fan-yun Sun: I mean, are there examples you have in mind with existing models, where it would be easier to do without using code?[00:40:43] Certain types of creative intent, or state transitions?[00:40:47] swyx: Clipping. Other world models are very good at clipping through things. My legs clipping through a rock, because it's just bad. [00:41:00] Like, you would have to struggle very hard with your stuff to actually make that happen.[00:41:04] Which I think is maybe a topic that you actually prepared on, Gaussian splatting versus the other stuff.[00:41:09] Vibhu: Yeah. Yeah. It's just, for those not super familiar, right? There's Gaussian splatting, there is diffusion. Like, what works, what scales up. 
I feel like in February, when Sora 1 came out, the blog post was literally titled, like,
[00:41:21] swyx: You bring it up.
[00:41:22] You never know.
[00:41:23] Vibhu: "Video generation models as world simulators." It's super bitter-lesson-pilled. Yeah, a lot of it is emergence, right? So, not to go through their blog post, but basically their whole thing was that as you scale up, all this consistency, all this stuff just kind of gets solved. It's a very simple premise, right?
[00:41:41] They just scaled up diffusion. And from there, this is Feb 2024, it's already been two years, which is basically five years in AI time. How much more do we need to just scale up, or do we hit a data cap? But I think we already talked about this a lot, right? Like, this is back to the beginning discussion of what's [00:42:00] appropriate for the time.
[00:42:01] And that seems like your approach, right?
[00:42:03] Fan-yun Sun: Yeah. The point I'm trying to make is that there are many, many different types of world simulators, and having a world simulator that can produce pixel coherency is very, very useful for games and marketing and all these things, but it's not as useful as people think when it comes to causal reasoning,
[00:42:25] when it comes to embodied AI. Yeah, this title is true. We're not saying that it's not a great world simulator, but actually, in the blog that we wrote, the bet is more that there is going to be a disproportionately large share of value in real-world and virtual tasks where high-resolution pixel fidelity is not needed.
[00:42:47] Yes, video models have their value.
[00:42:50] swyx: Yeah. This is at the absolute limit of my physics understanding, but one example that comes to mind is basically having to solve the equivalent of a three [00:43:00] body problem in a deterministic world, whereas the video models approximate it well enough. Yeah.
[00:43:08] Right.
Like, there's some point at which your approach kind of runs into the "you now have to simulate the world, please, thank you very much" problem. And you're trying to do that, but only to the extent that the game engine lets you, and game engines cannot do some things.
[00:43:23] Fan-yun Sun: Yeah, no, I mean, I think the interesting or more technical question here actually is: where do you draw the boundary between
[00:43:32] what's handled with, let's say, diffusion priors and what's handled with symbolic priors?
[00:43:38] swyx: Yes.
[00:43:38] Fan-yun Sun: Okay.
[00:43:38] swyx: Okay.
[00:43:39] Fan-yun Sun: Right. Let's go there. Because this boundary can actually be fluid. I think maybe what you're trying to get at is, okay, people are saying pixel priors for everything. But what we're saying is, okay, there's a boundary that we draw where we think it provides the most economic value for the domains and things that we care about today.
[00:43:59] [00:44:00] And I actually do think, and it's something that we do internally all the time, which is, okay, given new equations that we learn, or new elements of the world that we learn, or maybe some other knowledge that we acquire in the process of developing the models, should we still be maintaining this line exactly as it is today?
[00:44:22] Or should we move it a little bit left or a little bit right? Right. Like, sometimes we realize that, oh, maybe customers or folks want certain things that are better handled with a diffusion prior as opposed to a symbolic prior.
[00:44:34] swyx: Yeah. Your skin thing is an example of moving it right.
[00:44:37] Yeah.
[00:44:37] Or left. Yeah.
[00:44:37] Fan-yun Sun: Exactly.
[00:44:38] swyx: I dunno what the left-right is.
[00:44:39] Fan-yun Sun: Yeah, yeah, yeah. No, the model.
[00:44:42] swyx: Yes.
[00:44:42] Fan-yun Sun: Actually, we have a few iterations of them.
They're actually at slightly different
[00:44:45] swyx: I know, boundaries. You should do that. That's a cool dimension to show.
[00:44:49] Fan-yun Sun: Yeah.
[00:44:50] swyx: Is quantum mechanics the diffusion prior of our world?
[00:44:55] Right. It's like, that's the boundary of classical mechanics versus quantum, right? Like, that's it. At one [00:45:00] point God plays dice, and at the other point he doesn't.
[00:45:02] Fan-yun Sun: I dunno if, Chris, you wanna say it, but I think generally I feel like physics is better with symbolic priors.
[00:45:08] Chris Manning: Even quantum physics.
[00:45:09] Fan-yun Sun: Even quantum physics.
[00:45:11] swyx: Yeah. This is starting to get into MLST territory, is what I call it, where he likes to get philosophical. We're quite friendly.
[00:45:18] Vibhu: I mean, we need to get to singularity. I heard some of that.
[00:45:23] swyx: No, no, I think that is actually really helpful. And man, I just want you to productize this. Like, as a product guy, I'm just like, oh, also
[00:45:32] Vibhu: a gamer, I
[00:45:33] swyx: wanna... It's like, as a researcher, it's cool.
[00:45:35] Like, this is the theoretical. You have a very good, I don't know, way of thinking about these things, but I just wanna see you express it. I do think, fundamentally, when you open up new tools, like, okay, use human intent and incorporate it into how you render,
[00:45:52] artists are gonna have to take like two to three years to figure out what to do with this. And you just don't know.
[00:45:57] Chris Manning: Right. But I think this [00:46:00] gives a much more approachable and controllable world for society, which is the beauty of NLP, and that will enable it to be adopted and used.
[00:46:10] And we are very hopeful about that. Yeah.
[00:46:13] Fan-yun Sun: Yeah. Yeah.
I mean, we are very focused, actually, on commercialization, in the sense that we do really believe in the data flywheel approach. Yeah. Where we put this in the hands of the creators and the users, and then they will teach us what capabilities our model should improve.
[00:46:27] And that's why we actually have, like, products in beta.
[00:46:31] swyx: Yeah. Focusing on gaming. What's the adjacent thing to gaming?
[00:46:34] Fan-yun Sun: Embodied AI, basically. So maybe I'll start with where we see the platform in three years. Yeah. Which is, okay, the users would tell us what they want to achieve.
[00:46:45] The end goal could be, hey, I wanna make something to teach my kids the value of humility. Or it could be, hey, I wanna fine-tune my drones to be really good at rescue situations. It could be vacuum robots: I want to train [00:47:00] my manipulation or vacuum robot to be very robust to my office, right?
[00:47:04] But it's like, whatever it is, scenario, robust to
[00:47:06] swyx: my office
[00:47:07] Fan-yun Sun: or, like, navigate very robustly in my office. But then, whatever end goal you want, our world model will say, okay, given what you want to achieve, let me generate a distribution of environments such that I can train and evaluate whatever it is you want.
[00:47:24] Yeah. Right. Maybe for the purpose of games, it's just the end simulation, and that's the end product. For certain policies, it's like, I can train it within these environments and then help you see where your policy is failing or not. Yeah. And then, so I think,
[00:47:37] swyx: So in that case, much more of a training tool
[00:47:40] than in other training
[00:47:41] Vibhu: Evaluation? Both, right?
[00:47:43] swyx: Sure. Same thing.
[00:47:43] Fan-yun Sun: Yeah, same thing.
I think it's just this world model that allows people to train any policy that can act in any multimodal environment.
[00:47:51] swyx: Would it be harder to reward hack? Is there an angle here where it is harder to reward hack? I'll just put it generally, because I think that's obviously a key [00:48:00] problem that a lot of people face when training agents in these environments, and I don't know, can you solve it?
[00:48:07] Chris Manning: I think not necessarily. To the extent that there's a misspecified reward, it seems like it could be hacked in a more symbolic world or in a more pixel-based world. I dunno if Sun's got any thoughts, but I don't think that's really being solved.
[00:48:26] swyx: The other thing that comes to mind is you could just build a better Sora as a video generator model, right?
[00:48:31] Because then you would move the diffusion side a bit further to the right, I think, if I got the directionality correct. And that's it.
[00:48:40] Vibhu: It's better on domains, right? Like on consistency over now, or for sure it exists versus something doesn't, right.
[00:48:46] Chris Manning: So
[00:48:46] swyx: Yeah. Yeah. Is
[00:48:49] Vibhu: Is the question more like, like
[00:48:51] swyx: I'm just riffing on, like, what can you build, you know,
[00:48:54] with the stuff that you have? I do think that the mind of the academic does go immediately to training [00:49:00] and evaluation, but art tends to take unusual directions. Like, you might end up,
[00:49:06] Chris Manning: Okay. Yeah. But the question is, can you use this piece of software to develop compelling gameplay? And I don't think you can take Sora and produce compelling gameplay, right?
[00:49:19] If you want to have a world that you can wander around in a bit, you are good.
But what are your abilities to have gameplay mechanics implemented the way you'd like them to be, and to have things stay, with the long-term history of your gameplay influencing future actions? I think there's just nothing there for that.
[00:49:39] swyx: Yeah, I do tend to agree. I'm just trying to sort of test the boundaries. I would also make the observation that as the AAA games industry has developed, the line between what is a movie and what is a game has blurred. And you do end up basically producing a two-hour movie as part of your game.
[00:49:57] Fan-yun Sun: No, honestly, there are so many [00:50:00] applications in adjacent markets that our world model can go into. Yeah. But yeah, it's sort of fun to riff on. Although on the execution side, we need to stay focused on, okay, what are the capabilities we want to unlock over time?
[00:50:11] And there's a roadmap for that. But yeah, if we're just riffing on the possibilities, I feel like, well, it's endless. Yeah, it's like classic
[00:50:18] swyx: And the embeddings for "possibility" and "endless" in my mind are very close. Yeah. I do wanna focus on one, like, weird choice. I don't know if it's weird.
[00:50:28] Maybe I've got something here. Audio, right? You could have just said no audio. And audio in my mind has a lot of recursion, whereas in video you can just do raycasting, and that's computationally much simpler. Audio just seems way harder. I don't know if you wanna comment on just the spatial 3D audio
[00:50:46] problem. Did you really have to do it? I guess you do, to be immersive, but a lot of people do treat it as, well, you just stick a TTS model on top of
[00:50:57] Vibhu: Well, there's a lot more to game audio than [00:51:00] just speech. Right. It's not just
[00:51:01] swyx: TTS. Yeah. TTS,
SFX, BGM, spatial. In my mind, echoes
[00:51:06] Chris Manning: Yeah.
[00:51:06] swyx: and reflections.
[00:51:07] And I don't even know what else. I don't know what other problems there are in this space.
[00:51:13] Fan-yun Sun: Yeah, I think this point is sort of pointing more to the benefits of using a game engine as a tool that's available to the model, right? Because part of the spatial audio comes from the code that is underlying the simulation.
[00:51:32] And we do give our model access to other types of audio models as tools.
[00:51:39] swyx: None of them would be spatial, I think.
[00:51:41] Fan-yun Sun: But that's exactly sort of my point, too. We're giving our model an abstraction, or a suite of tools, such that it's able to achieve that. And you can argue that spatial audio is like an emergence out of the tools and abstractions that we provide to the agents.
[00:51:59] And I think that's the beauty of [00:52:00] this approach. It's kind of like how humans build technology: they're like Lego blocks that build on top of each other. And it's the same thing here. There are gonna be things that just sort of emerge from being able to put these things together in combinatorially interesting ways.
[00:52:14] Chris Manning: Right.
[00:52:15] So this integrated audio model exploits the understanding and semantics of the Moon Lake world, right? Whereas in general, for the gen AI video models, there's no actual integration across to audio at all, right? Someone might stick some music or a soundscape or whatever else on top of their video,
[00:52:44] so it's not a silent video, but they're in no way connected into a consistent world model. And there's nothing that says, okay, an action is happening in the video.
Therefore there should be a sound that's [00:53:00] coming from this part of the visual field.
[00:53:03] swyx: Yeah.
[00:53:03] Vibhu: Is that different than Sora 2? Does it not have audio?
[00:53:06] Not to say it's not, like,
[00:53:08] swyx: amazing
[00:53:08] Vibhu: isn't a spatial
[00:53:09] swyx: audio.
[00:53:09] Vibhu: It doesn't?
[00:53:10] swyx: No. I've played around with it enough. It just sounds like someone put an ElevenLabs voice on top of it and just tried to do the lip sync.
[00:53:18] Vibhu: Oh, yeah. I've seen, okay, generate a dog at the beach, and reactions to a big wave, and move
[00:53:23] swyx: around.
[00:53:23] It's definitely like, have the dog move away from the camera and see if the sound goes down. It doesn't. 'Cause they don't have spatial audio.
[00:53:32] Fan-yun Sun: We do want to basically, like, our world model, the one we're training, is basically working towards the goal of having a combined latent representation across all these different modalities,
[00:53:42] right? Such that it can reason across these different modalities. So for example, if I close my eyes and you play a sound of a car skidding away from me, I can almost visually extrapolate that trajectory in my mind. And I think that's the type of capability we want our model to be able to reason with, right?
[00:53:59] And that's the reason that [00:54:00] we're taking this multimodal reasoning approach. It's like, we want this combined latent space that can
[00:54:05] swyx: Yeah. Oh, you said latent space. We like that. Here we have to play the bell every time that someone says latent space. No, you gotta train a Daredevil one, where it's only audio, but you have to work out
[00:54:15] where everything is.
I do think that we have like a couple of, Chris Madden questions on, on IR and, just any, any other sort of attention topics or n NLP topics.[00:54:31] Vibhu: Okay.[00:54:31] swyx: Go ahead.[00:54:32] Chris Manning's Journey: From NLP to World Models[00:54:32] Vibhu: Well, no, I mean, yeah, it's just fun. We talked a bit about how you guys met, but you basically, you, you were like the godfather of NLP per se, right?[00:54:39] You spent the whole career from early embeddings, early early attention. You did 2015 attention for machine translation, everything. You, you had information retrieval, so RAG before rag, we just wanna shout that out and admire a lot of that. Right? So what prompted the switch over to world models?[00:54:56] How, how'd all that come about?[00:54:58] Chris Manning: To some answer it [00:55:00] is, the enthusiasms and creativity of students, but there's a bit of a history there, right? So, yeah. So clearly most of my career has been doing stuff with language and how I got into research was thinking, ah, this is just so amazing how humans can produce speech and understand each other in real time.[00:55:21] And somehow they managed to learn languages from their kids. How could this possibly happen? And so, yeah, starting off I was very focused on language, but as it sort of got into the 2000 and tens, I started, going, I'd been working on question answering, and then I started to get, interest in visual question answering.[00:55:42] And that was an area where it was very noticeable. That the visual understanding was bad. Right. These were the days when like, it sort of seemed like there's almost no visual [00:56:00] understanding. You were just getting answers that came from priors. So, if you asked how many people are sitting at the table, it'd always answer two regardless of how many, how many people you could see in the picture.[00:56:11] And so it seemed like, oh, these models actually aren't able to get semantic information outta
“Fiction has this unprecedented power in tech spaces. The more I started talking to engineers about their technical problems, the more I realized there’s so much more that humanities could offer.” –Nina Begus
About Nina Begus
Nina Begus is a researcher at the University of California, Berkeley, leading a research group on artificial humanities, and the founder of InterpretAI. She is author of Artificial Humanities: A Fictional Perspective on Language in AI, which received an Artificiality Institute Award, and First Encounters with AI.
Website: ninabegus.com
LinkedIn Profile: Nina Begus
Book: Artificial Humanities
What you will learn
* How ancient myths and archetypes influence our understanding and design of AI
* Why the humanities—literature, philosophy, and the arts—are crucial for developing more thoughtful and innovative AI systems
* The dangers of limiting AI concepts to human-centered metaphors and the need for new, more expansive imaginaries
* How metaphors shape our interactions with AI products and the user experiences companies choose to enable
* The challenges and possibilities of imagining forms of machine intelligence and language beyond human templates
* Why collaboration between technical experts and humanists opens new frontiers for creativity and responsible technology
* What makes writing and artistic creation uniquely human, and how AI amplifies—not replaces—these impulses
* Practical ways artists, engineers, and thinkers can work together to explore new relationships and futures with AI
Episode Resources
Transcript
Ross Dawson: Nina, it is wonderful to have you on the show. Nina Begus: Thank you for having me. Ross Dawson: You’ve written this very interesting book, Artificial Humanities, and I think there’s a lot to dig into. But what does that mean? What do you mean by artificial humanities?
Nina Begus: Well, this was really a new framework that I’ve developed while I was working on the relationship between AI and fiction, and I started working on this about 15 years ago when I realized that fiction has this unprecedented power in tech spaces. So this is how it all started, but then the more I started talking to engineers about their technical problems, the more I realized there’s so much more that humanities could offer in this collaborative, generative approach that I’ve developed. I would say that now, as the field stands, it’s really a way to explore and demonstrate how humanities—as broad as science and technology studies, literary studies, film, philosophy, rhetoric, history of technology—how all of these fields can help us address the most pressing issues in AI development and use. And it’s been important to me that this approach uses traditional humanistic methods, theory, conceptual work, history, ethical approaches, but also that it’s collaborative and exploratory and experimental in this way that you can look back into the past and at the present to make a more informed choice about the future. You can speculate about different possibilities with it. Ross Dawson: Well, art is an expression of the human psyche, or even more, it is the fullest expression of humanity, and that’s what art tries to do. Also, I’m a deep believer in archetypes, human archetypes, and things which are intrinsic to who we are, and that’s something which you can only really uncover through the arts. Now we have arguably seen all these archetypes play out in real time, these modern myths being created right now in the stories being told of how AI is being created. So I think it’s extraordinarily relevant to look back at how we have depicted machines through our history and our relationship to them. 
Nina Begus: Yes, this is the reason why I started exploring this topic, actually, because there were so many ancient myths, these archetypal narratives that I’ve seen at the same time, both in technological products that were coming to the market and in the way technologists were thinking about it, and also in fictional products and films and novels in the way we imagined AI. I framed my book around the Pygmalion myth, but there are many, many other myths—Prometheus, Narcissus, the Big Brother narrative, and so on—that are very much doing work in the AI space. The reason why I chose the Pygmalion myth is because it’s so bizarre in many ways: you have this myth where a man creates an artificial woman, and then in the process of creation, falls in love with her. So there’s the creation of the human-like, and there’s also this relationality with the human-like. You would think this would not be a common myth, but quite the opposite—I found it everywhere I looked. It wasn’t called the Pygmalion myth, but the motif was there. I found it on the Silk Road, in ancient folk tales, in Native American folk tales, North Africa, and so on. So I think this kind of story is actually telling us a lot about how humans are not rational, how we have some very deeply embedded behaviors in us, and one of them is that we anthropomorphize everything, including machines. So I think this was a really important takeaway that we got already from the early days of AI with the first chatbot, Eliza. We’ve learned that that will be a feature of us relating to machines. Ross Dawson: So Joseph Campbell called the hero’s journey the monomyth, as in, there is a single myth. And I guess what you are doing here is—well, if you agree with that, which I’d be interested in—is that there are facets. The classic hero’s journey is quite simple, but there are facets of that monomyth, or something intrinsic to who we are, that is around this creation.
And in this case, as you say, this relation we have with what we have created. Would you relate that at all to Joseph Campbell’s work? Nina Begus: I haven’t thought about it in this way, because I thought about myth and myths more and less of a storytelling issue, which here is definitely happening—the hero goes on a task, returns back changed, and maybe changes something in the community. The myths that I was looking into and the metaphors that I was exploring, primarily this huge metaphor of AI as a human mind, as an artificial reason—I think it works differently. It’s less of a narrative; it’s more of an imaginary of how or towards what we are building. I think this is a big problem, actually, because the imaginary around AI is very poor. What you get is mostly imagining machine intelligence on human terms, and a lot of people are bothered by that in the AI discourse—right, when you say the machine thinks, or the machine learns, or it has a mind, and some people go as far as to say it has consciousness. I think this kind of debate is actually not that productive. I think it’s more important to see how all these different AI products that we’ve created—and mostly when we talk about AI, people think of language models now—are very much designed as a sort of character, almost as an artificial human that, in literature, authors have been creating for a long time. So I think in that case, we can get back to a hero’s journey. But I think what I was looking at was actually more on the surface level of what kind of shortcuts we are using with these metaphors that we’re employing when building and using AI. I think the book makes a really good case showing that, yes, this is actually a very cultural technology. It’s very much informed by our imaginaries. One surprising part of it was really how hard it was to break out of this human mold. It was pretty much impossible to find examples of machines that are not exclusively human-like. 
I think Stanislaw Lem is one of the rare writers who can consistently deliver this kind of imaginary. Even looking at more recent works, like popular films such as Hollywood’s Ex Machina or Her, you can see how the technologists themselves would say, “Oh, we were influenced by this film,” in a way that it affirmed their product development trajectory. You can see it now, at this moment, with OpenAI launching companionship. So in many ways, not a lot has changed. Ross Dawson: Yeah, there’s a lot to dig into there. I just want to go back—in a sense, Pygmalion is a metaphor, but it’s also a myth. It is a story: creates a woman, and then falls in love with her, and then whatever happens from there. There is this, something happens, and then something else happens. That’s what a story is. I think that can impact the implicit metaphor, but coming back to the metaphor—so George Lakoff wrote the beautiful book Metaphors We Live By. I think the way the brain works is in metaphors and analogies to a very large degree. Some of those are enabling metaphors, and some of those are not very useful metaphors. I think part of your point is that some of the metaphors that we have for thinking about AI and machines are not useful. There may be, or we could create, some metaphors that are more useful. So, what are some of the most disabling metaphors, and what are some of the ones which could be more constructive? Nina Begus: Yes, so I think this main metaphor that I’ve mentioned—of AI as a human mind—is very limiting. I think it really limits the machinic potential to actually do something good with it. The fact that we’re still using the criteria that were made for humans, like different criteria developed on human language—the Turing test was one of them, right, a while ago. Now we have stricter ones.
I think this tells you a lot about how we actually evaluate AI and how even these benchmarks that are supposed to be quantitative are actually often qualitative, often stories, like mini-narratives. But yeah, when we look at different metaphors in this space, there are other ones that also emerge from fiction. I mentioned the Big Brother, the AI as an Oracle, and we need to be aware that these ideas inform the very interaction we have with AI. If we think of it as a mirror, we’re going to use it differently—it’s almost as a bouncing board. If we think of it as a teacher, or as a coach, or as an assistant, it would again create a different use. So I think there are a lot of these metaphors that the companies themselves are trying to decide which one they will go with, because it completely changes the user and the interaction. I think they’re also very cultural, even though you might say, “Oh, it’s a categorical mistake to treat a machine as a human.” I think you can see this kind of treatment across, at least in part, and it doesn’t mean that we consider it human. It just means that we’re engaging with it on our own terms, as if it was human. Now, what could be productive? I do think metaphors, even if they’re not accurate, can be productive. My goal, really, with the book was to break out of this projection of what the machine could be, to find in this exploratory way other directions, other landscapes where we couldn’t go because we’re being limited by our imaginary, by our ideas. So in this way, I think humanistic approaches can be very helpful to designers, to technology builders, to artists, to explore the novelty that so many of these sectors are after. Ross Dawson: Yeah, and I guess people latch on to what they know. I think that’s part of the thing where with AI, “Oh, it’s like a human. Let’s treat it like a human, and let’s make it like a human.” It is, amongst other things, a lack of imagination. 
That’s where the humanities, the arts, can offer us—those who have the imagination to be able to envisage different possibilities or relationships. But I guess part of it is also that humans relate, and so we have learned to relate to other humans and also to other animals and hopefully to nature as well. But these are all established patterns of relating. So do we need to discover in ourselves new ways of relating to new categories—things which are not humans, not animals, and not nature? Nina Begus: Exactly, this is the exact problem we’re dealing with, and because we’re dealing with a yet unexplored, yet undefined relation, and we’re using old, outdated terms for that relation. This is why we don’t really have a good way of describing it and establishing it. It will take a while for this to develop, which is fine, but we need to realize that there are some concepts that we’re using that we better leave behind and go ahead by building new ones. This is why I think it’s really important to work in a more interdisciplinary collaboration, so that you can see what you can actually build from the technical perspective, so that you can see what these machines are actually capable of. Because you usually don’t know when you create them, right? Machine learning is sort of exploratory by design. Ross Dawson: So, just to call it out more explicitly, what are the metaphors you think are the most destructive or most inappropriate, and what are some of the ones which you think are the most promising? Nina Begus: Well, I’m just writing on the Midas myth, which is sort of the opposite of the Pygmalion myth. With Pygmalion, you lean into that human imitation, but with Midas, you lean into the liminality that Midas presents as this sort of hybrid creature.
I think leaning into the boundaries that we draw for ourselves—and now AI is not cooperating with them—this is where the productive part will be in actually creating something that has philosophical dignity, but also a kind of productive trajectory for the machines to go. I feel like we’re still in this first phase of developing AI, because when you look at it historically, we haven’t really moved from the conceptual and philosophical premises that were established in the 1940s, 50s, and 60s for this technology. We have now gotten the technology that caught up to the ideas from the 60s, but we’re still stuck in the same conceptual space. Ross Dawson: Yeah, very much so. And, you know, of course, what is AGI, which everyone talks about, is basically—the only way in which people seem to be able to frame it is as relative to humans, which is the only reference point we have. I mean, there’s, of course, animal intelligence, but that’s because of that. It is, again, that lack of imagination—saying, “Well, intelligence, oh, intelligence is what humans do, so let’s do something which is the same as that,” whereas there’s so much white space in what intelligence could be. I think this almost comes back to definition. When people say intelligence, the word, when they use the word intelligence, they are referring to what humans do. It’s not a general term, and so it all becomes a language problem as well, because we are so rooted to relating our language to human capabilities, as opposed to a more general potential. Nina Begus: Yes, I think you’re really on to something here, because I can see it also—because I work with animal communication researchers, and we’re finding things there that we didn’t find because we limited ourselves to thinking language is just a human production, that it needs a human subject. Now, as soon as we got rid of this presumption, we’re finding new things, things that are basically parallel to what we do in our language. 
So language is in a space of tension because it’s being attacked both from the animal side and from the machinic side, which is why I really focused on language in this book. It’s not a coincidence that we centered artificial intelligence in language as the interface, because this is how we relate to the world; this is our interface to talk to each other, to understand each other. I think the fact that language is coming under such pressure as an interface brings with it a lot of other concepts that are being challenged. Are only humans creative? Is there a natural creativity, machinic creativity? Is there a different kind of intelligence that’s maybe solely biological, embodied? How do we think about cognition? How do we think about culture? In AI and in the natural world, there’s so much that comes with it: agency, autonomy, freedom, community, which I think we will be grappling with for the next few decades, at least. Ross Dawson: I think you alluded before to the potential for AI to have its own languages. Nina Begus: It’s happening already. The reason why I like Stanislaw Lem so much is because he can actually think about a machine (back in the 1970s, he’s doing that), a machine that’s not human-like, that’s not limited to human language. It is trained on human language, but then it goes its own way, where the human linguistic ceiling just cannot go anymore. We’re already seeing that in the models, in Berkeley’s Biological Artificial Intelligence Lab, in the models that are not large language models, but generative adversarial networks that are based on speech. We see that as they are learning the words, they are encoding some information into silences that we don’t know what it is. I think what’s really exciting to me are two things about language in machines. The first one is, what is this non-human production of language? 
We did not think that non-humans can produce language, even though we had parrots who had to crawl their way to us to speak in “humanese,” to show that they have some kind of intelligence—even if it’s just parroting, even if it’s just what we call imitation, which some people consider not to be intelligence. We’ve had these examples before, but now it’s gotten nuclear—on this scale that LLMs are performing, it’s really challenged a lot of our solely human attributes: creativity, storytelling. A lot of journalists come to me because there’s this existential fear of machines taking over their work and so on. So we’ve been thinking about those things, and now it’s actually happening. Ross Dawson: One of the other key points here, I think, is that humanity is—the arts—there’s so much, as you mentioned, in terms of fiction, in terms of films, in terms of visual arts, and many other artistic domains. We have reference points that we use, and the amount which people refer to the movie Her in the last years is pretty extraordinary, partly because it’s obviously coming very much true. I think the Ex Machina story is very interesting as well, as are many others in the past. But there is also this act of imagination. There are people who have written these books, who have crafted these films, who have created these things, and they are the ones who have been not just manifesting our human psyche, but also pushing that out and coming up with ideas which others haven’t had, to give us something. So one thing we can certainly do is mine and dig into what has been created. But is there a way to interface through this to this act of imagining, which can give us new artifacts and ways of thinking and ways of relating? Nina Begus: Yes, I think imagination and humanities in general are going to become more and more important, because AI will do a lot of technical work, but imaginaries—this is what we really excel at. 
It’s actually interesting to see how you think fiction is this unbounded landscape where you can imagine anything, and yet it’s really hard to find examples of machines that are beyond the human. Even these writers, like the screenwriters for Her and Ex Machina, create these completely Pygmalion-esque films, where you have an artificial woman leading a relationship with a human man, and so on. For the whole film, you have her act as a human-like entity. But then at the end of each of those films—well, particularly in Her—Spike Jonze really tried to break out of this and show her AI side. Basically, there was no language to describe it, so he resorted to a metaphor—the metaphor of a book, where Samantha, the operations assistant, explains that her world is falling apart, like the way words are floating further and further apart in a book. That’s how she’s able to describe it; that’s the closest she gets. And then in Ex Machina, Alex Garland really wanted to portray the world from the social robot Ava’s perspective in a visual way. He wrote down a scene, but he said, “I failed to execute it visually. I just couldn’t do it well.” So instead, he gave us a different scene that’s shot from afar, where Ava embarks onto a helicopter and she has to undergo her Turing test—the helicopter pilot cannot recognize her as a robot; he needs to think she’s a human woman. There have been attempts, I think even in Garland’s next film Annihilation, they’re trying to set the grounds for something that’s entirely new and hard to imagine. I think a big takeaway for us is this is very hard to do. Ross Dawson: Yes, well, given that context, I do want to—as in the human plus AI framing—given all of this, what is it that we can do or should be doing in order to amplify our humanity, our capabilities, the positive aspects of what it is to be human? How can we relate to or use AI in order to amplify the best of us? 
Nina Begus: Yeah, I actually had, while I was writing the book Artificial Humanities, this other dream project to work with writers—professional writers, creatives, people who live in a world of words—to see what they make of AI. I waited a little bit for the public’s polarized reactions to calm down a bit and gathered 16 writers, some of whom already made a space for themselves in the field, like Sheila Heti and Ken Liu and Ted Chiang, and then some of the more junior writers who I knew were thinking about that—a Netflix screenwriter, and so on. I gathered them to see—I think the creative people are really the answer here—I gathered them to see how they approach this very human part of the new human and AI collaboration zone. What was common across a lot of essays that are coming out in October under the title “First Encounters with AI” is this argument that, well, AI doesn’t have subjectivity, it doesn’t have emotions, it doesn’t have a body, it doesn’t have experience, it doesn’t have meaning—all of these things that really make us human, all of these parts that actually make art compelling and literature compelling. So Ken Liu’s argument, for example, was, let’s leave machines what they’re good at—they’re good at imitating and copying—and we’re good at interpreting, we’re good at creating and imagining. I think this is really a way to go with this. This catastrophizing that’s very present in the public discourse, I think, is a bit misleading. I wish we had a more nuanced approach to what’s actually happening, particularly in the space of writing. Obviously, AI is a groundbreaking technology that affects pretty much every one of us and all the sectors, but when it comes to writing, we just don’t think it’s killable. We think that there’s this perennial impulse that humans have to play with language, and that is not going to go away with AI. We’re just going to amplify it through AI, through this new possibility that has now opened in many ways. 
I like to think about AI as—you know, we’ve figured out how to fly. As soon as we figured out the physics of flight, we had planes and helicopters and drones and kites, and these are the new possibilities for human activities. In the same way, we figured out the machine learning principles, and now we have large language models and diffusion models, and we have GANs and so on, and there will be more. These are the new spaces of possibility that have opened for our activities, for our spirit to work on, but they do not replace the human in a meaningful way. It’s more about extension than it is about automation. Ross Dawson: Yeah, that’s a wonderful way of framing it. So where can people go to find out more about your work? Nina Begus: I have a pretty populated website with my name, ninabegus.com, where I write about my books, I write about my public work. I have videos on there, podcasts, links, and so on. I also have a pretty lively lab with a lot of collaborators and students, where a lot of what I imagined when writing Artificial Humanities—where a lot of collaborative projects happen. We have artists, we have engineers, we have philosophers that work on the same question, but come at it from very different backgrounds and with very different skills. I think this is becoming more and more important in the world of AI. Ross Dawson: Yes, yes, bringing all of those disciplines and frames and thinking together. That’s wonderful. I love what you’re doing—very important. I hope the messages ripple through, and obviously wonderful to be able to share this with the Humans Plus AI audience. Thank you so much. Nina Begus: Thank you, Ross, and thank you all for listening. The post Nina Begus on artificial humanities, AI archetypes, limiting and productive metaphors, and human extension (AC Ep38) appeared first on Humans + AI.
Original Recording Date: March 26, 2026
Ella Gans is our latest guest as we spend an hour talking about nerdy stuff!
Ella's YouTube Channel: https://www.youtube.com/@ellagva
My YouTube Channel: https://www.youtube.com/@robertjackson6644
Bernd Bender, Dharma talk on March 22, 2026, Zen Day at Akazienzendo, Berlin. Bernd connects the koan "Nánquán's Peony as in a Dream" (e.g. Book of Serenity, Case 91, see below) with the practice approach of the Yogacara school ("Mind Only"), as developed by Vasubandhu, among others, in 4th-century India. According to this approach, consciousness splits self and world apart. The "outside" then appears to us as separate from ourselves, while we stand facing it. The koan features the image of a goose that grew up trapped in a bottle. It sees the outside world only through milky glass. By becoming familiar with this splitting consciousness, we can experience that our separateness from the world is not a truth but a construction. Waking up to the construction means that it loses its power. Instead of a division in two, we can then see that self and world do not exist independently of each other, but in a relationship in which each ceaselessly brings forth the other.
Nánquán's "Peony"
Case
The official Lu Geng said to Nánquán: "Teaching Master Zhao was truly extraordinary. He was able to say: 'Heaven and earth have the same root; the ten thousand things are one single body.'" Nánquán pointed to a peony in the garden and said: "People of today see this flower as if in a dream."
Commentary
Lu Geng of the Tang dynasty was a member of the supreme court. Once he asked Nánquán: "I raised a goose in a bottle. Gradually it grew too big to get out again. How can one get it out without damaging the bottle or injuring the goose?" Nánquán called: "Sir!" Lu Geng answered: "Yes?" Nánquán said: "It is out." At these words Lu Geng awakened.
On Saturday, March 28, Halrika Breytenbach, communications consultant, freelance journalist, writer, and founder of the music ministry Halrika Music Ministries, is the guest on Kopskuif. Halrika talks about her debut children's book, "Gertruida Gans leer 'n les" ("Gertruida Goose Learns a Lesson"), which has just been published by Naledi Uitgewers. She also talks about her love of words and their power to build people up. She shares thoughts on her music ministry and talks about her performance at KKNK.
Title: Return to Silent Hill [Wikipedia] [IMDb] Director: Christophe Gans Producers: Victor Hadida, Molly Hassel, David M. Wulf Writers: Christophe Gans, Sandra Vo-Anh, Will Schneider; Konami (original game) Stars: Jeremy Irvine, Hannah Emily Anderson Release date: January 23, 2026 (US) PROMO: Ninety For Chill: The Podcast with CatBusRuss SHOWNOTES: In Collateral Cinema's first At the Movies review (and first Collateral Gaming collab) of the year, we look at Christophe Gans' film adaptation of Silent Hill 2, Return to Silent Hill. Being fans of both the source material and Gans' previous Silent Hill movie (which we covered in an earlier season), we were highly anticipating this, but early reviews were... disappointing. So what did we think? Find out now, and stay tuned for our episodes on Night Trap and Vandal Hearts! Collateral Gaming is on Bluesky, Facebook, Instagram, Threads, and Twitter, and is on Goodpods, Apple Podcasts, Spotify, Podbean, Google Podcasts, YouTube, iHeart, and wherever else you get your podcasts! Also, check out Collateral Let's Play! on our YouTube channel. Collateral Gaming is happy to announce that we are now partnered with Dubby Energy! Use our promo code CGAMINGPOD to get 10% off your first purchase of Dubby Energy drinks on their website: dubby.gg/discount/CGAMINGPOD (Collateral Gaming is a Collateral Media Podcast. Intro song is a license-free beat from Purple Planet Music. All music and movie clips are owned by their respective creators and are used for educational purposes only. Please don't sue us; we're poor!)
This episode had a bit of everything again: snow chaos, being sick, big worries, and Rudi! ❄️ Winter & cabin fever: No sooner were we both halfway healthy again than winter came back, with storms, masses of snow, and a Narnia feeling. The animals took it in stride (Shelter? Don't need one
Title: Return to Silent Hill [Wikipedia] [IMDb] Director: Christophe Gans Producers: Victor Hadida, Molly Hassel, David M. Wulf Writers: Christophe Gans, Sandra Vo-Anh, Will Schneider; Konami (original game) Stars: Jeremy Irvine, Hannah Emily Anderson Release date: January 23, 2026 (US) PROMO: Ninety For Chill: The Podcast with CatBusRuss SHOWNOTES: In our first At the Movies review (and Collateral Gaming collab) of the year, we look at Christophe Gans' film adaptation of Silent Hill 2, Return to Silent Hill. Being fans of both the source material and Gans' previous Silent Hill movie (which we covered in an earlier season), we were highly anticipating this, but early reviews were... disappointing. So what did we think? Find out now, and stay tuned for our two-part episode on the Lord of the Rings trilogy very soon! Collateral Cinema is happy to announce that we are now partnered with Dubby Energy! Use our promo code CCINEMAPOD to get 10% off your first purchase of Dubby Energy drinks on their website: https://dubby.gg/discount/CCINEMAPOD… (Collateral Cinema is a Collateral Media Podcast. Intro song is a license-free beat from Purple Planet Music. All music and movie clips are owned by their respective creators and are used for educational purposes only. Please don't sue us; we're poor!)
On the 101st episode of Bomb Squad Matinee, Joe V, Tim, and Tanner discuss Christophe Gans' Return to Silent Hill. Is the film a worthy successor to Gans' first Silent Hill film, or is it a failure of an adaptation to Silent Hill 2? Does the cast effectively portray the characters from the game? What was up with that beard? Tune in to find out!
After just two older and one brand new entry, we've come to the end of the franchise already. How do you successfully adapt these games? What did Gans learn in his 20 years in between entries? Best rendition of Pyramid Head? .... and what franchise is next!
What makes tinnitus distressing for some people, but barely noticeable for others? In this in-depth conversation, clinical psychologist and tinnitus researcher Dr. Jennifer Gans explains why tinnitus is best understood not simply as a sound, but as a brain-driven experience. Drawing on neuroscience, clinical experience, and mindfulness-based research, she explores how the brain's response, rather than the sound itself, plays a central role in tinnitus distress, and how that response can change over time. Dr. Gans discusses why accurate education is foundational to effective tinnitus care, how anxiety and stress amplify tinnitus distress, and why habituation is a natural process, not something patients need to force. She also shares insights from her work with thousands of tinnitus patients and introduces her new weekly column at Hearing Health & Technology Matters (HHTM), "Tinnitus Education Corner," focused on evidence-based education and practical guidance. This conversation is designed for clinicians, researchers, and individuals living with tinnitus who want a clearer, more grounded framework for understanding, and reducing, the impact of tinnitus in daily life.
Check out Dr. Gans' weekly column at: https://hearinghealthmatters.org/tinnitus-education-corner
Learn more about Dr. Gans and her work at: https://mindfultinnitusrelief.com/
Be sure to subscribe to our channel for the latest episodes each week and follow This Week in Hearing on LinkedIn, Instagram and X.
- https://x.com/WeekinHearing
- https://www.instagram.com/thisweekinhearing/
- https://www.linkedin.com/company/this-week-in-hearing
Visit us at: https://hearinghealthmatters.org/thisweek/
Duration: 00:59:32 - Mauvais genres - by: François Angelier - 20 years after his first screen adaptation of the video game Silent Hill, Christophe Gans returns to the cursed town. - directed by: Laurent Paulré - guests: Christophe Gans, director
On this episode of NOW SLAYING, Colton & Rowan flick on their fog lights and RETURN TO SILENT HILL! After 20 years, does Gans still have it? Was this a loving adaptation of such cherished source material? Or do the guys wish they simply never returned at all? Tune in to find out if we gave this film a NAY, OKAY, YAY, or SLAY!
CHAPTERS:
Theme/Intro (00:00:00)
Trailer (00:09:35)
Synopsis (00:10:05)
Review (SPOILER FREE) (00:10:22)
Review (SPOILERS) (00:26:03)
Rating (00:57:38)
Promotions/Outro (01:00:25)
Follow us on all social media: Facebook, Twitter, Instagram, Tumblr, YouTube, TikTok, Slasher, Threads, Bluesky
Intro Music by Rowan Fraser (IG: @biggiehauls)
In this episode, Ingo meets a woman who knows when a Japanese knife has burnout. Jana is a police detective and a series favorite. A conversation about North German secrets, Berlin down-to-earthness, and the art of shuttling between goose and violent crime. Jana reveals why breaks in filming are sometimes tougher than murder cases, and shows what happens when Ingo fights for a point, and almost for his dignity.
A little boy, a nanny, and a botched kidnapping: from these elements, Zhang Yueran assembles a portrait of Chinese society in her novel "Schwanentage". And then there is that goose.
More episodes:
Literature: "Im Leben nebenan" by Anne Sauer
Literature: "Minihorror" by Barbi Marković
Literature: "Stolz und Vorurteil" by Jane Austen
Emily Pilbeam presents volume 2 of her 2025 mixtape highlights from BBC Introducing, with Chalk, DJ Subaru feat. Chopper Johnson, Monks, Nightbus, Adult DVD, Goodnight Louisa, Divorce, Humour, GANS, Welly, Shale, jasmine.4.t, Natalie Wildgoose, and The Orchestra (For Now).Produced in Salford by BBC Audio for BBC Radio 6 Music.
Today with Dr. Hart and Dr. Zart: a mindful goose
In 2A Tuesday, Brian Gans, CEO of Byrna, discusses non-lethal firearms designed for personal protection. Gans shares the story behind developing Byrna, emphasizing situations where carrying a lethal firearm may not be ideal. He explains how these CO₂-powered launchers shoot chemical irritant projectiles that temporarily incapacitate an assailant without causing permanent harm. The segment covers the different models available, legal restrictions in certain states, and how non-lethal options can complement traditional firearms for safety. The conversation also touches on California's restrictions on ammunition sales for these devices and the broader debate on personal defense.
"Denkt Euch, ich habe das Christkind gesehen …" oder lieber "Tiefgefroren in der Truhe, liegt die Gans aus Dänemark …" - Kinder haben sich mit alten und neuen Weihnachtsgedichten beschäftigt. Vergnügt, besinnlich, überraschend und festlich, einfach weihnachtlich. Zusammengestellt von Karin Hahn www.kakadu.de, Kakadu
Ladies and gentlemen, howdy & aloha! In this episode of Airey Bros Radio, we're lacing them up and heading down to Columbia, Missouri for a deep dive into Missouri Tigers Cross Country and a full preview of the 2025 NCAA Cross Country Championships at Gans Creek with Mizzou Head Cross Country & Distance Coach Kyle Levermore. Coach Levermore is a North Jersey native who starred at Don Bosco Prep, battled with Christian Brothers Academy at Holmdel Park, and went on to run at Oregon and Arkansas before jumping into coaching at Georgetown. With a background in sports industry management, sports marketing, and now working on his MBA, Kyle is helping turn Mizzou XC into a Top-25 NCAA program while the Tigers get ready to host nationals on their home course at Gans Creek. We cover:
During the day, Fanny goes to a forest kindergarten. At night in bed, she dreams she is a little fox. Today she rescues a white goose that can no longer find its way home. Or can it?
This is a very in-depth conversation with Ben Gans' Crew Chief, Avery Tompkins, and their thoughts and suggestions for crewing, not only a 200-mile event, but the Triple Crown of 200s. Thank you both for sharing so much! Race/Crew Planning Templates - https://docs.google.com/spreadsheets/d/1p3tmFVMpZ9iPp2FOIGoSnLt2f_97ei9ErEcl6FHLChU/edit?usp=sharing Avery mentioned the Black Diamond Moji R+ light. That can be found here - https://blackdiamondequipment.com/products/moji-r-rechargeable-lantern To connect with Avery - @prodessoravery on Instagram To connect with Ben - @dr_bentendo on Instagram Aaron's information: My Socials, Channels, & Newsletter: https://www.facebook.com/MRRUNNINGPAINSCOACHING https://www.instagram.com/runningislifecoaching/ https://www.youtube.com/channel/UCQ6J512qA34z_N0KJSU4jfw https://www.strava.com/athletes/18431982 Email - coachsaft@gmail.com Thanks to all of you for listening! Please share the Podcast and please leave a review, rate, & subscribe if you haven't done so already! THANK YOU! Aaron Saft Running Is Life Coaching & Podcast
In this long-anticipated episode, host Peter Bauman (Le Random's editor in chief) speaks with one of the most exciting duos in contemporary digital culture, Ann Hirsch and Maya Man. They cover their collaborative projects, Ugly Bitches and Little Darlings, which explore online gender performativity. We discuss the works in relation to the so-called "vibe shift" of the 2020s. The artists also discuss how their work, often using GANs and other AI technologies, counteracts the "girl boss" rhetoric of early 2020s NFT projects by presenting a more flawed, nuanced, and sincere depiction of both femininity and masculinity. They detail how UB uses intentionally distorted AI dolls to comment on female failure, while LD employs shinier AI imagery to critique the "hustle grind gain success" male influencer culture. Finally, the conversation touches upon their admiration for, and points of departure from, the "Gay NFT" or Avant Schizocollage scene, with the artists expressing an interest in "ironic sincerity" in their work.
Monday's Editorial with Jess Tucker: https://www.lerandom.art/editorial/jess-tucker-on-longing-for-a-face
Chapters
"Gibt's keine vegane Gans?" - "Ne vegane Gans? Die darf doch gar nicht mehr so heißen!" Von René Steinberg.
Another big episode with lots of show notes! We start with my interview of Dr Ben Gans and his amazing journey through the Triple Crown of 200s (Tahoe 200, Bigfoot 200, & Moab 240). We go deep into the challenges he faced in this series of races. I loved this conversation! Then I dive deeper in to the metrics of running and how they can affect your base training period. Lastly, I interview my daughter, Ambrin Saft, after she completed her Freshman season of Cross Country. Enjoy! Resources: Wes Plate YouTube Moab 240 Video - https://youtu.be/tD2Q6ZOksuk?si=aiA0m1AiZMafmF8_ Salomon S/Lab Adventure 20 Pack - https://www.salomon.com/en-us/product/s-lab-adventure-20-lc13870/LC2710000?CMPID=ps|pm|google|pma_pm_Google_pmax_conv_b_lw_perf_ong_all_us_en_slm|||&utm_source=google&utm_medium=paidsearch&utm_content=aa-cc&utm_keyword=&utm_campaign=pma_pm_Google_pmax_conv_b_lw_perf_ong_all_us_en_slm&gclsrc=aw.ds&gad_source=1&gad_campaignid=16891227972&gbraid=0AAAAADMpyOhbSHR4YV3LLCOfgLsfrO870&gclid=Cj0KCQiAiKzIBhCOARIsAKpKLANrwxc8tyy3ownZNY5laJynw8kxe_raKoyPdmo0-5nP67ifuOMfL3QaAqwXEALw_wcB Fixing Your Feet Book - https://www.walmart.com/ip/Fixing-Your-Feet-Injury-Prevention-and-Treatment-for-Athletes-Paperback-9781643590639/804610914?wmlspartner=wlpa&selectedSellerId=0&wmlspartner=wlpa&cn=FY25-ENTP-PMAX_cnv_dps_dsn_dis_ad_entp_e_n&gclsrc=aw.ds&adid=22222222297804610914_0000000000_21407473164&wl0=&wl1=x&wl2=c&wl3=&wl4=&wl5=9010303&wl6=&wl7=&wl8=&wl9=pla&wl10=8175035&wl11=online&wl12=804610914&veh=sem&gad_source=1&gad_campaignid=21690411341&gbraid=0AAAAADmfBIoaumu5-yERaV9ZU_ICha5AG&gclid=Cj0KCQiAiKzIBhCOARIsAKpKLAOh2iY40HDbN2f6QVMZ0pL9L8PiwzjdOnh3-ycrku9r1Ek6oFDU0LcaAsLqEALw_wcB Squirrel Nut Butter - https://squirrelsnutbutter.com/ Outdoor Research Sun Gloves - https://www.outdoorresearch.com/collections/sun-protection-gloves/products/activeice-chroma-sun-gloves-280133 Outdoor Research Sun Hoodie - 
https://www.outdoorresearch.com/collections/sun-protection/products/mens-echo-hoodie-287625 Leki Trail Running Poles - https://lekiusa.com/collections/trail-running Petzl Swift RL (the light we both recommend) - https://www.petzl.com/US/en/Sport/Headlamps/SWIFT-RL Arc'teryx Norvan Jacket - https://arcteryx.com/us/en/shop/mens/norvan-insulated-hoody-8435 Dynafit Rain Jacket with Zipper on Back for Pack - https://www.dynafit.com/alpine-gore-tex-jacket-men-08-0000071468 Mountain Hardwear Ghost Whisperer Puffy Line - https://www.mountainhardwear.com/c/ghost-whisperer/?srsltid=AfmBOorSe9uGyS2oDCXXv1xHqlh9uguAsFxNcBUyz955lfL0ybUhVxUJ Wahoo Trackr Heart Rate Monitor - https://www.wahoofitness.com/devices/heart-rate-monitors/trackr-heart-rate-buy Doctors of Running Podcast on Off Season - https://podcasts.apple.com/us/podcast/266-do-not-make-these-offseason-running-mistakes/id1518639507?i=1000735352115 Aaron's information: My Socials, Channels, & Newsletter: https://www.facebook.com/MRRUNNINGPAINSCOACHING https://www.instagram.com/runningislifecoaching/ https://www.youtube.com/channel/UCQ6J512qA34z_N0KJSU4jfw https://www.strava.com/athletes/18431982 Email - coachsaft@gmail.com Thanks to all of you for listening! Please share the Podcast and please leave a review, rate, & subscribe if you haven't done so already! THANK YOU! Aaron Saft Running Is Life Coaching & Podcast
In this extra special episode, host Peter Bauman (Le Random's editor in chief) speaks with prominent AI researcher Ian Goodfellow about the legendary origins of GANs, their unexpected success and indelible impact on both twenty-first-century image making and AI research. This episode contains Peter and Ian's full conversation and serves as a companion to Monday's written interview, which covered the first half of the discussion only.
Monday's editorial: https://www.lerandom.art/editorial/ian-goodfellow-on-inventing-gans
Chapters
Sign up for Alex's first live cohort, about Hierarchical Model building!
Get 25% off "Building AI Applications for Data Scientists and Software Engineers"
Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work!
Visit our Patreon page to unlock exclusive Bayesian swag ;)
Takeaways:
Why GPs still matter: Gaussian Processes remain a go-to for function estimation, active learning, and experimental design, especially when calibrated uncertainty is non-negotiable.
Scaling GP inference: Variational methods with inducing points (as in GPflow) make GPs practical on larger datasets without throwing away principled Bayes.
MCMC in practice: Clever parameterizations and gradient-based samplers tighten mixing and efficiency; use MCMC when you need gold-standard posteriors.
Bayesian deep learning, pragmatically: Stochastic-gradient training and approximate posteriors bring Bayesian ideas to neural networks at scale.
Uncertainty that ships: Monte Carlo dropout and related tricks provide fast, usable uncertainty, even if they're approximations.
Model complexity ≠ model quality: Understanding capacity, priors, and inductive bias is key to getting trustworthy predictions.
Deep Gaussian Processes: Layered GPs offer flexibility for complex functions, with clear trade-offs in interpretability and compute.
Generative models through a Bayesian lens: GANs and friends benefit from explicit priors and uncertainty, useful for safety and downstream decisions.
Tooling that matters: Frameworks like GPflow lower the friction from idea to implementation, encouraging reproducible, well-tested modeling.
Where we're headed: The future of ML is uncertainty-aware by default, integrating UQ tightly into optimization, design, and deployment.
Chapters:
08:44 Function Estimation and Bayesian Deep Learning
10:41 Understanding Deep Gaussian Processes
25:17 Choosing Between Deep GPs and Neural Networks
32:01 Interpretability and Practical Tools for GPs
43:52 Variational Methods in Gaussian Processes
54:44 Deep Neural Networks and Bayesian Inference
01:06:13 The Future of Bayesian Deep Learning
01:12:28 Advice for Aspiring Researchers
Today, Didier Giraud, Charles Consigny, and Flora Ghebali debate the news alongside Alain Marschall and Olivier Truchot.
This week's The Summit League Segment highlights the University of Missouri Kansas City Kangaroos and includes an interview with UMKC sophomore Mariah Belmont. Plus highlights of this past week's Kwik Star Summit League Peak Performers, news from around The Summit League, and more.
Emily Pilbeam presents a mixtape of her personal selection of tracks from BBC Introducing, including a new Track of the Week by WOOM, and we get to know dream pop superstar-in-the-making Goodnight Louisa! There's also music from British Birds, Fright Years, GANS, Ellur, Jennifer Walton, Nightbus, WHITEHORSE, Euan Blackman, Mên An Tol, Wax Head, and Thandii. Produced in Salford by BBC Audio for BBC Radio 6 Music.
Joshua Gans, a professor at the University of Toronto and co-author of "Power and Prediction: The Disruptive Economics of Artificial Intelligence," joins Kevin Frazier, the AI Innovation and Law Fellow at the University of Texas School of Law and a Senior Editor at Lawfare, to evaluate ongoing concerns about AI-induced job displacement, the likely consequences of various regulatory proposals on AI innovation, and how AI tools are already changing higher education. Select works by Gans include:
A Quest for AI Knowledge (https://www.nber.org/papers/w33566)
Regulating the Direction of Innovation (https://www.nber.org/papers/w32741)
How Learning About Harms Impacts the Optimal Rate of Artificial Intelligence Adoption (https://www.nber.org/papers/w32105)
Kerrie Cosh presents a mixtape of her personal selection of tracks from BBC Introducing - Mercy Girl, Elf Jaw, Harpy, Chloe Foy, Katie Keddie, Nectar Woode & Obed Otchere, Big Softy, GANS, The Youth Play, Cortney Dixon, BEX, L E M F R E C K, Kindelan, Kayla Grace, Balancing Act, The Itch, and a new Track Of The Week from GREAT ADAMZ
Send us a text
Tom Brady, Fontainebleau and Jim Gray partner to open a sports and entertainment memorabilia museum on the Strip. We have all the details about what you will be able to see on this self-guided tour at The Hall of Excellence. It opens June 20th. Plus, The Wizard of Oz at Sphere opens on August 28th and tickets are finally on sale! There's also a larger-than-life photo moment at Sphere to promote the film. A new documentary about iconic Las Vegas headliner Danny Gans will premiere soon in Hollywood. We talked to his son, Andrew, about why he decided to make this movie. You can get tickets to the world premiere HERE. We chat with showman Frankie Scinta at the South Point. Steel Panther is coming back to Las Vegas, and CasaBlanca Resort & Casino is hiring for two new restaurants that debut later this summer. WrestleMania couldn't stay away! The big event will return to Las Vegas in 2026 for WrestleMania 42. We also discuss the tragic shooting that happened on the Las Vegas Strip.
If your home was damaged in the California wildfires, Galindo Law may be able to help you get more compensation. Call 1-800-251-1533 or visit galindolaw.com
If your Texas home was damaged by hail or a hurricane in the past 2 years, Galindo Law may be able to help you get more insurance compensation. Call 1-800-251-1533 or visit GalindoLaw.com
VegasNearMe App: If it's fun to do or see, it's on VegasNearMe. The only app you'll need to navigate Las Vegas.
Support the show
Follow us on Instagram: @vegas.revealed
Follow us on Twitter: @vegasrevealed
Follow us on TikTok: @vegas.revealed
Website: Vegas-Revealed.com
Topics discussed on this podcast include Republican hypocrisy, lip filler, and being a xennial. This podcast features "I Think I Like You" by GANS.
Jake Mintz and Jordan Shusterman react to the news that Aaron Judge will captain Team USA in the 2026 World Baseball Classic. Is he the missing piece Team USA needs to win? The boys then attempt to predict the Team USA roster for the WBC. Jake and Jordan then bring on “Hoo Lee Gans” founder Kyle Smeallie to chat about the inspiration behind baseball's newest fan group. Later, Jake gives Jordan some baseball trivia and the boys break down news around the league, including Alex Bregman's perfect day.
(2:30) - Why Jackie Robinson Day should be thought provoking
(8:00) - Predicting Team USA 2026 WBC roster
(38:00) - “Hoo Lee Gans” founder joins the show
(52:15) - Baseball trivia
(59:30) - News around the league
Subscribe to Baseball Bar-B-Cast on your favorite podcast app: