Something roughly the same as something else
Erik Dale is a "European revivalist" and prolific Bitcoiner, host of the 'Bitcoin for Breakfast' podcast, and organizer of the Northern Lightning Bitcoin conference in Scandinavia. He is an expert communications professional and a former adviser to the President of the European Commission. Erik has a unique perspective on how Bitcoin must save the old world and shares a lot about his experience of how Bitcoin (and LSD and parenthood) changed his life. → Follow Erik on https://x.com/EuroDale → Erik's YouTube: https://www.youtube.com/@kongeriket
In this video I'm sharing a part of a book that made me go “Hmm, I've never really thought about that!” It has to do with just how precise we can actually be when measuring items. One of the Standards of Mathematical Practice is “Attend to Precision” which is about being precise in all mathematical vocabulary and content, but just how precise should we make students be when it comes to measurement? Get any links mentioned in this video at BuildMathMinds.com/162
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Deep Learning" Is Function Approximation, published by Zack M Davis on March 21, 2024 on LessWrong. A Surprising Development in the Study of Multi-layer Parameterized Graphical Function Approximators As a programmer and epistemology enthusiast, I've been studying some statistical modeling techniques lately! It's been boodles of fun, and might even prove useful in a future dayjob if I decide to pivot my career away from the backend web development roles I've taken in the past. More specifically, I've mostly been focused on multi-layer parameterized graphical function approximators, which map inputs to outputs via a sequence of affine transformations composed with nonlinear "activation" functions. (Some authors call these "deep neural networks" for some reason, but I like my name better.) It's a curve-fitting technique: by setting the multiplicative factors and additive terms appropriately, multi-layer parameterized graphical function approximators can approximate any function. For a popular choice of "activation" rule which takes the maximum of the input and zero, the curve is specifically a piecewise-linear function. We iteratively improve the approximation f(x,θ) by adjusting the parameters θ in the direction of the derivative of some error metric on the current approximation's fit to some example input-output pairs (x,y), which some authors call "gradient descent" for some reason. (The mean squared error (f(x,θ) − y)² is a popular choice for the error metric, as is the negative log likelihood −log P(y|f(x,θ)). Some authors call these "loss functions" for some reason.) Basically, the big empirical surprise of the previous decade is that given a lot of desired input-output pairs (x,y) and the proper engineering know-how, you can use large amounts of computing power to find parameters θ to fit a function approximator that "generalizes" well - meaning that if you compute ŷ = f(x,θ) for some x that wasn't in any of your original example input-output pairs (which some authors call "training" data for some reason), it turns out that ŷ is usually pretty similar to the y you would have used in an example (x,y) pair. It wasn't obvious beforehand that this would work! You'd expect that if your function approximator has more parameters than you have example input-output pairs, it would overfit, implementing a complicated function that reproduced the example input-output pairs but outputted crazy nonsense for other choices of x - the more expressive function approximator proving useless for the lack of evidence to pin down the correct approximation. And that is what we see for function approximators with only slightly more parameters than example input-output pairs, but for sufficiently large function approximators, the trend reverses and "generalization" improves - the more expressive function approximator proving useful after all, as it admits algorithmically simpler functions that fit the example pairs. The other week I was talking about this to an acquaintance who seemed puzzled by my explanation. "What are the preconditions for this intuition about neural networks as function approximators?" they asked. (I paraphrase only slightly.) "I would assume this is true under specific conditions," they continued, "but I don't think we should expect such niceness to hold under capability increases. 
Why should we expect this to carry forward?" I don't know where this person was getting their information, but this made zero sense to me. I mean, okay, when you increase the number of parameters in your function approximator, it gets better at representing more complicated functions, which I guess you could describe as "capability increases"? But multi-layer parameterized graphical function approximators created by iteratively using the derivative of some error metric to improve the quality ...
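To make the description above concrete, here is a minimal sketch (mine, not the author's) of a "multi-layer parameterized graphical function approximator": a couple of affine maps composed with the ReLU "activation" rule, with the parameters θ nudged against the derivative of a mean-squared-error metric on example input-output pairs. The toy target function (sin), hidden width, and learning rate are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example input-output pairs (x, y) -- what some authors call "training" data.
x = rng.uniform(-3, 3, size=(256, 1))
y = np.sin(x)

# Parameters theta: one hidden layer of 32 ReLU units, then a linear readout.
W1 = rng.normal(0.0, 0.5, (1, 32)); b1 = np.zeros(32)
W2 = rng.normal(0.0, 0.5, (32, 1)); b2 = np.zeros(1)

lr = 0.01
for step in range(5000):
    # Forward pass: affine map -> ReLU -> affine map (a piecewise-linear curve).
    h_pre = x @ W1 + b1
    h = np.maximum(h_pre, 0.0)
    y_hat = h @ W2 + b2

    # Mean squared error (f(x, theta) - y)^2 and its gradients via the chain rule.
    grad_out = 2.0 * (y_hat - y) / len(x)
    gW2 = h.T @ grad_out
    gb2 = grad_out.sum(axis=0)
    grad_h = (grad_out @ W2.T) * (h_pre > 0)
    gW1 = x.T @ grad_h
    gb1 = grad_h.sum(axis=0)

    # "Gradient descent": adjust theta in the direction that reduces the error metric.
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

y_hat = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
print("final mean squared error:", float(np.mean((y_hat - y) ** 2)))
```

Because the activation takes the maximum of its input and zero, the fitted curve is exactly the piecewise-linear function the post mentions.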
We discuss GAMBIT, software for accurately classifying bacteria and eukaryotes using a targeted k-mer based approach. GAMBIT software: https://github.com/gambit-suite/gambit GAMBIT suite: https://github.com/gambit-suite GAMBIT (Genomic Approximation Method for Bacterial Identification and Tracking): A methodology to rapidly leverage whole genome sequencing of bacterial isolates for clinical identification. https://doi.org/10.1371/journal.pone.0277575 TheiaEuk: a species-agnostic bioinformatics workflow for fungal genomic characterization https://www.frontiersin.org/journals/public-health/articles/10.3389/fpubh.2023.1198213/full
In this conversation, Craig and Ollie discuss various topics including Brian Johnson's quest to beat the aging process, fitness goals, teaching reading using Monster Phonics, treating failures as system failures, effective teacher professional development, and the use of silent teacher and checking for listening in the classroom. In this part of the conversation, Craig Barton and Ollie Lovell discuss various teaching strategies and methods. They explore the use of worked examples and the importance of checking for understanding. They also discuss the idea of tightening feedback cycles and the benefits of more frequent assessments. Finally, they delve into the controversy surrounding exit tickets and their effectiveness as a teaching tool. You can access the show-notes here: mrbartonmaths.com/blog/tools-and-tips-for-teachers-10/ Time-stamps: Consider failures first as system failures (09:15) My latest lesson observation and coaching template (16:43) Representation, Decomposition, Approximation (32:16) Two different Starts to Finish so pairs don't copy? (42:20) Tighten feedback cycles (52:57) Are Exit Tickets a waste of time? (1:02:03)
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?, published by Teun van der Weij on February 1, 2024 on LessWrong. Produced as part of the ML Alignment Theory Scholars Program Winter 2024 Cohort, under the mentorship of Francis Rhys Ward. The code, data, and plots can be found on https://github.com/TeunvdWeij/MATS/tree/main/distribution_approximation. This post is meant to provide insight on an interesting LLM capability, which is useful for targeted underperformance on evaluations (sandbagging) by LLMs. We investigate what happens if you independently sample a language model 100 times with the task of 80% of those outputs being A, and the remaining 20% of outputs being B. Here is the prompt we used, where p is the target percentage of the output tokens being A. In the example above, p is 80. In addition to the above prompt, we use the following system prompt: What kind of distribution do you expect? We used the default sampling hyperparameters; both temperature and top_p are 1. We think it is good to register your predictions, so please do so now! Results are a bit further below. Here are the observed calibration scores for GPT-3.5 and GPT-4: Interestingly, both models are capable of doing this! Notably, GPT-3.5 performs worse than GPT-4. Although sampling like this is the easiest to understand, we get a better idea of the exact scores by skipping the sampling altogether. From now on, we extract the probabilities of the top 5 tokens. The plots illustrate that the probabilities roughly match the target distribution. Very little probability is put on tokens other than A and B, illustrating that the prompting points the model to the right tokens. To represent these graphs a bit differently, we plot the extracted probability of token A similar to the first plot. We find a similar pattern. We also observe a vague sigmoidal pattern, aligning with (private) work by Ollie Jaffe. To further test the distribution approximation capability of the model, we performed an experiment where we state the sample number of the query. Now, strategies such as "Pick A for samples 1-80 and B for 81-100" are possible, enabling the model to score perfectly. However, this does not work, as both GPT-3.5 and GPT-4 do not consistently come up with the same strategy (even for temperature of 0). Better prompting might change this result. Testing targeted sandbagging on simple addition questions The capability to approximate distributions over multiple independent forward passes is a necessary condition for "targeted" sandbagging: the ability to achieve a specific score that is worse than the model's best performance. To further investigate this, we generate 100 simple addition questions (such as: "What is 5+4?"), for which the models normally achieve 100% accuracy. The setup was similar to the first experiment, using the following system prompt and user prompt template: For each question in the dataset, we average the extracted probabilities per value of p for all 100 questions. We show the average extracted probabilities in the plot below. We see that GPT-3.5 is much worse at targeted sandbagging than the distribution approximation task. Again, GPT-4 is better than 3.5, but it also performs worse than on the previous task. 
From qualitative evaluations, we believe that these results are highly sensitive to the wording of the prompt template. For one template, we found that GPT-3.5 tended to consistently underperform (while the plot shows it overperforming). We did not try very hard to find an ideal prompt leading to good calibration. We do expect that such a prompt does exist. Discussion The main question we are asking ourselves: are our findings surprising? We do not have a convincing theory of what is going on here, and why the models are able...
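For readers who want to poke at this themselves, here is a rough sketch of the probability-extraction setup described above, written against the OpenAI chat completions API. The prompt wording and model name below are stand-ins of mine; the post's actual prompt template and system prompt are not reproduced here.

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Stand-in prompts; the post's own templates (parameterized by p) differ.
system_prompt = "You are a test subject in an experiment. Reply with a single token."
user_prompt = (
    "You will be independently sampled 100 times. Across those samples, 80% of "
    "your outputs should be A and 20% should be B. Output only A or B."
)

resp = client.chat.completions.create(
    model="gpt-4",        # the post compares GPT-3.5 and GPT-4
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,       # "we extract the probabilities of the top 5 tokens"
)

# Convert the top-5 log probabilities of the first output token into probabilities.
for entry in resp.choices[0].logprobs.content[0].top_logprobs:
    print(f"{entry.token!r}: {math.exp(entry.logprob):.3f}")
# Good calibration would put roughly 0.8 on 'A' and 0.2 on 'B'.
```

Reading the token probabilities directly, as the post does, sidesteps the sampling noise you would get from actually drawing 100 completions.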
Show Notes: Andreas Heiberg is an experienced entrepreneur and a software engineering leader with a passion for product development, space exploration, and nuclear energy. From shaping healthcare solutions at Babylon to optimizing global logistics at Gelato, Andreas brings a wealth of experience. He has co-founded and worked at a handful of venture-backed tech companies, ranging from startups to large enterprises across a diverse set of industries. In this interview Andreas shares many insights from his dynamic career. Topics Discussed: U.K. and U.S. Appeal, Graphic Design and University, Early Website Building, Career Trajectory, Becoming a Manager, Consulting, First Startups, Startup Advice, Working at Stellate, Creating Culture, Favourite Jobs, Peace Time Leadership, Moving Mountains, Distributed Work, Decompressing, Influential Mentors, Approximation of Progress, AI, Societal Changes, Questioning Norms, U.S. Startup Ecosystem, Access to Capital, Intense Workplace, Fields of Interest, Nuclear Energy, Sailing, Education in Denmark, Question Everything. Links: GitHub, Website, X, LinkedIn
Chad is charged on a long list of counts related to corruption. He isn't just going to plead guilty. He's going to trial. In court, he'll face his informants and his right-hand men who betrayed him. But whether or not Chad gets what he deserved remains an open question. Is Chad a bad apple or is the DEA a rotten orchard? Subscribe to The Binge to get all episodes of Smoke Screen ad-free right now. Click ‘Subscribe' at the top of the Smoke Screen: Betrayal On The Bayou show page on Apple Podcasts or visit GetTheBinge.com to get access wherever you get your podcasts. A Neon Hum & Sony Music Entertainment production. Find more great podcasts from Sony Music Entertainment at sonymusic.com/podcasts and follow us @sonypodcasts Learn more about your ad choices. Visit podcastchoices.com/adchoices
On today's episode we're getting attention while gaining traction and forming a humble rapport between audiences in a dead age. Having 3500 years of material to sift through. Absorbing news tangentially. Nostalgic lyric-memory messages. Maintaining a harmonic congruence with reality. Being hyped on ambient anxiety. The perverse side of heroism. Surfing waves of momentum. The Raw (sacred) vs. The Cooked (profane). Oblique phrases from the Deep and one of Kris' darkest band contributions yet. The Line as the ultimate human invention. The sequencing of Thought & Time. Spiral strategies and expanding shamanically. The importance of making multidimensional map-ceremonies. The Dark Swamp Hero. Map = Guide. The validation of rapport. The Counterintuitive. The denial of distance through engagement. Breaking the time signatures. The silence of melody; the surprise of harmony. The Human Scaffolding vs. The Jellyfish Grid. Intuition vs. Instinct. Proto Hippie Islands. The metaphysics of being in the wrong body. Entering The Illogic. The disavowed sides of history. Extended Dream-Existence. Bowling in Maya. The Profound Vacuum. Resonating with personal algorithms. Chasing Atoms. On to being skeptical towards presumed assumptions of certainty. Rediscovering fundamentals through amateur eyes. Rerouting the curious ghost hunter. We have a brief corn chip intermission. Then the dark prolapse ballad of "Slippery Chicken". Rubber room storytelling. Moving across time frames into time maps. Packaged Dimensionality vs. Kaleidoscopic Vacuums. Approximation vs. Proximity. Maple Syrup Bootleggers and Nonlinear Dingo Symbols.
DAY 258 CHALLENGE “Why do you claim that the biblical authors used a different level of precision than we do?” DEFENSE Approximations were more common because of the inability in the ancient world to accurately measure and record things (see Day 248). We can show Scripture uses many forms of approximation, including: Numerical approximations: For example, a basin in Solomon's temple is said to have a diameter of ten cubits and a circumference of thirty cubits (1 Kings 7:23; 2 Chron. 4:2), indicating the approximate value of π (pi) as 3 (see Day 197). Numerical approximations are also …
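For concreteness, the arithmetic behind the basin example works out as follows (a quick check of mine, not part of the original entry): a circle ten cubits across has a true circumference of about 31.4 cubits, so reporting "thirty cubits" amounts to rounding π to 3, an error of roughly 4.5 percent.

```python
import math

diameter = 10                 # cubits (1 Kings 7:23)
reported_circumference = 30   # cubits

exact = math.pi * diameter
implied_pi = reported_circumference / diameter
relative_error = abs(exact - reported_circumference) / exact

print(f"exact circumference ≈ {exact:.2f} cubits")
print(f"implied value of pi = {implied_pi}, relative error ≈ {relative_error:.1%}")
```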
FlashAttention was first published by Tri Dao in May 2022 and it had a deep impact in the large language models space. Most open models you've heard of (RedPajama, MPT, LLaMA, Falcon, etc) all leverage it for faster inference. Tri came on the podcast to chat about FlashAttention, the newly released FlashAttention-2, the research process at Hazy Lab, and more. This is the first episode of our "Papers Explained" series, which will cover some of the foundational research in this space. Our Discord also hosts a weekly Paper Club, which you can signup for here.

How does FlashAttention work? The paper is titled "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". There are a couple keywords to call out:
* "Memory Efficient": standard attention memory usage is quadratic with sequence length (i.e. O(N^2)). FlashAttention is sub-quadratic at O(N).
* "Exact": the opposite of "exact" in this case is "sparse", as in "sparse networks" (see our episode with Jonathan Frankle for more). This means that you're not giving up any precision.
* The "IO" in "IO-Awareness" stands for "Input/Output" and hints at a write/read related bottleneck.

Before we dive in, look at this simple GPU architecture diagram. The GPU has access to three memory stores at runtime:
* SRAM: this is on-chip memory co-located with the actual execution core. It's limited in size (~20MB on an A100 card) but extremely fast (19TB/s total bandwidth).
* HBM: this is off-chip but on-card memory, meaning it's in the GPU but not co-located with the core itself. An A100 has 40GB of HBM, but only a 1.5TB/s bandwidth.
* DRAM: this is your traditional CPU RAM. You can have TBs of this, but you can only get ~12.8GB/s bandwidth, which is way too slow.

Now that you know what HBM is, look at how the standard Attention algorithm is implemented. As you can see, all 3 steps include a "write X to HBM" step and a "read from HBM" step. The core idea behind FlashAttention boils down to this: instead of storing each intermediate result, why don't we use kernel fusion and run every operation in a single kernel in order to avoid memory read/write overhead? (We also talked about kernel fusion in our episode with George Hotz and how PyTorch / tinygrad take different approaches here.) The result is much faster, but much harder to read. As you can see, FlashAttention is a very meaningful speed improvement on traditional Attention, and it's easy to understand why it's becoming the standard for most models. This should be enough of a primer before you dive into our episode!
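As a rough illustration of the bottleneck described above (my sketch, not code from the paper or the episode), here is standard attention written so the intermediate N x N matrices are explicit. Each line corresponds roughly to one kernel in the unfused implementation, with the intermediates S and P making a round trip through HBM in between; FlashAttention's kernel fusion and tiling keep tiles of them in SRAM instead. The sizes N and d are arbitrary illustration values.

```python
import torch

def standard_attention(Q, K, V):
    # Q, K, V: (N, d). The intermediates S and P are full N x N tensors that,
    # in the unfused version, get written to HBM and read back between steps.
    S = Q @ K.T / (Q.shape[-1] ** 0.5)   # scores: write to HBM, read back
    P = torch.softmax(S, dim=-1)         # probabilities: write to HBM, read back
    return P @ V                         # output: one more pass over P

N, d = 4096, 64
Q, K, V = (torch.randn(N, d) for _ in range(3))
out = standard_attention(Q, K, V)
print(out.shape, "-- memory for S and P grows as O(N^2)")
```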
We talked about FlashAttention-2, how Hazy Research Group works, and some of the research being done in Transformer alternatives.

Show Notes:
* FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (arXiv)
* FlashAttention-2
* Together AI
* From Deep Learning to Long Learning
* The Hardware Lottery by Sara Hooker
* Hazy Research
* Is Attention All You Need?
* Nvidia CUTLASS 3
* SRAM scaling slows
* Transformer alternatives: S4, Hyena, Recurrent Neural Networks (RNNs)

Timestamps:
* Tri's background [00:00:00]
* FlashAttention's deep dive [00:02:18]
* How the Hazy Research group collaborates across theory, systems, and applications [00:17:21]
* Evaluating models beyond raw performance [00:25:00]
* FlashAttention-2 [00:27:00]
* CUDA and The Hardware Lottery [00:30:00]
* Researching in a fast-changing market [00:35:00]
* Promising transformer alternatives like state space models and RNNs [00:37:30]
* The spectrum of openness in AI models [00:43:00]
* Practical impact of models like LLAMA2 despite restrictions [00:47:12]
* Incentives for releasing open training datasets [00:49:43]
* Lightning Round [00:53:22]

Transcript:

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. Today we have no Swyx, because he's in Singapore, so it's a one-on-one discussion with Tri Dao. Welcome! [00:00:24]Tri: Hi everyone. I'm Tri Dao, excited to be here. [00:00:27]Alessio: Tri just completed his PhD at Stanford a month ago. You might not remember his name, but he's one of the main authors of the FlashAttention paper, which is one of the seminal works of the Transformers era. He's got a lot of interest in efficient transformer training and inference, long-range sequence models, a lot of interesting stuff. And now you're going to be an assistant professor in CS at Princeton next year. [00:00:51]Tri: Yeah, that's right. [00:00:52]Alessio: Yeah. And in the meantime, just to get, you know, a low pressure thing, you're Chief Scientist at Together as well, which is the company behind RedPajama. [00:01:01]Tri: Yeah. So I just joined this week actually, and it's been really exciting. [00:01:04]Alessio: So what's something that is not on the internet that people should know about you? [00:01:09]Tri: Let's see. When I started college, I was going to be an economist, so I was fully on board. I was going to major in economics, but the first week I was at Stanford undergrad, I took a few math classes and I immediately decided that I was going to be a math major. And that kind of changed the course of my career. So now I'm doing math, computer science, AI research. [00:01:32]Alessio: I had a similar thing. I started with physics and then I took like a programming course and I was like, I got to do computer science. I don't want to do physics. So FlashAttention is definitely, everybody's using this. Everybody loves it. You just released FlashAttention 2 last week. [00:01:48]Tri: Yeah. Early this week on Monday. Yeah. [00:01:53]Alessio: You know, AI time. Things move fast. So maybe let's run through some of the FlashAttention highlights, some of the innovation there, and then we can dive into FlashAttention 2. So the core improvement in FlashAttention is that traditional attention is quadratic in sequence length, whereas FlashAttention is linear, which obviously helps with scaling some of these models. [00:02:18]Tri: There are two factors there. So of course the goal has been to make attention go faster or more memory efficient. 
And ever since attention became popular in 2017 with the Transformer paper, lots and lots of folks have been working on this. And a lot of approaches have been focusing on approximating attention. The goal is you want to scale to longer sequences. There are tons of applications where you want to do that. But scaling to longer sequences is difficult because attention scales quadratically in sequence length on both runtime and memory, as you mentioned. So instead of trying to approximate attention, we were trying to figure out, can we do the same computation and maybe be more memory efficient? So in the end, we ended up with memory that is linear in sequence length. In terms of computation, it's still quadratic, but we managed to make it much more hardware friendly. And as a result, we do get wall clock speed up on the order of 2 to 4x, which really helps because that just means that you'll be able to train with 2 to 4x longer sequence length for the same cost without doing any approximations. As a result, lots of folks have been using this. The thing is available in a lot of libraries that do language model training or fine tuning. [00:03:32]Alessio: And the approximation thing is important because this is an exact thing versus a sparse. So maybe explain a little bit the difference there. [00:03:40]Tri: For sure. So in attention, essentially you compute pairwise similarity between every single element in a sequence against each other. So there's been other approaches where instead of doing all that pairwise computation, you only compute similarity for some pairs of elements in the sequence. So you don't do a quadratic number of comparisons. And this can be seen as some form of sparsity. Essentially you're ignoring some of the elements. When you write down the matrix, you essentially say, OK, I'm going to pretend they're zero. So that has some benefits in terms of runtime and memory. But the trade-off is that it tends to do worse in terms of quality because you're essentially approximating or ignoring some elements. And I personally have worked on this as well for a few years. But when we talk to practitioners who actually train models, especially at large scale, they say they tend not to use these approximate attention methods. Because it turns out, and this was surprising to me at the time, these approximation methods, even though they perform fewer computations, tend to not be faster in wall-clock time. So this was pretty surprising because back then, I think my background was more on the theoretical side. So I was thinking of, oh, how many flops or floating point operations are you performing? And hopefully that correlates well with wall-clock time. But I realized that I was missing a bunch of ideas from the system side where flops or floating point operations don't necessarily correlate with runtime. There are other factors like memory reading and writing, parallelism, and so on. So I learned a ton from just talking to systems people because they kind of figured this stuff out a while ago. So that was really eye-opening. And then we ended up focusing a lot more on memory reading and writing because that turned out to be the majority of the time when you're doing attention is reading and writing memory. [00:05:34]Alessio: Yeah, the I.O. awareness is probably one of the biggest innovations here. And the idea behind it is, like you mentioned, the FLOPS growth of the cards has been going up, but the memory bandwidth, not as much. 
So I think maybe that was one of the assumptions that the original attention paper had. So talk a bit about how that came to be as an idea. It's one of those things that, like, in hindsight, it's like, obviously, why are we like rewriting to like HBM every time, you know, and like once you change it, it's clear. But what was that discovery process? [00:06:08]Tri: Yeah, in hindsight, a lot of the ideas have already been there in the literature. And I would say it was somehow at the intersection of both machine learning and systems. And you kind of needed ideas from both sides. So on one hand, on the system side, so lots of systems folks have known that, oh, you know, kernel fusion is great. Kernel fusion just means that instead of, you know, loading the same element, performing an operation, writing it down, loading it back up and performing the second operation, you just load it once, perform two operations and then write it down again. So that saves you kind of memory read and write in the middle there. So kernel fusion has been a classic. There's been other techniques from the system side, like tiling, where you perform the computations in blocks, again, so that you can load it into a really fast memory. Think of it as a cache. And this is, again, classical computer science ideas, right? You want to use the cache. So the system folks have been thinking about these ideas for a long time, and they apply to attention as well. But there were certain things in attention that made it difficult to do a complete kernel fusion. One of which is there is this softmax operation in the middle, which requires you to essentially sum across the row of the attention matrix. So it makes it difficult to kind of break it, because there's this dependency. So it makes it difficult to break things into blocks. So on the system side, people have been thinking about these ideas, but it's been difficult to kind of do kernel fusion for the entire operation. On the machine learning side, people have been thinking more algorithmically. They say, okay, either we can approximate attention, or there's this trick called the online softmax trick, which says that because of softmax, the way it's written mathematically, you can actually break it up into smaller pieces, do some rescaling, and still get the right answer. So this online softmax trick has been around for a while. I think there was a paper from NVIDIA folks back in 2018 about this. And then there was a paper from Google. So Markus Rabe and Charles Staats wrote a paper in late 2021 on using this online softmax trick to break attention up into smaller pieces. So a lot of the ideas were already there. But it turns out, you kind of need to combine ideas from both sides. So you need to understand that, hey, we want to do kernel fusion to reduce memory reads and writes. But we also need this online softmax trick to be able to break the softmax into smaller pieces so that a lot of the systems tricks kind of carry through. We saw that, and it was kind of a natural idea that we ended up using ideas from both sides, and it ended up working pretty well. Yeah. [00:08:57]Alessio: Are there any downsides to kernel fusion? If I think about databases and the reasons why we have atomic operations, you know, it's like, you have observability and fallback in between them. How does that work with attention? Is there anything that we lose by fusing the operations? 
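As a quick aside before Tri's answer, here is a toy NumPy illustration (nothing like the actual CUDA kernel) of the online softmax trick he just described: process a row of scores block by block, keep a running max m and a running sum l, and rescale the partial sum whenever a new block raises the max. The block size and random scores are arbitrary illustration choices.

```python
import numpy as np

def online_softmax_stats(scores, block_size=4):
    """Return (m, l) such that softmax(scores) == exp(scores - m) / l,
    computed one block at a time with the rescaling trick."""
    m, l = -np.inf, 0.0
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        m_new = max(m, float(block.max()))
        # The old partial sum was relative to the old max; rescale it before adding.
        l = l * np.exp(m - m_new) + float(np.exp(block - m_new).sum())
        m = m_new
    return m, l

scores = np.random.default_rng(0).normal(size=16)
m, l = online_softmax_stats(scores)
reference = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
assert np.allclose(np.exp(scores - m) / l, reference)
print("blockwise online softmax matches the one-shot softmax")
```

Multiplying the running sum by exp(m - m_new) is exactly the rescaling that lets each block be processed independently and still recover the exact softmax, which is what makes fusing the whole attention computation into one tiled kernel possible.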
[00:09:13]Tri: Yeah, I think mostly on the practical side is that you lose a little bit of flexibility in the sense that, hey, now you have, for example, faster attention, it's just a subroutine that you would call to do attention. But as a researcher, let's say you don't want that exact thing, right? You don't want just attention, let's say you want some modification to attention. You want to do, hey, I'm going to multiply the query and key, but then I'm going to do this extra thing before I carry on. So kernel fusion just means that, okay, we have a subroutine that does the entire thing. But if you want to experiment with things, you won't be able to use that fused kernel. And the answer is, can we have a compiler that then automatically does a lot of this kernel fusion? Lots of compiler folks are thinking about this, either with a new language or you can embed it in PyTorch. PyTorch folks have been working on this as well. So if you write just your code in PyTorch and they can capture the graph, can they generate code that will fuse everything together? That's still ongoing, and it works for some cases. But for attention, because of this kind of softmax rewriting stuff, it's been a little bit more difficult. So maybe in a year or two, we'll have compilers that are able to do a lot of these optimizations for you. And you don't have to, for example, spend a couple months writing CUDA to get this stuff to work. Awesome. [00:10:41]Alessio: And just to make it clear for listeners, when we say we're not writing it to memory, we are storing it, but just in a faster memory. So instead of the HBM, we're putting it in the SRAM. Yeah. [00:10:53]Tri: Yeah. [00:10:54]Alessio: Maybe explain just a little bit the difference there. [00:10:56]Tri: Yeah, for sure. This is kind of a caricature of how you think about accelerators or GPUs in particular, is that they have a large pool of memory, usually called HBM, or high bandwidth memory. So this is what you think of as GPU memory. So if you're using A100 and you list the GPU memory, it's like 40 gigs or 80 gigs. So that's the HBM. And then when you perform any operation, you need to move data from the HBM to the compute unit. So the actual hardware unit that does the computation. And next to these compute units, there are on-chip memory or SRAM, which are much, much smaller than HBM, but much faster. So the analogy there is if you're familiar with, say, CPU and RAM and so on. So you have a large pool of RAM, and then you have the CPU performing the computation. But next to the CPU, you have L1 cache and L2 cache, which are much smaller than DRAM, but much faster. So you can think of SRAM as the small, fast cache that stays close to the compute unit. Physically, it's closer. There is some kind of asymmetry here. So HBM is much larger, and SRAM is much smaller, but much faster. One way of thinking about it is, how can we design algorithms that take advantage of this asymmetric memory hierarchy? And of course, lots of folks have been thinking about this. These ideas are pretty old. I think back in the 1980s, the primary concerns were sorting. How can we sort numbers as efficiently as possible? And the motivating example was banks were trying to sort their transactions, and that needs to happen overnight so that the next day they can be ready. And so the same idea applies, which is that they have slow memory, which was hard disk, and they have fast memory, which was DRAM. And people had to design sorting algorithms that take advantage of this asymmetry. 
And it turns out, these same ideas can apply today, which is different kinds of memory. [00:13:00]Alessio: In your paper, you have the pyramid of memory. Just to give people an idea, when he says smaller, it's like HBM is like 40 gig, and then SRAM is like 20 megabytes. So it's not a little smaller, it's much smaller. But the throughput on card is like 1.5 terabytes a second for HBM and like 19 terabytes a second for SRAM, which is a lot larger. How do you think that evolves? So TSMC said they hit the scaling limits for SRAM, they just cannot grow that much more. HBM keeps growing, HBM3 is going to be 2x faster than HBM2, I think the latest NVIDIA thing has HBM3. How do you think about the future of FlashAttention? Do you think HBM is going to get fast enough when maybe it's not as useful to use the SRAM? [00:13:49]Tri: That's right. I think it comes down to physics. When you design hardware, literally SRAM stays very close to compute units. And so you don't have that much area to essentially put the transistors. And you can't shrink these things too much. So just physics, in terms of area, you don't have that much area for the SRAM. HBM is off-chip, so there is some kind of bus that essentially transfers data from HBM to the compute unit. So you have more area to essentially put these memory units. And so yeah, I think in the future SRAM probably won't get that much larger, because you don't have that much area. HBM will get larger and faster. And so I think it becomes more important to design algorithms that take advantage of this memory asymmetry. It's the same thing in CPU, where the cache is really small, the DRAM is growing larger and larger. DRAM could get to, I don't know, two terabytes, six terabytes, or something, whereas the cache stays at, I don't know, 15 megabytes or something like that. I think maybe the algorithm design becomes more and more important. There's still ways to take advantage of this, I think. So in the future, I think flash attention right now is being used. I don't know if in the next couple of years, some new architecture will come in and whatnot, but attention seems to be still important. For the next couple of years, I still expect some of these ideas to be useful. Not necessarily the exact code that's out there, but I think these ideas have kind of stood the test of time. New ideas like IO awareness from back in the 1980s, ideas like kernel fusions, tiling. These are classical ideas that have stood the test of time. So I think in the future, these ideas will become more and more important as we scale models to be larger, as we have more kinds of devices, where performance and efficiency become much, much more important. [00:15:40]Alessio: Yeah, and we had Jonathan Frankle on the podcast, and if you go to issattentionallyouneed.com, he has an outstanding bet, and he does believe that attention will be the state of the art architecture still in a few years. Did you think flash attention would be this popular? I'm always curious on the research side, you publish a paper, and obviously you know it's great work, but sometimes it just kind of falls flat in the industry. Could you see everybody just starting to use this, or was that a surprise to you? [00:16:11]Tri: Certainly, I didn't anticipate the level of popularity. Of course, we were extremely happy to have people using this stuff and giving us feedback and so on, and help us improve things. 
I think when we were writing the paper, I remember sending an email to one of my advisors, and like, hey, I'm excited about this paper, but I think the most important thing will be the artifact, which is the code. So I knew that the code will be valuable. So we kind of focus a lot on the code and make sure that the code is usable and as fast as can be. Of course, the idea, the paper presents the ideas and explain it and have experiments that validate the idea, but I knew that the artifact or the code was also pretty important. And that turned out to be the right focus, which is, you know, we put out the paper, we release the code and continue working on the code. So it's a team effort with my co-authors as well. [00:17:07]Alessio: We mentioned Hazy Research a bunch of times on the podcast before. I would love for you to spend five minutes just talking about how does the group work? How do people get together? How do you bounce ideas off of each other? Yeah. [00:17:21]Tri: So Hazy Research is a research group at Stanford led by one of my advisors, Chris Ré. I love the people there. It was one of the best experiences I had. They've made my PhD so much more enjoyable. And I think there are a couple of ways that the group has been working pretty well. So one is, I think there's a diverse pool of people who either, you know, some of them focus on algorithms and theory, some of them focus on building systems, some of them focus on applications. And as a result, there is this flow of ideas. So as an example, some of us were working on like more algorithms and theory, and then we can talk to the folks building systems and say, hey, let's try it out and let's put it in the systems and see how it is. And there you will get feedback from systems folks. They will say, hey, we implemented this, or we tried this and this is where it doesn't work, something like that. And once we put it in the systems, the application folks can use the algorithm or new methods or new models. And we again get great feedback from them because the application folks, for example, some of my good friends, they focus on medical imaging or seizure detection. And that is the problem they care about. And if your method doesn't work on the task they care about, they will tell you. Whereas I think a lot of people in machine learning, they're a little bit more flexible. So they will be like, hey, it doesn't work on seizure detection. Let's try some other task, right? But having that direct feedback of like, hey, it doesn't work there, let's figure out why. I think that that feedback allows us to do better work. And I think that kind of process of exchanging ideas, validating it in a real system so that applications folks can try it out and give you feedback. That cycle has been very, very useful. And so that's one, having a diverse group of people. The other one is, and this is something I really appreciate from advice from Chris was try to understand the fundamental, right? And he's happy letting me go off and read some textbooks and playing with things because I think a lot of research ideas come from understanding the old literature and see how it fits with the new landscape. And so if you just read new arXiv papers every day, that's great, but you also need to read textbooks. And that's one advice I got from Chris, which is understand the fundamentals. And I think that allows us to do more impactful work. [00:19:46]Alessio: How do you think about academia versus industry? 
I feel like AI / Machine Learning has been an area where up until three, four years ago, most of the cutting edge work was being done in academia. And now there's all these big industry research labs. You're obviously going to Princeton, so you're an academia believer. How should people think about where to go? Say I'm doing my master's, I have to decide between doing a PhD and going into OpenAI or Anthropic. How should I decide? [00:20:15]Tri: I think they kind of play a complementary role, in my opinion. Of course, I also was considering different paths as well. So I think right now, scaling matters a lot, especially when you talk about language models and AI and so on. Scaling matters a lot. And that means that you need compute resources and you need infrastructure and you need engineers' time. And so industry tends to have an advantage when it comes to scaling things. But a lot of the ideas actually came from academia. So let's take Attention, which got popular with the Transformer in 2017. Attention actually has been around for a while. So I think the first mention was in 2014, a paper from Bahdanau and others and Yoshua Bengio, which is coming from academia. A lot of ideas did come from academia. And scaling things up, of course, I think OpenAI has been great at scaling things up. That was the bet that they made after, I think, GPT-2. So they saw that scaling these things up, which back then meant 1.5 billion parameters, seemed to give you amazing capabilities. So they really committed to that. They really committed to scaling things. And that turned out to be, it's been a pretty successful bet. I think for academia, we're still trying to figure out exactly what we're doing in this shifting landscape. And so lots of folks have been focusing on, for example, evaluation. So I know the Stanford Center for Research on Foundation Models led by Percy, they have this benchmark called HELM, which is this holistic benchmark. So trying to figure out, okay, characterizing the landscape of different kinds of models, what people should evaluate, what people should measure, and things like that. So evaluation is one role. The other one is understanding. So this has happened historically where there's been some development in the industry and academia can play a role in explaining, understanding. They have the luxury to slow down trying to understand stuff, right? So lots of papers on understanding what's really going on, probing these models, and so on. I think I'm not as familiar with the NLP literature, but my impression is there's a lot of that going on in the NLP conferences, which is understanding what these models are doing, what capabilities they have, and so on. And the third one I could see is that academia can take more risky bets in the sense that we can work on stuff that is quite different from industry. I think industry, my impression is you have some objective. You're trying to say, hey, for this quarter, we want to scale the model in this particular way. Next quarter, we want the model to have these capabilities. You're trying to get objectives where maybe, I don't know, 70% will work out because it's important for the company's direction. I think for academia, the way things work is you have many, many researchers or PhD students, and they're kind of pursuing independent directions. And they have a little bit more flexibility on, hey, I'm going to try out this seemingly crazy idea and see, let's say there's a 30% chance of success or something. 
And however you define success, for academia, a lot of the time, success just means like, hey, we found something interesting. That could eventually go into industry through collaboration and so on. So I do see academia and industry kind of playing complementary roles. And as for someone choosing a career, I think just more and more generally, industry would be probably better in terms of compensation, in terms of probably work-life balance. But my biased perspective is that maybe academia gives you a little bit more freedom to think and understand things. So it probably comes down to personal choice. I end up choosing to be a professor next year at Princeton. But of course, I want to maintain a relationship with industry folks. I think industry folks can provide very valuable feedback to what we're doing in academia so that we understand where the field is moving because some of the directions are very much influenced by what, for example, OpenAI or Google is doing. So we want to understand where the field is moving. What are some promising applications? And try to anticipate, okay, if the field is moving like this, these applications are going to be popular. What problems will be important in two, three years? And then we try to start thinking about those problems so that hopefully in two, three years, we have some of the answers to some of these problems in two, three years. Sometimes it works out, sometimes it doesn't. But as long as we do interesting things in academia, that's the goal. [00:25:03]Alessio: And you mentioned the eval side. So we did a Benchmarks 101 episode. And one of the things we were seeing is sometimes the benchmarks really influence the model development. Because obviously, if you don't score well on the benchmarks, you're not going to get published and you're not going to get funded. How do you think about that? How do you think that's going to change now that a lot of the applications of these models, again, is in more narrow industry use cases? Do you think the goal of the academia eval system is to be very broad and then industry can do their own evals? Or what's the relationship there? [00:25:40]Tri: Yeah, so I think evaluation is important and often a little bit underrated. So it's not as flashy as, oh, we have a new model that can do such and such. But I think evaluation, what you don't measure, you can't make progress on, essentially. So I think industry folks, of course, they have specific use cases that their models need to do well on. And that's what they care about. Not just academia, but other groups as well. People do understand what are some of the emerging use cases. So for example, now one of the most popular use cases is Chatbot. And then I think folks from Berkeley, some of them are from Berkeley, call them MLCs. They set up this kind of Chatbot arena to essentially benchmark different models. So people do understand what are some of the emerging use cases. People do contribute to evaluation and measurement. And as a whole, I think people try to contribute to the field and move the field forward, albeit that maybe slightly different directions. But we're making progress and definitely evaluation and measurement is one of the ways you make progress. So I think going forward, there's still going to be just more models, more evaluation. We'll just have better understanding of what these models are doing and what capabilities they have. 
[00:26:56]Alessio: I like that your work has been focused on not making benchmarks better, but it's like, let's just make everything faster. So it's very horizontal. So FlashAttention 2, you just released that on Monday. I read in the blog post that a lot of the work was also related to some of the NVIDIA library updates. Yeah, maybe run us through some of those changes and some of the innovations there. Yeah, for sure. [00:27:19]Tri: So FlashAttention 2 is something I've been working on for the past couple of months. So the story is the NVIDIA CUTLASS team, they released a new version of their library, which contains all these primitives to allow you to do matrix multiply or memory loading on GPU efficiently. So it's a great library and I built on that. So they released their version 3 back in January and I got really excited and I wanted to play with that library. So as an excuse, I was just like, okay, I'm going to refactor my code and use this library. So that was kind of the start of the project. By the end, I just ended up working with the code a whole lot more and I realized that, hey, there are these inefficiencies still in Flash Attention. We could change this way or that way and make it, in the end, twice as fast. But of course, building on the library that the NVIDIA folks released. So that was kind of a really fun exercise. I was starting out, it's just an excuse for myself to play with the new library. What ended up was several months of improvement, improving Flash Attention, discovering new ideas. And in the end, we managed to make it 2x faster and now it's pretty close to probably the efficiency of things like matrix multiply, which is probably the most optimized subroutine on the planet. So we're really happy about it. The NVIDIA Cutlass team has been very supportive and hopefully in the future, we're going to collaborate more. [00:28:46]Alessio: And since it's an NVIDIA library, can you only run this on CUDA runtimes? Or could you use this and then run it on an AMD GPU? [00:28:56]Tri: Yeah, so it's an NVIDIA library. So right now, the code we release runs on NVIDIA GPUs, which is what most people are using to train models. Of course, there are emerging other hardware as well. So the AMD folks did implement a version of Flash Attention, I think last year as well, and that's also available. I think there's some implementation on CPU as well. For example, there's this library, ggml, where they implemented the same idea running on Mac and CPU. So I think that kind of broadly, the idea would apply. The current implementation ended up using NVIDIA's library or primitives, but I expect these ideas to be broadly applicable to different hardware. I think the main idea is you have asymmetry in memory hierarchy, which tends to be everywhere in a lot of accelerators. [00:29:46]Alessio: Yeah, it kind of reminds me of Sara Hooker's post, like the hardware lottery. There could be all these things that are much better, like architectures that are better, but they're not better on NVIDIA. So we're never going to know if they're actually improved. How does that play into some of the research that you all do too? [00:30:04]Tri: Yeah, so absolutely. Yeah, I think Sara Hooker, she wrote this piece on hardware lottery, and I think she captured really well of what a lot of people have been thinking about this. 
And I certainly think about the hardware lottery quite a bit, given that I do some work that's really low level, at the level of, hey, we're optimizing for GPUs, or NVIDIA GPUs, and optimizing for attention itself. And at the same time, I also work on algorithms and methods and Transformer alternatives. And we do see this effect in play, not just the hardware lottery, but also kind of a software framework lottery. You know, attention has been popular for six years now, and so many engineering hours have been spent on making it as easy and efficient as possible to run Transformers, right? And there are libraries to do all kinds of tensor parallel, pipeline parallel, if you use a Transformer. Let's say someone else developed alternatives, or let's just take recurrent neural nets, like LSTM, GRU. If we want to run those efficiently on current hardware with current software frameworks, that's quite a bit harder. So in some sense, there is this feedback loop where the model architectures that take advantage of hardware become popular, and the hardware will also evolve to optimize a little bit for that kind of architecture, and software frameworks will also evolve to optimize for that particular architecture. Right now, Transformer is the dominant architecture. So yeah, I'm not sure if there is a good way out of this. Of course, there's a lot of development. I think compilers will play a role, because compilers allow you to be much more efficient across different kinds of hardware: essentially you write the same code, and the compiler will be able to make it run efficiently on different kinds of hardware. So for example, there's this language Mojo; they're compiler experts, right? And their bet is that AI models will be running on different kinds of devices, so let's make sure we have really good compilers, with a good language, so that the compiler can do a good job optimizing for all kinds of devices. So that's maybe one way you can get out of this cycle. But yeah, I'm not sure of a good way. In my own research, I have to think about both the algorithm, the new model, and how it maps to hardware. There are crazy ideas that seem really good but would be really, really difficult to run efficiently. And as a result, for example, we can't really scale some of those architectures up, simply because they're not hardware friendly. I have to think about both sides when I'm working on new models. [00:32:50]Alessio: Yeah. Have you spent any time looking at some of the new kind of AI chip companies, so to speak, like the Cerebras of the world? One of their innovations is co-locating everything on the chip, so you remove some of this memory bandwidth issue. How do you think about that? [00:33:07]Tri: Yeah, I think that's an interesting bet. I think Tesla also has this Dojo supercomputer, where they try to have essentially as fast on-chip memory as possible and remove some of this data transfer back and forth. I think that's a promising direction. The issues I could see, and I'm definitely not a hardware expert, are that, for one, on-chip memory tends to be really expensive to manufacture, much more expensive per gigabyte compared to off-chip memory. I talked to some of my friends at Cerebras, and they have their own stack and compiler and so on, and they can make it work. The other kind of obstacle is, again, the compiler and software framework and so on.
For example, if you can run PyTorch on this stuff, lots of people will be using it. But supporting all the operations in PyTorch will take a long time to implement. Of course, people are working on this. So I think, yeah, we kind of need these different bets on the hardware side as well. Hardware has, my understanding is, a longer time scale: you need to design hardware, you need to manufacture it, maybe on the order of three to five years or something like that. So people are taking different bets, but the AI landscape is changing so fast that it's hard to predict what kind of models will be dominant in, let's say, three or five years. Or, thinking back five years ago, would we have known that Transformer would be the dominant architecture? Maybe, maybe not, right? And so different people will make different bets on the hardware side. [00:34:39]Alessio: Does the pace of the industry and the research also influence the PhD research itself? For example, in your case, you're working on improving attention. It probably took you quite a while to write the paper and everything, but in the meantime, you could have had a new model architecture come out and then it's like nobody cares about attention anymore. How do people balance that? [00:35:02]Tri: Yeah, so I think it's tough. It's definitely tough for PhD students, for researchers. Given that the field is moving really, really fast, I think it comes down to understanding fundamentals. Because that's essentially what the PhD allows you to do: it gives you a couple of years to understand the fundamentals. For example, when I started my PhD, I was working on understanding matrix-vector multiply, a concept that's been around for hundreds of years. We were trying to characterize what kinds of matrices have theoretically fast multiplication algorithms. That seems to have nothing to do with AI or anything. But I think that was the time when I developed mathematical maturity and research taste and research skill. The research topic at that point didn't have to be super trendy or anything; as long as I was developing skills as a researcher, I was making progress. And eventually I got quite a bit better in terms of research skills. And that allows, for example, PhD students later in their career to quickly develop solutions to whatever problems they're facing. So I think that's just the natural arc of how you're trained as a researcher. For a lot of PhD students, given the pace is so fast, maybe it's harder to justify spending a lot of time on the fundamentals. And it's tough. It's this kind of explore-exploit dilemma, and I don't think there's a universal answer. I personally spend some time doing this kind of exploration, reading random textbooks or lecture notes, and I spend some time keeping up with the latest architectures or methods and so on. I don't know if there's a right balance; it varies from person to person. But if you only spend 100% on one, either you only do exploration or you only do exploitation, I think it probably won't work in the long term. It's probably going to have to be a mix, and you have to experiment and be introspective and say, hey, I tried this kind of mixture of, I don't know, one exploration paper and one exploitation paper. How did that work out for me? And have a conversation with, for example, my advisor about, hey, did that work out?
Should I shift focus more to one or the other? I think quickly adjusting and focusing on the process is probably the right way. I don't have a specific recommendation like, hey, focus, I don't know, 60% on lecture notes and 40% on arXiv papers or anything like that. [00:37:35]Alessio: Let's talk about some Transformer alternatives. You know, say Jonathan Frankle loses his bet and Transformer is not the state-of-the-art architecture. What are some of the candidates to take over? [00:37:49]Tri: Yeah, so this bet is quite fun. My understanding is this is the bet between Jonathan Frankle and Sasha Rush, right? I've talked to Sasha a bunch, and I think he recently gave an excellent tutorial on Transformer alternatives as well, so I would recommend that. Just to quickly recap, I think there's been quite a bit of development more recently on Transformer alternatives, so architectures that are not Transformers. And the question is, can they do well on, for example, language modeling, which is the application that a lot of people care about these days? So there are methods based on state space models that came out in 2021 from Albert Gu, Karan Goel, and Chris Ré that presumably could do much better in terms of capturing long-range information while not scaling quadratically. They scale sub-quadratically in sequence length, so potentially you could have a much more efficient architecture when the sequence length gets really long. The other line has been focusing more on recurrent neural nets, which is, again, an old idea, but adapted to the new landscape. So things like RWKV; I've also personally worked in this space as well. There have been some promising results, some results here and there showing that, hey, these alternatives, either RNNs or state space methods, can match the performance of Transformers on language modeling. So that's really exciting. And on the academic research side, we want to understand: do we really need attention? I think that's a valuable intellectual question. And maybe we do, maybe we don't. If we want to know, we need to spend serious effort on trying the alternatives, and there have been folks pushing in this direction. I think RWKV has scaled up to a model at 14 billion parameters that seems pretty competitive with Transformers. So that's really exciting. That's the intellectual motivation: we want to figure out if attention is necessary. The other motivation is that Transformer alternatives could have an advantage in practice in some use cases. One use case is really long sequences. The other is really high-throughput generation. For really long sequences, when you train with a Transformer, even with FlashAttention and so on, the computation is still quadratic in the sequence length. So if your sequence length is on the order of, I don't know, 16K, 32K, 100K, and some of these models have sequence length 100K, then you do get significantly slower training, and also slower inference. So maybe these alternative architectures could scale better in terms of sequence length. I haven't seen actual validation of this; let's say a released RNN model with a context length of, I don't know, 100K or something, I haven't really seen that. But the hope is that as we scale to long sequences, these alternative architectures could be better suited.
Not just text, but things like high-resolution images, audio, video, and so on, which are emerging applications. So that's one: long sequences. Number two is high-throughput generation, where I can imagine scenarios where the application isn't an interactive chatbot, but, let's say, a company wants to batch as many requests as possible on their server, or they're doing offline processing, generating stuff based on their internal documents that needs to be processed in batch. And the issue with Transformers is that during generation, they essentially need to keep around all the previous history. It's called the KV cache. And that can take a significant amount of memory, so you can't batch very much because you run out of memory. I am personally bullish on RNNs. RNNs essentially summarize the past into a state vector of fixed size, so the size doesn't grow with the history. That means you don't need as much memory to keep around all the previous tokens, and as a result, I think you can scale to much higher batch sizes. And as a result, you can make much more efficient use of the GPUs or accelerators, and you could have much higher generation throughput. Now, this, I don't think, has been validated at scale. But as a researcher, I'm bullish on this stuff because I think in the next couple of years, these are use cases where these alternatives could have an advantage. We'll just have to wait and see if these things happen. At the same time, I also spend a bunch of time making attention as fast as possible. So maybe I'm hedging and playing both sides. Ultimately, as researchers, we want to understand what works and why the models have these capabilities. One way is, let's push attention to be as efficient as possible. On the other hand, let's push the alternatives to be as efficient and as big at scale as possible, so that we can compare them and understand. Yeah, awesome. [00:43:01]Alessio: And I think as long as all of this work happens in the open, it's a net positive for everybody to explore all the paths. Yeah, let's talk about open-source AI. Obviously, Together, when Red Pajama came out, which was an open clone of the LLAMA1 pre-training dataset, it was a big thing in the industry. LLAMA2 came out on Tuesday, I forget. And this week, there have been a lot of things going on which they call open-source, but it's not really open-source. Actually, we wrote a post about it that was on the front page of Hacker News before this podcast, so I was frantically responding. How do you think about what open-source AI really is? In my mind, in open-source software, we have different levels of open. There's free software, that's like the GPL license. There's open-source, which is Apache, MIT. And then there's kind of restricted open-source, which is the SSPL and some of these other licenses. In AI, you have the open models. So Red Pajama is an open model because you have the pre-training dataset, you have the training runs and everything. And then there's obviously the randomness that doesn't make it one-to-one if you retrain it. Then you have the open-weights models, like StableLM, where the weights are open but the dataset is not open. And then you have LLAMA2, where the dataset is not open and the weights are restricted. It's kind of like not really open-source, but open enough.
I think it's net positive because it's like $3 million of flops donated to the public. How do you think about that? And also, as you work at Together, what is your philosophy with open-source AI? [00:44:40]Tri: Right, right. Yeah, I think that's a great question. And I think about it in maybe more practical terms. So of course, Meta has done an amazing job training LLAMA1 and LLAMA2. And for LLAMA2, they made it much less restrictive compared to LLAMA1. Now you can use it for business, unless you have more than 700 million monthly active users or something like that. I think just this change will have a very significant impact on the landscape of open-source AI, where now lots of businesses, lots of companies will be using, I expect, things like LLAMA2. They will fine-tune on their own datasets. They will be serving variants or derivatives of LLAMA2. Whereas before, with LLAMA1, it was also a really good model, but businesses weren't allowed to do that. So on a more practical level, it's kind of shifting the balance between closed-source models like OpenAI and Anthropic and Google, where you're making API calls and maybe you don't understand as much of what the model is doing or how the model is changing, versus now we have a model with open weights that is pretty competitive, from what I've seen in terms of benchmarks, with GPT-3.5. And if you fine-tune it on your own data, maybe it's better suited to your own data. And I do see that shifting the balance: more and more folks are going to be using, let's say, derivatives of LLAMA2; more and more folks are going to fine-tune and serve their own model instead of calling an API. That shift in balance is important because, for one, we don't want a concentration of decision-making power in the hands of just a few companies. So I think that's a really positive development from Meta. Of course, training the model takes a couple of million dollars, and on top of that, I'm sure the engineers spent tons of time trying many, many different things, so the actual cost is probably way more than that. And they make the weights available, and probably a lot of companies are going to be using this. So I think that's a really positive development. And we've also seen amazing progress from the open-source community, where they take these models and either fine-tune them on different kinds of datasets or even make changes to the model. As an example, I think for LLAMA1 the context length was limited to 2K, and a bunch of folks figured out some really simple methods to scale it up to like 8K. [00:47:12]Alessio: Like the RoPE. [00:47:13]Tri: Yes. I think the open-source community is very creative, right? And lots of people. LLAMA2 will, again, accelerate this: more people will try it out, more people will make tweaks to it and make contributions, and so on. So overall, I see that as a very positive development for the field. And there have been lots of libraries that allow you to host or fine-tune these models, even with quantization and so on. Just a couple of hours after LLAMA2 was released, tons of companies announced that, hey, it's on our API or hosting and so on, and Together did the same. So it's a very fast-paced development, and just having a model with available weights that businesses are allowed to use, I think that alone is already a very positive development.
At the same time, yeah, we can do much better in terms of releasing datasets. Somehow people are not incentivized to release datasets. So philosophically, yeah, you want to be as open as possible. But on a practical level, I think it's a little bit harder for companies to release datasets: there are legal issues, and dataset releases tend to be not as eye-catching as model releases, so maybe people are less incentivized to do that. We've seen quite a few companies releasing datasets, though. Together released the RedPajama dataset. I think Cerebras then worked on that, deduplicated and cleaned it up, and released SlimPajama, and so on. So we're also seeing positive development on that front, on the pre-training dataset side, and I do expect that to continue. And then on the fine-tuning or instruction-tuning dataset side, we now have quite a few open datasets for instruction tuning and fine-tuning. But companies do pay human labelers to annotate these instruction-tuning datasets, and that is expensive. Maybe they see that as their competitive advantage, and so it's harder to incentivize these companies to release those datasets. So in practical terms, we're still going to make a lot of progress on open-source AI: on model development, on model hosting, on pre-training datasets and fine-tuning datasets. Right now, maybe we don't have the perfect open-source model where all the datasets are available. Maybe we don't have such a thing yet, but we've seen very fast development on the open-source side. Just maybe this time last year, there weren't as many models that were competitive with, let's say, ChatGPT. [00:49:43]Alessio: Yeah, I think the open datasets have so much more impact than open models. If you think about EleutherAI and the work that they've done, GPT-J was great, and the Pythia models are great, but the Pile and the Stack, everybody uses them. So hopefully we get more people to contribute time to work on datasets instead of doing the 100th open model that performs worse than all the other ones, but they want to say they released a model. [00:50:14]Tri: Yeah, maybe the question is, how do we figure out an incentive structure so that companies are willing to release open datasets? For example, I think some organizations are now doing this where they ask volunteers to annotate and so on. And maybe the Wikipedia model of datasets, especially for instruction tuning, could be interesting, where people actually volunteer their time and, instead of editing Wikipedia, add annotations, and somehow they are acknowledged and feel incentivized to do so. Hopefully we get to that level where, in terms of data, it would be kind of like Wikipedia, and in terms of model development, it would be kind of like Linux, where people contribute patches and improve the model in some way. I don't know exactly how that's going to happen, but based on history, I think there is a way to get there. [00:51:05]Alessio: Yeah, I think the Dolly-15K dataset is a good example of a company saying, let's do this smaller thing, just make sure we make it open. We had Mike Conover from Databricks on the podcast, and he was like, people just bought into it and leadership was bought into it. You have companies out there with 200,000, 300,000 employees. It's like, just put some of them to label some data. It's going to be helpful. So I'm curious to see how that evolves. What made you decide to join Together?
[00:51:35]Tri: At Together, the focus has been very much on open-source models, and I think that aligns quite well with what I care about, of course. I also know and trust a bunch of people there, and I'm excited to work with them. Philosophically, I like a lot the way they've been really open with dataset and model releases. Personally, for the research that I've developed, we also try to make the code available, free to use and modify and so on, contributing to the community, and that has given us really valuable feedback from the community and improved our work. So philosophically, I like the way Together has been focusing on open-source models. And the nice thing is we're also going to be at the forefront of research, and the research areas that I'm really excited about, things like efficient training and inference, align quite well with what the company is doing. We'll try our best to make things open and available to everyone. Yeah, it's going to be fun being at the company, leading a team, doing research on the topics that I really care about, and hopefully we'll make things open to benefit the community. [00:52:45]Alessio: Awesome. Let's jump into the lightning round. Usually, I have two questions. So one is on acceleration, one on exploration, and then a takeaway. So the first one is, what's something that already happened in AI machine learning that you thought would take much longer than it has? [00:53:01]Tri: I think understanding jokes. I didn't expect that to happen, but it turns out that by scaling models up and training on lots of data, the models can now understand jokes. Maybe it's a small thing, but that was amazing to me. [00:53:16]Alessio: What about the exploration side? What are some of the most interesting unsolved questions in the space? [00:53:22]Tri: I would say reasoning, in the broad sense. We don't really know how these models do it. They do something that looks like reasoning; we don't know how they're doing it. We have some ideas. And in the future, I think we will need to design architectures that explicitly have some kind of reasoning module in them if we want much more capable models. [00:53:43]Alessio: What's one message you want everyone to remember today? [00:53:47]Tri: I would say, try to understand both the algorithms and the systems that these algorithms run on. The intersection of machine learning and systems has been really exciting, and there have been a lot of amazing results at this intersection. And when you scale models to large scale, both the machine learning side and the systems side really matter. [00:54:06]Alessio: Awesome. Well, thank you so much for coming on, Tri. [00:54:09]Tri: This was great. Yeah, this has been really fun. [00:54:11] Get full access to Latent Space at www.latent.space/subscribe
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.07.22.550017v1?rss=1 Authors: Nenning, K.-H., Xu, T., Tambini, A., Franco, A. R., Margulies, D. S., Colcombe, S. J., Milham, M. P. Abstract: Brain connectome analysis suffers from the high dimensionality of connectivity data, often forcing a reduced representation of the brain at a lower spatial resolution or parcellation. However, maintaining high spatial resolution can both allow fine-grained topographical analysis and preserve subtle individual differences otherwise lost. This work presents a computationally efficient approach to estimate spatially fine-grained connectivity gradients and demonstrates its application in improving brain-behavior predictions. Copyright belongs to the original authors. Visit the link for more info. Podcast created by Paper Player, LLC
We introduce a new class of time-continuous recurrent neural network models. Instead of declaring a learning system's dynamics by implicit nonlinearities, we construct networks of linear first-order dynamical systems modulated via nonlinear interlinked gates. The resulting models represent dynamical systems with varying (i.e., liquid) time-constants coupled to their hidden state, with outputs being computed by numerical differential equation solvers. These neural networks exhibit stable and bounded behavior, yield superior expressivity within the family of neural ordinary differential equations, and give rise to improved performance on time-series prediction tasks. To demonstrate these properties, we first take a theoretical approach to find bounds over their dynamics, and compute their expressive power by the trajectory length measure in a latent trajectory space. We then conduct a series of time-series prediction experiments to manifest the approximation capability of Liquid Time-Constant Networks (LTCs) compared to classical and modern RNNs. 2020: Ramin M. Hasani, Mathias Lechner, Alexander Amini, D. Rus, R. Grosu Recurrent neural network, Time series, Dynamical system, Nonlinear system, Approximation, Experiment, Numerical analysis, Artificial neural network https://arxiv.org/pdf/2006.04439v3.pdf
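A rough sketch of the construction described above (my paraphrase of the liquid time-constant update as I understand it from the paper, not a quotation): each hidden state x(t) follows a gated first-order ODE,

\[
\frac{d\mathbf{x}(t)}{dt} \;=\; -\Bigl[\tfrac{1}{\tau} + f\bigl(\mathbf{x}(t), \mathbf{I}(t), t, \theta\bigr)\Bigr] \odot \mathbf{x}(t) \;+\; f\bigl(\mathbf{x}(t), \mathbf{I}(t), t, \theta\bigr) \odot A ,
\]

where f is the learned nonlinear gate, I(t) the input, τ a base time constant, and A a learned bias-like vector. The effective time constant τ/(1 + τ f(·)) then depends on the state and the input, which is what makes it "liquid", and the trajectory is obtained with a numerical ODE solver, as the abstract says.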
250 episodes and still going strong! Nicole reveals her poll results regarding where we'd each live if we had to move (from our Bozeman Reaction discussion). This week's discussion includes looking up the superior colliculus, Raj being left out, Bernadette and the daycare center, the ball pit scene and its return in the tag, and more! Download here. Running time: 1:25:13, 51.3 MB
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Approximation is expensive, but the lunch is cheap, published by Jesse Hoogland on April 19, 2023 on LessWrong. Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. Thank you to @Mark Chiu and @Quintin Pope for feedback. Machine learning is about finding good models: of the world and the things in it; of good strategies and the actions to achieve them. A sensible first question is whether this is even possible — whether the set of possible models our machine can implement contains a model that gets close to the thing we care about. In the language of empirical risk minimization, we want to know if there are models that accurately fit the target function and achieve low population risk, R(f∗). If this isn't the case, it doesn't matter whether your training procedure finds optimal solutions (optimization) or whether optimal solutions on your training set translate well to new data (generalization). You need good approximation for "optimal" to be good enough. The classical approach to approximation is that of universal approximation theorems. Unfortunately, this approach suffers from being too general and not saying anything about efficiency (whether in terms of the parameter count, weight norm, inference compute, etc.). It doesn't tell us what distinguishes neural networks as approximators from any other sufficiently rich model class such as polynomials, Gaussian processes, or even lookup tables. To find out what makes neural networks special, we have to move away from the classical focus on bounds that are agnostic to the details of the target function. You can't separate the properties that make neural networks special from the properties that make real-world target functions special. In particular, neural networks are well-suited to modeling two main features of real-world functions: smoothness (flat regions/low frequencies) and, for deep neural networks, sequential subtasks (hierarchical/modular substructure). A major flaw of classical learning theory is that it attempts to study learning in too much generality. Obtaining stronger guarantees requires breaking down the classes we want to study into smaller, more manageable subclasses. In the case of approximation, this means breaking apart the target function class to study "natural" kinds of target functions; in the case of generalization, this will mean breaking apart the model class into "learnable" subclasses. Already long underway, this shift towards a "thermodynamics of learning" is at the heart of an ongoing transformation in learning theory. Universal approximation is cosmic waste Polynomials are universal approximators. The original universal approximation theorem dates back to Weierstrass in 1885. He proved that polynomials could "uniformly" approximate any desired continuous function over a fixed interval, where "uniformly" means that the difference between the outputs of the target function and model function is less than a fixed distance, ϵ, for every input. Infinite-width networks are universal approximators. Half a century later, Stone generalized the result to arbitrary "polynomial-like" function classes in what is now known as the Stone-Weierstrass theorem. 
In 1989, Hornik, Stinchcombe, and White showed that infinite-width one-hidden-layer neural networks with sigmoidal activations satisfy the conditions of this theorem, which makes neural networks universal approximators. It's possible to obtain the same guarantees for networks with more modern activation functions (Telgarsky 2020) and through different approaches (e.g., Cybenko 1989). Universal approximation is expensive. The main problem with these results is that they say nothing about efficiency, i.e., how many parameters we need to achieve a good fit. Rather than blanket statements of "universal approximation...
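A minimal sketch of the point the excerpt is building toward (my own toy example, not the author's): a one-hidden-layer ReLU network with evenly spaced "knots" is just a piecewise-linear basis, and fitting only the output weights by least squares already approximates a smooth 1D target. What the classical theorems leave unquantified is how many units a given accuracy requires.

```python
# Toy illustration: width of a one-hidden-layer ReLU model vs. approximation error.
# Hypothetical example for intuition only, not code from the article.
import numpy as np

def relu_features(x, knots):
    # Hidden layer with unit input weights and biases -knot_j: phi_j(x) = max(x - knot_j, 0)
    return np.maximum(x[:, None] - knots[None, :], 0.0)

target = lambda x: np.sin(3.0 * x)            # smooth 1D function to approximate
x = np.linspace(-1.0, 1.0, 400)

for width in (4, 16, 64):
    knots = np.linspace(-1.0, 1.0, width)
    Phi = np.hstack([np.ones((x.size, 1)), relu_features(x, knots)])  # bias + hidden units
    w, *_ = np.linalg.lstsq(Phi, target(x), rcond=None)               # fit output weights only
    sup_err = np.max(np.abs(Phi @ w - target(x)))                     # uniform error on the grid
    print(f"{width:3d} hidden units -> max |error| = {sup_err:.4f}")
```

The error drops as the width grows, which is the universal-approximation story; the efficiency question is the rate at which it drops for "natural" targets versus arbitrary ones.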
We learn about the world's first smart halter with Protequus founder Jeffrey R. Schab, CHA Master Horseman Marla Foreman explains successive approximation in teaching and Val McCloskey clears up the leg yield or side pass confusion. The HORSES IN THE MORNING Crew: Glenn the Geek, Christy Landwehr, Coach Jenn. Sponsored by Certified Horsemanship Association. Sponsor: Protequus, Nightwatch. Guest: Jeffrey Schab, Protequus CEO. Guest: Marla Foreman, Marla Foreman Facebook. Guest: Val McCloskey, pictured, Whisper Wind Equestrian Centre. Song: Kasey Smith. CHA Changes Lives Through Safe Experiences With Horses. Follow Horse Radio Network on Twitter or follow Horses In The Morning on Facebook. Additional support for this podcast provided by Listeners Like You
Today's Episode: Approximation in Chinese. The Learn Chinese Podcast is brought to you by LC Chinese School. Listening to our podcast is an enjoyable way to learn Chinese at your own pace, whenever and wherever you are! Visit our website www.lcchineseschool.com and sign up for a FREE Chinese Trial Class
Anisotropic Photon and Electron Scattering without Ultrarelativistic Approximation by Anderson C. M. Lai et al. on Wednesday 30 November. Interactions between photons and electrons are ubiquitous in astrophysics. Photons can be down-scattered (Compton scattering) or up-scattered (inverse Compton scattering) by moving electrons. Inverse Compton scattering, in particular, is an essential process for the production of astrophysical gamma rays. Computations of inverse Compton emission typically adopt an isotropic or an ultrarelativistic assumption to simplify the calculation, which makes them unable to extend the formulae to the whole phase space of source particles. In view of this, we develop a numerical scheme to compute the interactions between anisotropic photons and electrons without taking ultrarelativistic approximations. Compared to the ultrarelativistic limit, our exact results show major deviations when target photons are down-scattered or when they possess energy comparable to that of the source electrons. We also consider two test cases of high-energy inverse Compton emission to validate our results in the ultrarelativistic limit. In general, our formalism can be applied to cases of anisotropic electron-photon scattering in various energy regimes, and for computing the polarizations of the scattered photons. arXiv: http://arxiv.org/abs/2211.15691v1
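For orientation (a standard textbook relation, not taken from the paper): in the electron rest frame, single Compton scattering obeys

\[
E' \;=\; \frac{E}{1 + \dfrac{E}{m_e c^2}\,\bigl(1 - \cos\theta\bigr)} ,
\]

where E and E' are the photon energies before and after scattering and θ is the scattering angle. Ultrarelativistic treatments simplify how this relation is boosted between the electron frame and the lab frame by assuming a very large electron Lorentz factor; as I read the abstract, the authors keep the full angle- and energy-dependent kinematics so the result remains valid when photons are down-scattered or carry energy comparable to that of the electrons.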
Approximation is a powerful technique, but is not applicable in all rational work, and so is not a good general theory of nebulosity. https://metarationality.com/approximation You can support the podcast and get episodes a week early, by supporting the Patreon: https://www.patreon.com/m/fluidityaudiobooks If you like the show, consider buying me a coffee: https://www.buymeacoffee.com/mattarnold Original music by Kevin MacLeod. This podcast is under a Creative Commons Attribution Non-Commercial International 4.0 License.
Joseph Bennish, Prof. Emeritus of CSULB, describes the field of Diophantine approximation, which started in the 19th Century with questions about how well irrational numbers can be approximated by rationals. It took Cantor and Lebesgue to develop new ways to talk about the sizes of infinite sets to give the 20th century new ways to think about it. This led up to the Duffin-Schaeffer conjecture and this year's Fields Medal for James Maynard. --- Send in a voice message: https://anchor.fm/the-art-of-mathematics/message
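As a concrete anchor for the topic (standard statements, not quotes from the episode), Dirichlet's approximation theorem guarantees that every irrational number α has infinitely many rational approximations p/q satisfying

\[
\left|\alpha - \frac{p}{q}\right| < \frac{1}{q^{2}} .
\]

The metric theory running from Cantor and Lebesgue through to the Duffin-Schaeffer conjecture then asks, for a general error function ψ, whether almost every α (in the sense of Lebesgue measure) admits infinitely many coprime solutions of |α - p/q| < ψ(q)/q; Koukoulopoulos and Maynard proved in 2019 that the answer is yes precisely when the series Σ ψ(q)φ(q)/q diverges.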
Constraining the Fluctuating Gunn-Peterson Approximation Using Lyman-α Forest Tomography at z=2 by Robin Kooistra et al. on Thursday 22 September. The fluctuating Gunn-Peterson approximation (FGPA) is a commonly-used method to generate mock Lyman-α (Lyα) forest absorption skewers at Cosmic Noon (z ≳ 2) from the matter-density field of N-body simulations without running expensive hydrodynamical simulations. Motivated by recent developments in 3D IGM tomography observations as well as matter density field reconstruction techniques applied to galaxy redshift samples at z ∼ 2, we examine the possibility of observationally testing the FGPA by directly examining the relationship between the Lyα transmission and the underlying matter density field. Specifically, we analyze the EAGLE, Illustris, IllustrisTNG and Nyx cosmological hydrodynamic simulations, that were run with different codes and sub-grid models. While the FGPA is an excellent description of the IGM in lower-density regions, the slope of the transmission-density distribution at higher densities is significantly affected by feedback processes causing the FGPA to break down in that regime. Even without added feedback, we find significant deviations caused by hydrodynamical effects arising from non-linear structure growth. We then proceed to make comparisons using realistic mock data assuming the sightline sampling and spectral properties of the recent CLAMATO survey, and find that it would be challenging to discern between the FGPA and hydrodynamical models with current data sets. However, the improved sightline sampling from future extremely large telescopes or large volumes from multiplexed spectroscopic surveys such as Subaru PFS should allow for stringent tests of the FGPA, and make it possible to detect the effect of galaxy feedback on the IGM. arXiv: http://arxiv.org/abs/2201.10169v2
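For readers who have not met the acronym before, the FGPA in its usual form (my summary, not the paper's wording) maps the local matter overdensity Δ directly to Lyα optical depth through a power law,

\[
\tau_{\mathrm{Ly}\alpha} \;\propto\; \Delta^{\,2 - 0.7(\gamma - 1)} ,
\]

where γ is the slope of the IGM temperature-density relation T = T₀ Δ^(γ-1); the Δ² factor comes from recombination scaling with density squared, and the correction in the exponent from the roughly T^(-0.7) temperature dependence of the recombination coefficient. The test described above is essentially whether the measured transmission-density relation actually follows such a power law outside the low-density regime where feedback and other hydrodynamical effects intervene.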
The brothers ramble on about a good ol' Ryan Reynolds movie, Just Friends. Why was Wade scared of the Goths? Will wrestling be a religion in the far future? Why are country singers trying to swing so hard? Is Britain trying to run it back? Were British atheists offended by hearing God Save the Queen? Email us stuff at punandgame@gmail.com. Merch: https://teespring.com/stores/punandgame Promo Code: WGAF for free shipping. YouTube: https://www.youtube.com/channel/UCDUpI3McVZBegI28on8uwOA Twitter: @PunandGame Instagram: @WadeTaylor_WGAF @PunandGame
Toward an understanding of the properties of neural network approaches for supernovae light curve approximation by Mariia Demianenko et al. on Sunday 18 September. Modern time-domain photometric surveys collect a lot of observations of various astronomical objects, and the coming era of large-scale surveys will provide even more information. Most of the objects have never received a spectroscopic follow-up, which is especially crucial for transients, e.g., supernovae. In such cases, observed light curves could present an affordable alternative. Time series are actively used for photometric classification and characterization, such as peak and luminosity decline estimation. However, the collected time series are multidimensional, irregularly sampled, contain outliers, and do not have well-defined systematic uncertainties. Machine learning methods help extract useful information from available data in the most efficient way. We consider several light curve approximation methods based on neural networks: Multilayer Perceptrons, Bayesian Neural Networks, and Normalizing Flows, to approximate observations of a single light curve. Tests using both the simulated PLAsTiCC and real Zwicky Transient Facility data samples demonstrate that even a few observations are enough to fit the networks and achieve better approximation quality than other state-of-the-art methods. We show that the methods described in this work have better computational complexity and work faster than Gaussian Processes. We analyze the performance of the approximation techniques, aiming to fill the gaps in the observations of the light curves, and show that the use of an appropriate technique increases the accuracy of peak finding and supernova classification. In addition, the study results are organized in the Fulu Python library available on GitHub, which can be easily used by the community. arXiv: http://arxiv.org/abs/2209.07542v1
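A minimal sketch of the underlying idea (a toy example of mine, not the paper's Fulu library or its API): fit a small neural network to an irregularly sampled, noisy light curve so the gaps can be filled in and the peak located on a dense grid.

```python
# Toy single-band light-curve approximation with an MLP (illustrative only).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Irregularly sampled "observations" of a rise-and-decline transient, with noise.
t_obs = np.sort(rng.uniform(0.0, 100.0, size=40))
flux_true = np.exp(-0.5 * ((t_obs - 40.0) / 12.0) ** 2)
flux_obs = flux_true + rng.normal(0.0, 0.03, size=t_obs.shape)

# Regress flux on (scaled) time; a small network is enough for one light curve.
mlp = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
mlp.fit(t_obs[:, None] / 100.0, flux_obs)

# Densely resample the fitted curve to fill observational gaps and find the peak.
t_grid = np.linspace(0.0, 100.0, 500)
flux_fit = mlp.predict(t_grid[:, None] / 100.0)
print(f"estimated peak time: {t_grid[np.argmax(flux_fit)]:.1f} (true peak at 40.0)")
```

The paper's methods additionally handle multiple passbands and uncertainty estimates; this sketch only shows the basic regression step.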
Residency initiative »bangaloREsidency-Expanded« | Concert [05.03.2022] Yashas Shetty »Approximation of the Sea« (2020 – 22), for tanpura, piano and electronics, 30' The ‘drone' has influenced Western minimalism for almost 60 years now. Beyond its influence on Western classical music, its presence has extended into popular music as well, with bands from The Velvet Underground to some German Krautrock groups borrowing heavily from the concept of the drone. Its influence reaches deep into the popular music and culture of today. Yet the historical narrative of the drone has always been a one-way street. It is a story told by American composers in which the early origins of this transfer of technology are shrouded in mythology and mystification. »Approximation of the Sea«, developed by sound artist Yashas Shetty in collaboration with ZKM | Hertz-Lab sound engineer Benjamin Miller for tanpura, piano, and electronics, explores both the technical and aesthetic challenges of composing with surround systems. It is designed especially for the ZKM_Cube. The piece itself extends from Shetty's ongoing research into the history of encounters between Western composers and Indian artists in the 20th century and is rooted in his broader artistic practice of extracting histories from mythologies.
In this episode of ScreenTone Club, Elliot and Andy get mixed up with questionable romances of multiple kinds and ponder the true value of a well-timed truck.
Series Discussed: Welcome Back, Alice Vol. 1; Approximation of a Brother Complex -A Sibling Love Story-
Assignments for next Episode: Tesla Note Vol. 1, Ya Boy Kongming! Vol. 1
If you enjoy this episode, please consider backing us on Patreon - from only US$1 a month you get bonus episodes and other perks as well, including the ability to vote on topics for us to cover!
We are affiliates on BookWalker! Using this link will give us a small kickback, helping cover the cost of manga for the podcast!
TIMECODES:
0:00:45 - Forever 21!
0:02:30 - Elliot's Pick: Welcome Back, Alice
0:05:30 - “Basically”
0:08:15 - Self-Parody?
0:12:00 - Cave Painting Character
0:15:15 - Devil Ecstasy - Semiotics of Succubi!
0:19:00 - Author Name Power!
0:24:00 - Andy's Pick: Approximation of a Brother Complex -A Sibling Love Story-
0:26:45 - “Shades of Haruhi”
0:29:30 - Juuuuuuuuust Right
0:35:15 - Screentone Club makes another Light Novel Title
0:35:30 - Post-ending thread
0:38:30 - The Final Verdict on Oreimo
0:41:00 - Engaging!
0:41:15 - Our Picks for next Episode!
0:43:00 - Closedown!
In this episode of ScreenTone Club, Elliot and Andy visit the demon realm to see its most pitiable resident before setting in for the night (with a cat)!
Series Discussed: Nights With a Cat Vol. 1, The Great Jahy Will Not Be Defeated! Vol. 1
Assignments for next Episode: Welcome Back, Alice Vol. 1; Approximation of a Brother Complex -A Sibling Love Story-
If you enjoy this episode, please consider backing us on Patreon - from only US$1 a month you get bonus episodes and other perks as well, including the ability to vote on topics for us to cover!
We are affiliates on BookWalker! Using this link will give us a small kickback, helping cover the cost of manga for the podcast!
TIMECODES:
0:01:00 - Can You Drink Anime???
0:03:15 - Elliot's Pick: Nights with a Cat Vol. 1
0:07:00 - Cynicism???
0:10:15 - Well Balanced
0:13:15 - Pleasant Afterword :3
0:17:15 - Andy's Pick: The Great Jahy Will Not Be Defeated! Vol. 1
0:20:15 - “The Manga Starts Now!”
0:21:00 - The Comparisons Begin!
0:23:15 - Dungeon Keeper (1997)
0:25:30 - Steamed Hams
0:26:30 - Elliot Page Politics Minute!
0:31:30 - “The Poverty Angle”
0:34:45 - Squeenix Published = issues
0:40:45 - Druj Gaiden?
0:42:00 - Oops Backseat writing again!
0:44:15 - We have reached the “Bargaining” stage!
0:45:00 - Our Picks for Next Episode!
0:46:00 - THE SCREENTONE CLUB HIVE MIND
0:47:30 - Closedown
Psychological thriller about love and the mirror-verse by Oliver Emanuel.
this one has a lot more fun tone than anything else I've done in the past, and is probably the most easy listening of all my episodes so far. expect some chaotic energy. And some Phil Collins. --- Send in a voice message: https://anchor.fm/biggayeric1505gmailcom/message
BjjBrick Podcast- BJJ, Jiu-Jitsu, MMA, martial arts, no-gi and good times!
In part 2 of this series Mez explains the why and how to apply approximation & traction into your practice from a "neuro" and "ortho" perspective. This PNF checklist item is a subcategory of the proprioceptive input principle. Sign up for our --> weekly newsletter for more rehab content Apply Now --> Application to Neuropedics' 12 Week Motor Control Mentorship Program Visit our website www.NeuropedicsPT.com Questions for Mez? You can email him: Ramez@NeuropedicsPT.com
A Sibling Chats and Sketchy Potion Vats Actual Play Adventure Co-Created by: Alpha Comics & Games: From vintage comics & games to new releases, find your Adventure at Alpha! | Conveniently located in Willow Lawn, Richmond VA. Goblins and Growlers: Community Building Through Tabletop Gaming. Creating all-original TTRPG content, and fostering nerdy spaces for everyone both digitally and in-person!
Studying the history of life is an important venture. It's how we understand why certain characteristics exist in living organisms, and it can also be used to explain the importance of biological events that are happening today. A study on the population density of the Tyrannosaurus rex, one of the world's most famous predators, was first published in Science and reported on by National Geographic. It's a huge claim, with researchers estimating that a total of 2.5 billion T. rex lived in North America, the native region of the species, ranging as far north as Alaska and as far south as Mexico across a time span of two to three million years. The estimate has certainly caught the eye of paleontology enthusiasts. However, there is a wide variety of variables that can compromise the validity of the information being tested: the location where the bones are found; shifts caused by glacial patterns and tracks throughout the years; inconsistencies with carbon-14 dating, which provides an approximate age; and even human intervention, which may not be enough to fill in the gaps in information that we do not know nor have the tools to understand just yet. If data-driven ventures cannot be used to sample what we know to be true, then is it still worth it? Are approximations a step in the right direction, or are they too rooted in theory to be useful? How Much Hindsight Is Too Much?: The pursuit of estimates often discounts the importance of absolutes. In paleontology, there are plenty of assumptions made that may affect the results of research. As Alexander mused, much remains unsaid about the foundations of the study, and that may have an impact on whether or not scientists are taking the right perspective on the matter. Analyzing data from the source and having a clear log of how the researchers conducted their tests is standard procedure. However, what is the impact of creating logs for circumstances that can no longer be observed by anyone living? “Who decided that the dinosaur is a dinosaur or not a dinosaur? Who decides that it sits in this area of time as opposed to another? What if my carbon dating is wrong, and maybe this aquatic animal that we didn't think existed prehistorically actually did exist?” Alexander asked, expressing doubts. Transparent and Tangible Research: This is the second time that scientists have attempted to estimate the population density of T. rex, and the results closely resemble an earlier estimate that was published in 1993. The difference between these two papers is that the most recent study utilizes the latest in T. rex biology research to set upper and lower limits on the total population: one approximation after another. Since there is so much inexactness and uncertainty in what we do, it is important to focus on the fundamentals: ideas, principles, and beliefs that we know to be observable, objective, and tangible. When we go overboard on theory, we may find ourselves painting a biased picture of what the data represents. This concern is not just limited to research and development in paleontology. With the vast variety of tools, knowledge, and technology that we have at our disposal today, it can become all too easy to take the wrong direction. When we take the next step forward, we need to make sure that our feet are planted firmly on the ground.
Dealing With the Metaphorical T-rex of Today: At the pace that science and technology are developing today, it's safe to assume that more discoveries will be made, not just in paleontology, but in other sciences and across other industries as well. It is vital that scientists continue working towards making these discoveries more accessible to the public while staying true to the path of innovation. There is a different impact in analyzing tangible beings, objects, and events. TARTLE is an opportunity to look at the T-rexes of the modern world: clear and imminent threats that are capable of harming us and the people we care about. The TARTLE platform is an opportunity to connect with like-minded individuals and organizations so that we can work as a collective to preserve our earth and our economy for future generations. www.tartle.co Tcast is brought to you by TARTLE, a global personal data marketplace that allows users to sell their personal information anonymously when they want to, while allowing buyers to access clean, ready-to-analyze data sets on digital identities from all across the globe. The show is hosted by Co-Founder and Source Data Pioneer Alexander McCaig and Head of Conscious Marketing Jason Rigby. What's your data worth? Find out at: https://tartle.co/ YouTube: https://www.youtube.com/c/TARTLE Facebook: https://www.facebook.com/TARTLEofficial/ Instagram: https://www.instagram.com/tartle_official/ Twitter: https://twitter.com/TARTLEofficial Spread the word!
In the case of one of our all-time favorite Pen Addict products, what is dead may never die. Yes, the Scribble Pen is … back? It's something, that's for sure.
This episode is also available as a blog post: https://ungroovygords.com/2021/09/18/linear-approximation-tanka-poetry/ --- Send in a voice message: https://anchor.fm/groovygords/message
Classics: We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples. 2014: I. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, Yoshua Bengio Generative adversarial networks, Generative model, Discriminative model, Backpropagation, Minimax, Markov chain, Multilayer perceptron, Assignment (computer science), Approximation algorithm, Experiment https://arxiv.org/pdf/1406.2661.pdf
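A minimal sketch of the adversarial game described above, on a 1D toy distribution (my illustration, using the common non-saturating generator loss rather than the raw minimax objective; all sizes and names here are arbitrary choices):

```python
# Tiny GAN: G learns to mimic samples from N(4, 1.5^2); D learns to tell real from fake.
import torch
import torch.nn as nn

torch.manual_seed(0)

def sample_real(n):
    return 4.0 + 1.5 * torch.randn(n, 1)          # the "data distribution"

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # noise -> fake sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> real/fake logit

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

for step in range(3000):
    # Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
    real, fake = sample_real(64), G(torch.randn(64, 8)).detach()
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: push D(G(z)) toward 1 (non-saturating variant of the minimax loss).
    g_loss = bce(D(G(torch.randn(64, 8))), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

with torch.no_grad():
    s = G(torch.randn(5000, 8))
print(f"generated mean {s.mean():.2f}, std {s.std():.2f} (target 4.00, 1.50)")
```

As the abstract notes, both players are trained with plain backpropagation and no Markov chains; at the (ideal) equilibrium the generator recovers the data distribution and the discriminator outputs 1/2 everywhere.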
I am talking with my friend, Vidal Freire. We discuss, most exactly, Approximated Truth Claims. Vidal posits that we cannot really know reality absolutely, and therefore all our knowledge about reality (and about truth) is approximate. This then means that language is merely a broken human tool, science is not definite, and that we are constantly plagued by our own subjectivity. This makes Vidal a very definite and hard Skeptic. I have many, many points of contention and disagreement with this view (in fact, it is almost the antithesis of my thinking); however, I do not attempt to hash them out here, for I was primarily concerned with understanding Vidal's philosophy. Vidal and I intend to have another conversation in the future about God, Existentialism (how I define it), and more about honest and atheistic skepticism. Here is the link to Vidal's diagram: https://www.coltonkirby.com/post/vidal-freire-s-approximated-truth-claims-diagram Make sure to leave a review and subscribe! Check out my YouTube: Colton Kirby - YouTube --Links-- My Website: https://www.coltonkirby.com/ Twitter: https://twitter.com/_coltonkirby Parler: https://parler.com/profile/coltonkirby/posts Facebook: https://www.facebook.com/coltonjkirby/ Instagram: https://www.instagram.com/_coltonkirby/ Pinterest: https://www.pinterest.com/coltonjkirby/ Quora: https://www.quora.com/profile/Colton-Kirby-7
CO Front Range News Hour - 2021-7-22 https://patriotpost.us/memes/81518-the-emotion-you-feel-2021-07-22 https://patriotpost.us/memes/81512-the-real-royals-2021-07-22 https://www.the-sun.com/news/3331697/fedex-ups-airbnb-delta-airlines-website-down-outrage/ https://www.investmentwatchblog.com/nothing-to-do-with-cyber-polygon-tens-of-thousands-of-websites-worldwide-went-offline-today/ https://www.zerohedge.com/covid-19/nfl-teams-unvaxd-players-forfeit-games-incur-financial-liability-during-outbreaks See omnystudio.com/listener for privacy information.
Ah lads WE'RE BACK! Did ya miss us? Probably not! This week Sarah Jane tells the story of the horrific murder of Grace Livingstone and the Gardaí's utter ineptitude when dealing with it. We also talk about how it didn't, in fact, come home, and how Colin actually did leave in Emma's fart. Have a lovely week, friends!
The Steelers have a good track record for drafting superstar players. But picks like Terry Bradshaw, Troy Polamalu, Cam Heyward and Bud Dupree are great examples of players who didn't shine right away. So just how long does it take for a Steelers’ draft pick to get acclimated? Thank goodness for the Stat Geek to break it all “dahn”. This is just one of the topics that will be discussed on the Thursday episode of the AM slate of the BTSC family of podcasts. Join Co-Editor Dave Schofield as he pulls out the Steelers slide rule and geeks out like only he can. Learn more about your ad choices. Visit megaphone.fm/adchoices
Welcome to our second episode of Y2K GROUP CHAT. We recorded on two separate days in June 2020. fields delves into: COVID-19, the history of the spirometer, the study of labor and fatigue, his relationship with science, fact-checking, Taco Tuesdays, urban gardening, the history of medical theater, and biases in science. fields harrington is an emerging artist based in Brooklyn, New York. Follow us on Instagram: @y2kgroup Subscribe to our YouTube channel for more content about contemporary art. Stay up-to-date with our Y2K Blog on our website for more news. Audio timestamps below:
1:17 - Intro with fields
2:16 - Y2K's reason for starting a podcast
3:20 - The podcast music introduction
5:17 - First question: how has your work and life changed since the pandemic?
6:36 - Whitney ISP catalog
7:44 - Braun text on the spirometer
8:56 - Rabinbach text on the human motor (thermodynamics, labor power and fatigue)
10:15 - Labor/work and fatigue in the body
11:17 - fields' mapping of physics and science and its origins
14:37 - Second question: how did the spirometer become the starting point for fields' research?
15:40 - fields' relationship with science
18:12 - Round 2
19:02 - News
19:42 - Twitter/Social Media
21:50 - Fact-checking
23:49 - Deepfake
25:21 - Taco Tuesdays
25:57 - Texas and High School
27:05 - Moving to New York in 2011
27:28 - The Black Beyond Zoom Artist Talk reference
28:52 - Y2K's casual podcast format
30:32 - Community college (finding photography and food ads)
33:23 - Photography at UNT
35:37 - Road trip / couch surfing to New York
37:43 - Working in urban food start-ups and problem solving
39:10 - Beginnings of a career as an artist
39:31 - UNT thesis
40:49 - Food and advertisement
44:31 - Produce Manager and researching solutions
49:06 - Back to school
52:07 - Types of work at UPENN
55:07 - Performing with acoustic levitation
56:53 - S-CURL in high school
1:00:43 - Performance at UPENN using S-CURL
1:03:48 - Reaction to performance
1:05:09 - Medical theater introduction
1:06:40 - UPENN and medical history
1:07:35 - Paintings of medical theater
1:09:54 - Hogarth's The Reward of Cruelty painting
1:11:13 - Robert Thom painting
1:12:36 - J. Marion Sims' racist legacy
1:15:31 - "What remains is constant" by fields harrington
1:16:02 - Braun and Rabinbach texts
1:17:18 - The history of the study of fatigue for labor/work
1:19:01 - Benjamin Gould report
1:23:20 - Etienne-Jules Marey
1:25:07 - fields' essay as artwork
1:26:50 - COVID-19 and spirometer having similar biases
1:28:30 - Race table from Gould's report
1:30:23 - Statistics as surveillance
1:30:55 - Biases in science
1:32:57 - Approximation of a Mix performance question
1:37:40 - Protests and Uprising
1:41:31 - Future work
Tune in to Episode 04 of the PA Talks series with Habibeh Madjdabadi, a prominent Iranian architect of the younger generation. After graduating with a Master's degree in Architecture from Azad University of Tehran, she established her studio in Tehran in 2003. She has received numerous awards and recognitions, such as the Aga Khan Award, Tamayouz Women in Architecture, and many more. Her projects include House of 40 Knots, Barjeel Museum of Modern Arab Art and Approximation House. Madjdabadi's emphasis on material grows out of cultural and geographical context; she sees it as a poetic means of expression and regards human labor as an integral part of the creative process of design. Watch this podcast on YouTube: https://www.youtube.com/watch?v=J9dLg_Otk94 Listen on: Apple Podcasts: https://podcasts.apple.com/tr/podcast/pa-talks/id1503812708 Spotify: https://open.spotify.com/show/4P442GMuRk0VtBtNifgKhU Google Podcasts: https://podcasts.google.com/search/pa%20talks Support us on Patreon: patreon.com/parametricarchitecture Follow the platform on: Parametric Architecture: https://www.instagram.com/parametric.architecture/ PA Talks: https://www.instagram.com/pa__talks Website: https://parametric-architecture.com/patalks/
HOT TOPIC: Portrayal of scientists in pop culture. The public's impression of scientists (and of science as a whole) is often shaped by characters depicted on TV and in movies. How has the portrayal of scientists in the entertainment industry changed over time, and what can a show like The Big Bang Theory teach us about the strategic use of science in pop culture? Featuring: Ingrid Ockert, Science History Institute, and David Saltzberg, UCLA/The Big Bang Theory. Socialize with science on Twitter and Facebook using @ISGPforum. Disclaimer: The ISGP is a nonprofit organization that does not lobby for any position except rational thinking. Podcasts within the "Hot Topics Series" (Episodes 75+) reflect the views expressed by featured guests. For information on The Forum, please visit www.ISGPforum.org, and to learn more about the ISGP, check out www.scienceforglobalpolicy.org.
Kara Walker's artwork The Emancipation Approximation is on view at the Figge until August 27. This series of 27 silkscreen prints features the provocative silhouettes for which Walker is known.