Early bird discounts for the San Francisco World's Fair, the biggest AIE gathering of the year, end today - prices will go up by ~$500 tonight so do please lock in ASAP!

From near-universal AI tool adoption inside Shopify to internal systems for ML experimentation, auto-research, customer simulation, and ultra-low-latency search, Mikhail Parakhin joins us for a deep dive into what it actually looks like when a 20-year-old, $200B software company goes all-in on AI. We cover why Shopify has become much more vocal about its internal stack, what changed after the December model-quality inflection, and why the real bottleneck in AI coding is no longer generation, but review, CI/CD, and deployment stability.

We also go inside Tangle, Tangent, SimGym, which are three major AI initiatives that Shopify is doing to make experimentation reproducible, optimization automatic, customer behavior simulatable, and search and catalog intelligence faster and cheaper at scale. Along the way, Mikhail explains UCP, Liquid AI, and why token budgets are directionally right but often measured badly, why AI-written code can still increase bugs in production, what makes Shopify's customer simulation defensible, and what he learned from the Sydney era at Bing.

We discuss:

* Mikhail's path from running a major Microsoft business unit spanning Windows, Edge, Bing, and ads to becoming CTO of Shopify
* Why Shopify is talking more publicly about AI now, and why staying at the frontier has become necessary for the company
* Shopify's internal AI adoption curve, the December inflection, and why CLI-style tools are rising faster than traditional IDE-based tools
* Why Jensen Huang is directionally right on token budgets, but raw token count is still the wrong way to evaluate engineering output
* Why the real unlock is not more agents in parallel, but better critique loops, stronger models, and spending more on review than generation
* Why AI coding can still lead to more bugs in production even if models write
cleaner code on average than humans
* Why Shopify built its own PR review flow, and why Mikhail thinks most off-the-shelf review tools miss the point
* How PR volume, test failures, and deployment rollback are becoming the real bottlenecks in the agent era
* Why Git, pull requests, and CI/CD may need a new metaphor once code is written at machine speed
* What Tangle is, and how Shopify uses it to make ML and data workflows reproducible, collaborative, and production-ready from the start
* Why Tangle is different from Airflow, and why content-addressed caching creates network effects across teams
* What Tangent is, and how Shopify is using auto-research loops to optimize search, themes, prompt compression, storage, and more
* Why Tangent is becoming a democratizing tool for PMs and domain experts, not just ML engineers
* Why AutoML finally feels real in the LLM era, and where auto-research still falls short today
* Why Tangle, Tangent, and SimGym become much more powerful when combined into one system
* What SimGym is, why simulated customers only work if you have real historical behavior, and why Shopify's data gives it a moat
* How SimGym evolved from comparing A/B variants to telling merchants what to change on a single live storefront to raise conversions
* Why customer simulation is so expensive, from multimodal models to browser farms to serving and distillation costs
* How Shopify models merchant and buyer trajectories, runs counterfactuals, and thinks about interventions like discounts, campaigns, and notifications
* Why category-level behavior is so different across commerce, and why ideas like Chinese Restaurant Processes are showing up again in practice
* Shopify's new UCP and catalog work, including runtime product search, bulk lookups, and identity linking
* Why Shopify is using Liquid AI, and why Mikhail sees it as the first genuinely competitive non-transformer architecture he has used in practice
* Where Liquid already works inside Shopify today, from low-latency
query understanding to large-scale catalog and Sidekick Pulse workloads
* Whether Liquid could become frontier-scale with enough compute, and why Shopify remains pragmatic and merit-based about model choice
* Who Shopify is hiring right now across ML, data science, and distributed databases
* The Sydney story at Bing, why its personality was not an accident, and what Mikhail learned from deliberately shaping AI character early on

Mikhail Parakhin

* LinkedIn: https://www.linkedin.com/in/mikhail-parakhin/
* X: https://x.com/MParakhin

Timestamps

00:00:00 Introduction: Mikhail Parakhin, Microsoft, and Shopify
00:01:16 Why Shopify Is Talking More About AI
00:02:29 Internal AI Adoption at Shopify and the December Inflection
00:06:54 Token Budgets, Jensen Huang, and Why Usage Metrics Can Mislead
00:10:55 Why Shopify Built Its Own AI PR Review System
00:12:38 AI Coding, More Bugs, and the Real Deployment Bottleneck
00:14:11 Why Git, PRs, and CI/CD May Need to Change for Agents
00:18:24 Tangle: Shopify's Reproducible ML and Data Workflow Engine
00:21:19 Why Tangle Is Different from Airflow
00:26:14 Tangent: Auto Research for Optimization and Experimentation
00:30:07 How Tangent Democratizes Experimentation Beyond ML Engineers
00:33:06 The Limits of Auto Research
00:36:36 Why Tangle, Tangent, and SimGym Compound Together
00:37:20 SimGym: Simulating Customers with Shopify's Historical Data
00:42:47 The Infra Behind SimGym
00:46:00 Why SimGym Gets Better with Real Customer History
00:47:30 Counterfactuals, HSTU, and Modeling Merchant Trajectories
00:51:55 CRPs, Clustering, and Category-Level Customer Behavior
00:53:30 UCP, Shopify Catalog, and Identity Linking
00:55:07 Liquid AI: Why Shopify Uses Non-Transformer Models
00:59:13 Real Shopify Use Cases for Liquid
01:03:00 Can Liquid Scale into a Frontier Model?
01:09:49 Hiring at Shopify: ML, Data Science, and Databases
01:10:43 Sydney at Bing: Personality Shaping and AI Character
01:13:32 Closing Thoughts

Transcript

[00:00:00] swyx: Okay.
We're here in the studio, a remote studio, with Mikhail Parakhin, CTO of Shopify. Welcome.

[00:00:08] Mikhail Parakhin: Thank you. Welcome.

[00:00:10] swyx: I don't even know if I should introduce you as CTO of Shopify. I feel like you have many identities. Uh, you led sort of the, the Bing ML team, I guess, uh, uh, or ads team. I, I don't know, I don't know, uh, you know, it's, uh, people va-variously refer to you as like CEO or, or, uh, I don't know what that, that, that said previous role at Microsoft was.

[00:00:29] Mikhail Parakhin: Uh, that was... Yeah, my previous role w- at Microsoft was the-- I actually was the CEO of one of Microsoft's business units, which included, as I, you know, as we discussed, all the things that people like to laugh about, uh, including Windows and Edge and Bing and ads and everything.

[00:00:47] swyx: Yeah, yeah. What a, what a, what a wild time.

You've obviously, uh, done a lot since you landed at Shopify. Uh, one of the reasons I reached out was because you started promoting more sort of internal tooling, uh, primarily Tangle, but also a lot of people have seen and adopted Tobi's QMD, uh, and obviously, I think, uh, Shopify has always been sort of leading in terms of, uh, engineering.

I think more-- it's just more recent that you guys have been more vocal about your sort of AI adoption. Is that, is that true?

[00:01:16] Mikhail Parakhin: Well, I think AI tools in general are a fairly recent development, uh, and we've-- Shopify, you know, at this stage of its development, we're developing AI in-in-house and other, uh, building tools that use AI and, you know, interfacing with the wider AI community, uh, you know, are on the sort of the, uh, runaway trajectory.

So it's just a sort of natural byproduct. We, we talk about it more also.
We just, uh, just even yesterday, Andrej Karpathy was famously tweeting about, oh, are there some, uh, ways, uh, that, that you can organize your agents to store the data and then, uh, look up the data so that you don't have to research or, or lose context every time. And a little bit tongue in cheek, I tweeted that, “Hey, we've, we've done it much earlier, and we even have different approaches, Tobi and I.” Tobi, of course, is a big fan of QMD, and I'm more of a SQL, SQLite fan. But, uh, yeah, very similar things that we've already done here. The point is, yeah, we're very dynamic, you know, explosively growing company, and we have to be at the forefront of AI adoption, obviously.

[00:02:29] swyx: Yeah. Yeah. Um, you, your team kindly prepared some slides actually that we were gonna bring up on to, uh, the screen. I think I can, I can screen share, and then we can kind of go through some of the shocking stats that maybe, maybe put some numbers to what exactly is going on. So here we have, uh- An internal AI tool adoption chart.

What are we looking at here?

[00:02:54] Mikhail Parakhin: Yeah, this is a very interesting statistic. Uh, this is number of daily active workers, you know, think of, uh, DAU, basically the active users of-

[00:03:05] swyx: Yeah ...

[00:03:05] Mikhail Parakhin: AI tool as a percentage of all the people in the company, right? And then- Yeah ... different AI tools. And, uh, you could see two things here is that one is the green is total.

Uh, green is just total. So you could see that it approaches really % by now. It's hard to do your job now without interacting deeply with at least one tool.
You could see another interesting thing is just as many people commented in December was the phase transition when suddenly models got good enough that, that everything took off and started growing.

Uh, it, it was many people noticed that the thing is that small improvements accumulated into this big change in Sep- December roughly timeframe.

[00:03:52] swyx: Yeah.

[00:03:52] Mikhail Parakhin: The other thing I would claim you could see is that, uh, CLI-based tools and tools that don't require you to look at the code are becoming more popular, and you could see, yeah, various versions of, uh, Claude Code and Codex and Pi and internal development tools taking off.

Uh, exactly, yeah, uh, and blue is our River, just internal agent for coding, where tools, uh, that require IDEs such as, uh, GitHub Copilot or Cursor, they're not exactly shrinking, but they're not growing as fast. Like, uh, red, red line is, is the IDE kind of tools. So you could see that they're, they're not experiencing as, as fast of a growth.

[00:04:37] swyx: As I understand it, basically, every employee has their choice, right? Of choose whatever tool you use, and then you're just kind of doing a, a daily sur-survey or something.

[00:04:47] Mikhail Parakhin: Exactly. And, uh, we- Yeah ... the, the push is to get your job done, you can use any tool, and we effectively fund unlimited tokens for everybody.

Uh, we, we do, we do try to control the models that, uh, people use, but from the bottom, not from top. Like we basically say, “Hey, please don't use anything less than Opus 4.6.”

[00:05:09] swyx: Oh.

[00:05:10] Mikhail Parakhin: Some people, some people end up using GPT-5.4 extra high. Some people use Opus 4.6. Um, uh, you know, uh, there are some, uh, there are pluses and minuses in going for full one million context window versus not.

But, uh, we try to discourage people from using anything less than that.

[00:05:28] swyx: Yeah, yeah. Got it, got it.
Uh, I mean, uh, that's, you know... The, the next chart here, it really kind of shows the expansion and the sort of December 2025 inflection, right? That, uh, people are using a lot of tokens. I think it's also really interesting that no one was kind of abusing it in 2025.

Like it was- Had comparatively, uh, to this year, there was almost no growth. I mean, it's still like, you know, probably, probably gave fifty percent.

[00:05:56] Mikhail Parakhin: Yeah. This is just a different scale. It's still exponential- Yeah, yeah ... growth at just a different- ... rate of expansion. Uh, there was inflection point, and Shawn, I would claim the, the super interesting part here is that you could see that the distribution is becoming more and more skewed.

Yes. The top percentiles grow faster. So that means- Yeah ... the people in the top ten percentile, they, their consumption grows faster than seventy-five and so forth. So, uh, the distribution skews more and more towards the highest users, which is... I don't know what it tells me. It's like it feels not ideal, to be honest.

Or maybe it's okay. We'll see.

[00:06:36] swyx: Why does it feel not ideal? Is, is it because of, um, quantity over quality, or what's the concern?

[00:06:42] Mikhail Parakhin: Because take it to the limit. That means, you know, if, if this rate of separation continued- Ah, yes ... a year, there will be one person consuming all the tokens. So it's just, it's kinda strange.

[00:06:54] swyx: Yeah, I mean, um, uh, I, I think internal like teaching and all that, uh, will, will help sort of distribute things more widely. But in, in the early days, of course, the people who are sort of more AI-pilled will obviously find more ways to use it than the people who are less AI-pilled. Maybe let's, let's call it that.

I'll just, I'll just kinda quickly, uh, pause from the, the...
You know, we will go back to the rest of the slides, but I just wanna, um, review, you know, there are a lot of CTOs of, of large companies like yourself where they're all considering some kind of token budget, right? Like I think it's something, something that Jensen Huang has been talking about, where like if your 200K engineer is not using 100K of tokens every year, like they're, they're underutilizing coding agents.

Of course, Jensen Huang would say that, but like it seems a very quantity over quality approach and like some, some people are basically saying like, well, is this comparable to judging engineer quality by lines of code, right? Which we also know is like kind of flawed, but better than nothing. So I, I don't know if you have like a sort of management take here on, on how to view this kind of, uh, metrics.

[00:08:02] Mikhail Parakhin: Well, I mean, you're, you're baiting me. I, I like... This is my favorite topic. Uh, if you let me, I'll probably talk for two hours on just this. I have a lot of things to say. Like I do think Jensen has gotten a lot of bad press saying, “Oh, of course you're, you know, this, uh, the- ... the cake seller says you don't need enough cakes.”

You know? Like, of course. Uh, but, uh, I actually, uh, think that's undeserved. I think he, he's actually right. Uh, I do think-

[00:08:33] swyx: He, he's directionally correct.

[00:08:35] Mikhail Parakhin: Yeah. Yeah. He's directionally correct for sure. Uh-

[00:08:37] swyx: Who knows what the right number is? Yeah.

[00:08:39] Mikhail Parakhin: The thing that I do, uh, want to say, and this is something that we learned through trial and error and very important is like two things.

One is that it's not about just consuming tokens. Uh, you can consume tokens and, and in fact, the anti-pattern is running multiple agents, too many agents in parallel that don't communicate with each other. That's almost useless, uh, compared to just fewer agents and burns tokens very efficiently.
Uh, setting up the right critique loop, especially with the high quality models, where one agent does something, the other one, ideally with a different model, critiques it, uh, suggests ways to improve it, the agent redoes it with this critique and, and so it takes much longer.

So people don't like it because latency goes up. You know, they, they have to wait until this debate is happening. But, uh, the quality of the code is much higher. And another thing, just since you mentioned like, look, uh, uh, yeah, the overall budget is just like, uh, lines of code. Lines of code are exploding for everybody right now, partially because AI is really more verbose, but partially just because AI can write a lot more code, you know, doesn't get tired.

And so you have to have a very strong narrow waist during PR review. Otherwise, just the number of bugs will go through the roof. It's, uh, it's this unexpected consequence of the just volume trumping everything. I would claim by now a good model writes code on average with fewer bugs than, than the average human.

But since they write so much more of it, like more of it will make it into production. So you have to-

[00:10:26] swyx: You still have more bugs.

[00:10:26] Mikhail Parakhin: Yeah. Have to have very rigorous PR reviews, also automated of course. But, uh, yeah, that to spend a lot of budget there. Like this, this for me, for me, actually, the important metric is the ratio of budget spent during code generation versus, uh, spent, uh, expensive tokens like GPT, uh, 5.4 Pro or, uh, uh, Deep Think from Gemini, you know, checking on PR reviews.

[00:10:55] swyx: Yeah, totally. Uh, I noticed in your chart you didn't have any review tools. Do you just use like, like let's say a Claude Code to review? Or do you have another set of review tools like the Greptiles, the CodeRabbits, uh, Devin Reviews has a review tool.
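A minimal Python sketch of the critique loop Mikhail describes above: one agent produces a draft, a second (ideally a different model) critiques it, and the first redoes the work with that critique. The function names, the `max_rounds` parameter, and the toy generator and critic are hypothetical stand-ins for real model calls. This is not Shopify's implementation.

```python
from typing import Callable, List, Optional

Draft = List[str]


def critique_loop(
    generate: Callable[[str, Optional[Draft], Optional[str]], Draft],
    critique: Callable[[str, Draft], Optional[str]],
    task: str,
    max_rounds: int = 3,
) -> Draft:
    """Alternate a generator and a critic until the critic has no objections."""
    draft = generate(task, None, None)           # first attempt, no feedback yet
    for _ in range(max_rounds):
        feedback = critique(task, draft)         # ideally a different, stronger model
        if feedback is None:                     # critic is satisfied
            break
        draft = generate(task, draft, feedback)  # redo the work using the critique
    return draft


# Toy stand-ins for model calls (assumptions for illustration only).
def toy_generator(task: str, previous: Optional[Draft], feedback: Optional[str]) -> Draft:
    if previous is None:
        return [f"implement {task}"]
    return previous + [f"address: {feedback}"]


REQUIRED = ["empty input", "unit test"]


def toy_critic(task: str, draft: Draft) -> Optional[str]:
    text = " ".join(draft)
    for item in REQUIRED:
        if item not in text:
            return f"missing {item}"
    return None  # nothing left to object to


final_draft = critique_loop(toy_generator, toy_critic, "parse_price")
print(final_draft)
```

The agents take turns sequentially, so latency grows with the number of rounds while token volume stays small. That is the trade-off discussed in the conversation.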
I don't know if you've had those specialist review tools.

[00:11:13] Mikhail Parakhin: You are a little bit jumping on my sore toe right now because in the graphs I was only showing public tools. Uh, uh, the-- I haven't found a good PR review tool that, that does what I think should be done. And, uh, partially my, my thinking is because it's so... It just goes against both what people feel like emotionally they prefer and, uh, some of the, uh, you know, frankly even business models that, that the companies run.

At PR review, uh, time, you want to run the largest models. That means, I don't know, Codex or, or, uh, Claude Code is not gonna cut it. You need to have pro-level models if you really want to, uh, stem the tide of bugs from going into production. And you need to spend a lot of time, the models taking turns, but you don't want, like, a big swarm of, uh, of, uh, agents.

So in fact, you end up in a different dual-dualistic world where you generate not that many tokens. You, in fact, generate few tokens, but it takes f-a long time because these are expensive models taking turns rather than many, many agents trying to do many things in parallel. So that's, that's why I feel like I haven't found good tools, so we are using our own for PR review for now.

[00:12:33] swyx: Yeah. Yeah. I mean, uh, I think a lot of companies are building their own, uh, especially to their needs, right?

[00:12:38] Mikhail Parakhin: Mm-hmm.

[00:12:38] swyx: Um, I, uh, you also have a chart here going back to the slides on, uh, PR merge growth, where we're now at thirty percent, uh, month on month rather than ten percent. Uh, and also the, the estimated complexity is going up.

You know, this is productivity, right? ‘Cause y- presumably there's more stuff going into the code base and more, more features getting worked on. I'm curious about the backlog, right?
Like the, the, the-- I actually don't mind a pro-level model taking an hour or two hours to review my PR, because I've dealt with humans who take a week to review my PR, right?And I keep pinging them on Slack, “Hey, hey, review my PR.” So, you know, I think there's some trade-off here where, like, it still doesn't make sense.[00:13:18] Mikhail Parakhin: Exactly. That, that's exactly m-my point. Uh, that on one hand, you can tolerate longer latencies at, uh, PR. On the other hand, like right now, the real problem is not in spending time waiting for PR.It's real problem is since there's so much more code than- Yeah ... uh, probability of at least some tests failing going up, and then you, like, keep de-failing, then you have to find the offending PR, evict it, retest it without that PR, and so deployment cycle becomes much longer. Uh, so it actually, in terms of the overall time to deploy, it's total time savings if you spend more time on a longer model, like thinking for an hour, because then, then you, you don't have to spend all that time during testing and rolling, you know, rolling back the deployment.[00:14:03] swyx: Yeah, totally. That's still worth it. You know, you don't look at the individual, look at the aggregate, and look at the, the, the change in the aggregate system.[00:14:11] Mikhail Parakhin: Exactly.[00:14:11] swyx: I'm kind of curious if, like, there's this PR mentality and, like, c-- the, the, the CICD paradigm will be changed eventually. Some people are like, obviously a lot of people want new GitHub, but I even wonder if, like, Git is the problem, right?Like, is that the bottleneck? Is the concept of a PR a bottleneck? Do you guys use stack diffs? I don't know if, uh, that's a, like, a merge queue stack diff type of thing.[00:14:34] Mikhail Parakhin: We, we use, we use Stacks, we u- we use Graphite. We worked with, uh, Graphite a lot. Uh, so we use Stack, uh, PRs. 
I think, uh, like that's clearly the overall CI/CD in general, and the interaction with the code repository right now is the, clearly the sort of the, the main issue and the bottleneck for us, uh, and highest top of mind.

I would say we probably need a different metaphor or different whole design of how to process it in new agentic world. I haven't seen anything dramatically better yet. I, I think everybody right now is just trying to keep their head above the water ‘cause, ‘cause there, there's so many PRs and then everybody's CI/CD pipelines start creaking, the, the times are increasing, the number of bugs slipping by increasing, and you have to, have to clamp down.

And so we are a little bit in this situation when we need to first stabilize that story and then start thinking, hey, what, what it could be a completely different and new world, which I haven't... I know some people working on it. I haven't seen something, like anything super compelling yet, but clearly the old things that were designed for humans will need to be morphed into something new.

[00:15:53] swyx: One of the things that I, I think about is kind of like the merge conflict is basically a global mutex on the whole system, right? And in, in hu- in human organizations, we do have something like that. It's the company standup. But like, other than that, it's like it's actually fitting for us to be somewhat decentralized, somewhat plugged into one stream of information source, but somewhat lossy.

Like it's okay, you know, that, that not every delivery is like atomic consistency. Like we're not dealing with a database sometimes.

[00:16:27] Mikhail Parakhin: This is a very good point, uh, because since humans don't write code too fast, you know that global mutex is not too bad. Once you-

[00:16:36] swyx: Yes ...

[00:16:37] Mikhail Parakhin: start writing code at the speed of machine, it becomes the, you know, the bottleneck.

Then what do you do?
Maybe, and I can't believe I'm saying this because I, I'm long-- lifelong opponent of, uh, microservices, and I always thought that was, like, a really bad idea. And now that you're saying it, like, maybe in a new guise, microservices will make a comeback, you know, because then you, you can ship things independently in tiny things and, and the managing all that complexity automatically will be much easier.

I don't know. Like, we'll s-- we'll have to see.

[00:17:10] swyx: Yeah. I mean, I don't know what the Microsoft or, or Shopify thing is, but I, I read this paper from Google where they have a monorepo that deploys into microservices, right? And then, uh, the other concept that I think about a lot is the Chaos Monkey concept from, from Netflix.

Being able to create, like, this robust system where, um, uh, you know, you, you have the service discovery, you have the, uh, the independent, independent microservices discovery and, and, uh, you know, probably going to be a fair amount of duplication. That's how an organic system sort of scales, uh, that, that you have that...

I don't know how you call it. Slack? Robustness? Depend-- uh, d-duplication. I, I, I forget the-- I, I'm-- And this-- those-- these are not exactly the terms- Hmm ... I'm looking for, but I c-can't really think of the words. Okay. I was gonna go into Tangent and Tangle. Uh, so, uh, we, we sort of discussed the overall stats that, uh, Shopify has.

Uh, but, you know, I, I think some, some pretty cool stuff that you guys are working on is your ML experimentation, uh, and your, your sort of auto tr-research training pipeline. Presumably you're much closer to this one because it's, it's a sort of personal hobby of yours. How, how would you explain them in, together?

I thought we have a slide that, like, uh, has the s- the system diagram.

[00:18:24] Mikhail Parakhin: Yeah. Tangle first and then Tangent as a-

[00:18:27] swyx: Yeah ...

[00:18:28] Mikhail Parakhin: as a thing on top of Tangle.
And, uh, Tangle is the third generation, I claim, of, uh, systems of, uh, running any data processing, but a bit with a skew for ML experiments, but not necessarily. Any sort of data processing tasks where you need to iterate, share, and you have scale so that you want maximum efficiency.

You know how, like, normally you would work, you would-- Imagine you're a data scientist or an ML practitioner, you would get Jupyter notebooks or, or maybe you would get, uh, you know, Pyth- your Python scripts, and you would manage the data, and you produce those TSV files, and you put them in some JFS or something.

Then you would notice that, oh, it has this, uh, weird missing values. You go and write another script that, uh, goes and replaces them with, uh-

[00:19:20] swyx: Ah ...

[00:19:21] Mikhail Parakhin: dash S. And then, then you, then you run some, some, uh, “Oh, I need to filter bots.” And so you run some LightGBM model that, uh, removes the bots. And then, then you like-- And then you, you kind of like get into shape, and then you start experimenting, and you run multiple experiments, and then you're like, “Oh my God,” like, “this experiment is worse.”

You undo, and you cannot get to previous result. And like, “Ah, what did I do?” Like that. Again, then, then you finally like get everything working. Then you like start throwing it over the fence to production. You, you replicate it, those things don't work, and then sometimes you like don't notice that you forgot some feature naming and the, the features don't match.

But then, like imagine you, you did everything, and then six months later you're like, have to repeat it because now there's more data, or you wanted to do another pass, and you're like, “What, what did I do?” Or like, or like, “This script crashes now,” or the, “the path has changed.” And then, then you're trying to, like you spend another month just doing ar- digital archeology on your own, you know, history, right?

Now multiply that by many, many teams.
Now imagine you got an intern that you wanna ramp up. Now you have to show that intern, “Oh, you know, look, here's the folder, there's the scripts, you know, ask your Claude agent to do, and then, uh, to, to figure it out.” And then Claude agent does something, and then you're, “Ah, yeah, right, right, it was the wrong folder.

I forgot to tell you, I actually have this other thing I forgot myself.” And, and that's, that's the, like, the daily life we all, uh, all know it, uh, if, if you're a data scientist, machine practitioner, ma- machine learning practitioner or, uh, or even like any data managing, uh, person.

[00:21:00] swyx: Yeah. So I, I used to do this, uh, f- uh, on the quant finance side, uh, in, in my hedge fund.

So we did this before Airflow, and then, uh, obviously Airflow came along and, uh, then more recently Dagster, uh, I would say is like, in my mind, what I would use for that shape of problem, uh, where you had to materialize assets and create a pipeline.

[00:21:19] Mikhail Parakhin: And that's, that's very good segue because... So Airflow is great, but Airflow is more about you, you have something and you wanna repeatedly run it in production on schedule.

It's less about you as a team developing things and being able to share, and you grabbing the standard pipeline and saying, “Hey, I wanna change this tiny little component in the huge sea of data processing, and I don't wanna-- I wanna run ten experiments on this, and I wanna do hyperparameter optimization.”

All that is very hard to do with Airflow. It's very easy to do with Tangle. Tangle is m- more about, it's everything about group of people running experiments, it might be agents too nowadays. Uh, running experiments cheaply, collaborating, sharing results. Uh, you don't need to understand fully. You, you grab-- you clone somebody else's experiment or somebody else's pipeline, uh, run, uh, change small piece, run it, be, like, get it to production state, and then ship in one click.

So then the...
You don't have to port it into any other system to, to run in production. You can just run the same experiment. It's, it's fully production ready. And, and it's, uh, it has lots of... Again, as I said, it's third generation system. The original one was, I would claim there was Aether and then, uh, at least in my career, Aether was the first, first, uh, that pioneered this type of approach.

And then there was, uh, Nirvana, which, uh, uh, at Yandex, which did kind of sec-second take on this. And now this one aggregates the, the learnings from all of those and, and Airflow as well to, to get to the state where you try it, it, it feels kind of magical. Uh, ‘cause now everything is based on content, uh, hashes.

So even if the version changed, but if the output didn't change, nothing is being rerun. It's very efficient. If you... Multiple people start experiment that needs the same sort of data preprocessing, it's not repeated multiple times. It's automatically done only once. If you start ten experiments that all require, you know, some, some data preparation first as the first step, and you don't have to coordinate for that.

Like, you don't have to know that other people are starting it. You know, it's very easy compos-, uh, composability, any language you can u- uh, you wanna use, and it's very visual. So you can see immediately, you can edit it easily, you can assemble small things with just even mouse clicks if you want to, and, uh, share, clone.

And everybody knows also it's fully kind of static in the sense that if you rerun it a second time, it will exactly have the same results. Like, you will never have to do digital archeology. So full versioning and everything is also there.

[00:24:06] swyx: Uh, so, so people can, uh... It's open source. Go to the GitHub repo and, and, uh, check it out.

Uh, and there is also a really good, uh, blog post about it. I think all of this is, like, really appealing.
The, the, the, the thing that I think sells me the most about it is that, um, sort of development to production transition, right? Which I think, um, a lot of people haven't really solved that, uh, strictly, right?

Like, we develop really, really well in, in Python notebooks, but then, you know, that's obviously not a sort of production ready process. I think that, like, any way in which that is solved, I think is, is very appealing. Then the other thing that you mentioned, which also raised my eyebrows, was content-based caching, which you mentioned is, is, um, you know, is ve-very much, uh, um, a sort of efficiency measure about, uh, you know, just like recalculation only on, on sort of content addressing, which I think makes sense.

Uh, it surprised me that the savings could be this much, but maybe I just haven't worked at your scale where there's so much duplication, uh, that people just rerun because they change a single ID upstream.

[00:25:10] Mikhail Parakhin: It does, yeah. But it's not only you rerun. The, the main savings are coming from the fact that you ran it, you got your job done, and you moved on.

Then- Yeah ... somebody else in some department you don't know existed runs the same task, but on a newer version.

[00:25:27] swyx: Yeah.

[00:25:27] Mikhail Parakhin: Like right now, you can't, in, in most of the organizations, you can't even find out about it so that you can't even measure that you're spending that time twice, right? Here- Yeah ... if everybody's on Tangle, that's detected automatically and detected that the output is the same.

And then for that person, all it looks like is like experiment just suddenly moved, jumped forward, right? Uh, uh- Yeah ... so that's because, because the, there's network effect of multiple people helping each other.

[00:25:51] swyx: Yeah.
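A toy Python sketch of the content-addressed caching idea discussed above. Each step is keyed by its identity plus hashes of its input contents, so if an upstream step changes version but produces byte-identical output, downstream steps are served from cache, including for someone else who submits the same work. The `ContentCache` class, the step names, and the sample data are illustrative assumptions. This is not Tangle's actual implementation or API.

```python
import hashlib
import json
from typing import Callable, Dict


class ContentCache:
    """Toy content-addressed step cache shared by everyone who submits work."""

    def __init__(self) -> None:
        self.store: Dict[str, bytes] = {}  # cache key -> output bytes
        self.executions = 0                # how many step bodies actually ran

    def run(self, step_id: str, func: Callable[..., bytes], *inputs: bytes) -> bytes:
        # Key on the step identity and on the *content* of the inputs,
        # not on which upstream version or which person produced them.
        # step_id stands in for a hash of the component's code and config.
        input_hashes = [hashlib.sha256(data).hexdigest() for data in inputs]
        key = hashlib.sha256(json.dumps([step_id, input_hashes]).encode()).hexdigest()
        if key not in self.store:
            self.executions += 1
            self.store[key] = func(*inputs)
        return self.store[key]


def clean_v1(raw: bytes) -> bytes:
    return raw.replace(b"NA", b"0")


def clean_v2(raw: bytes) -> bytes:
    # Newer version of the step; on this data it yields the same bytes as v1.
    return raw.replace(b"NA", b"0").replace(b"N/A", b"0")


def train(cleaned: bytes) -> bytes:
    return b"model:" + hashlib.sha256(cleaned).hexdigest()[:8].encode()


cache = ContentCache()
raw = b"1,NA,3\n4,5,NA\n"

model_a = cache.run("train", train, cache.run("clean@v1", clean_v1, raw))
# Upstream version changed, output did not: "train" is not rerun.
model_b = cache.run("train", train, cache.run("clean@v2", clean_v2, raw))
# A colleague repeats the original pipeline: nothing runs at all.
model_c = cache.run("train", train, cache.run("clean@v1", clean_v1, raw))

print(cache.executions)
```

Because the cache is shared, the saving grows with the number of people using it, which is the network effect Mikhail points to.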
This is one of those things where it's designed to be a platform from the beginning rather than an individual developer's tool, right? And everything streams down from there. That is the Tangle orchestrator, and it manages jobs. We've seen a few versions of this, and this is obviously the unique approach that you guys have figured out. And then there's Tangent.

[00:26:14] Mikhail Parakhin: Yeah. And Tangent is basically an automatic auto research loop that can help and kind of do your work for you. Effectively, Andrej Karpathy recently popularized it with auto research. Remember he said he was speed running this... yeah, you know the story. Here we're basically bringing the same capability into Tangle. Tangent is just an agent that can run multiple experiments, figure out what can be changed, and keep on rerunning and modifying until it optimizes some goal, some loss function, whatever you need to achieve. And in general, I would say if you're not using an auto research-like approach in whatever you do, literally whatever you do, then you're missing out. We saw at Shopify that it's taking off like wildfire; anything where you can put measurements can be done dramatically better. Our-

[00:27:19] swyx: Mm-hmm ...

[00:27:20] Mikhail Parakhin: speed of templatization of HTML, a completely new UX templatization, reducing latency for Liquid themes. Our search: recently we moved, it's hard to even quote, from eight hundred QPS to forty-two hundred QPS with the same quality, just by pure optimizations from an auto research loop that kept running and changing code in our index serve, on the same number of machines, just increasing the throughput. We managed to improve the quality of gisting and machine learning process. 
You know, gisting is the prompt compression technique that

[00:27:59] swyx: allows for

[00:28:00] Mikhail Parakhin: lower latency and, actually, slightly higher quality. So literally whatever, different walks of life, and it doesn't have to be AI related. We had a reduction in storage because the agents would go and find data sets that clearly are derivative, and then you don't need to store things twice. We found, somewhat embarrassingly, that one of the largest tables was hashing random IDs into another random ID, and we literally- Oof ... put only one. So it was translating, yeah, two random IDs hashed

[00:28:36] swyx: into

[00:28:37] Mikhail Parakhin: each.

[00:28:37] swyx: So it has access to the code as well, so it can check, like, what the hell is it doing?

[00:28:42] Mikhail Parakhin: So it could be run at two levels. At the superficial level, it could just use existing components and reshuffle them. You can grab XGBoost, and you can grab some PyTorch module, and then grab other tools and combine them. At a deeper level, since Tangle is all CLI based underneath, and every component is really a wrapped CLI call and a YAML file, it can analyze code and create new components and keep on iterating as well. So you can both have quick modifications of existing pipelines with the components that are already there, pre-baked, or you can create new components and-

[00:29:29] swyx: Yeah ...

[00:29:29] Mikhail Parakhin: keep iterating on those. So auto research is, again, probably the thing I was most excited about in the last two months, and we see it taking off totally like wildfire. Just, everybody, every day, every... 
well, every day, every minute, I would have somebody Slack message saying, “Oh, look how much better I made it.” And it's all through auto research.

[00:29:53] swyx: Is this democratized in some way, in the sense that, is it your ML engineers and researchers doing this, or do your regular PMs and software engineers also have the ability to use Tangent?

[00:30:07] Mikhail Parakhin: This is an awesome question. Tangle in general and Tangent in particular are extremely democratizing. They are the main tools for-

[00:30:15] swyx: ‘Cause I don't need the details.

[00:30:16] Mikhail Parakhin: Yeah. Exactly. Initially it was used by ML and AI engineers, but then literally, as you said, PMs... the highest user right now is one of the PMs in our org, Sartak. He was number one by usage of this, ‘cause they're just energetic and knowledgeable, and now it unlocks a lot of capability where you don't have to change code manually.

[00:30:39] swyx: I mean, it kind of cuts out the ML engineer from the process, because the PMs have the domain knowledge and the ability to think from first principles about, okay, what results do I want? And they even have access to the data that needs to go in. So in some ways, this is the magic black box that we've always wanted for training and for, I guess, hill climbing, whatever.

[00:31:04] Mikhail Parakhin: It's basically Claude Code for your AI development situation, right? Now you don't have to know exactly how algorithms work. You can just bring your domain knowledge and expertise and product knowledge and iterate within Tangent until you've gotten the results that you need.

[00:31:21] swyx: In my previous roles, every time that someone has pitched AutoML, you know, I've always been like, “Uh, this is not, this is not gonna work. 
It's always gonna be a flop.” Somehow it's working now. I mean, presumably the answer is now we have LLMs and it's good enough, right? It's an emergent property that we can do auto research. But it doesn't feel that satisfying: how come we didn't do this before, right? We just did parameter search and, I don't know, maybe that's it.

[00:31:48] Mikhail Parakhin: Yeah. Bayesian optimization and hyperparameter optimization was the facet of AutoML that was used very actively, which incidentally is also built into Tangle. But, you know, I know Patrice Simard very well, and he was such a proponent of AutoML; he literally spent a career trying to democratize it. Without LLMs, it just turned out to be very hard. You would have flexibility within a certain narrow domain, but it was hard to scale wider. And now with LLMs it's like a magic wand, and suddenly everybody is an AutoML expert.

[00:32:28] swyx: Yeah, I think it's multiple things, right? I'm just gonna bring up the chart again. LLMs can do the monitoring very well, which is potentially unbounded and super unstructured. They can do the analysis very well... basically, there is much more intelligence poured into every single step. Maybe nothing has structurally changed about AutoML, but this is just more intelligent and more unstructured.

[00:32:53] Mikhail Parakhin: Exactly.

[00:32:54] swyx: Any flaws that you've run into? Everyone is drinking the Kool-Aid: oh my God, time savings, performance improvements. What issues have come up?

[00:33:06] Mikhail Parakhin: This is really cool. It's not a solution to all the world's problems for sure. 
The limitations are usually the ones... and this is where we get into a bit of subjective territory. I can only share what I've seen so far, and I'm sure the situation is changing, and maybe after I say it, many people will reach out and say, “Hey, what about this? You don't know that,” and then they'll probably be right. But what I've seen is that auto research is very good at doing kind of obvious things that you don't have bandwidth to do, or you didn't notice, or maybe you're not aware of some standard practices. It is not good at doing something completely out of distribution, something where you have to think for multiple days and do something like none of this. So I set up an experiment once on my sort of hobby thing, and I let it run, and it ended up being a several-week run. It's full production kind of scale, so slow runs. In the end it performed over four hundred experiments, and only one was successful. I'm like, “Okay, that's good.” But-

[00:34:18] swyx: But it saved time.

[00:34:19] Mikhail Parakhin: Yeah, it saved time. It was that thing. If I were doing four hundred experiments myself, my batting average, as I said, would have been much higher, I'm sure. But also, first of all, it would take me like three years to do four hundred experiments. And I didn't have to do them. The machines did that for just the price of electricity. And I got one improvement. Honestly, when I was starting that experiment, my thinking was to go and show that, “Hey, Andrej, maybe you just don't know how to optimize.” And I was super smart, because my problem had been optimized for many years, and it was fully improved. I didn't expect auto research to find anything at all. Yet it did. 
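The loop described here (propose a change, run the experiment, keep it only if the metric improves, repeat) can be sketched as plain hill climbing. This is a toy illustration, not Tangent itself: in the real system an LLM agent proposes the changes, whereas the `propose` function below is a random stand-in, and `auto_research`, `evaluate`, and the `lr`/`batch` settings are all hypothetical.

```python
import random


def auto_research(evaluate, propose, config, budget, seed=0):
    """Keep a candidate only when it beats the best score seen so far."""
    rng = random.Random(seed)
    best, best_score = dict(config), evaluate(config)
    accepted = 0
    for _ in range(budget):
        candidate = propose(best, rng)  # stand-in for the agent's suggestion
        score = evaluate(candidate)     # stand-in for running the experiment
        if score > best_score:
            best, best_score = candidate, score
            accepted += 1
    return best, best_score, accepted


def evaluate(cfg):
    # Toy objective with its maximum at lr=0.1, batch=64 (made up for the sketch).
    return -((cfg["lr"] - 0.1) ** 2) - ((cfg["batch"] - 64) / 64) ** 2


def propose(cfg, rng):
    # Randomly perturb one setting; returns a new dict and leaves cfg untouched.
    new = dict(cfg)
    if rng.random() < 0.5:
        new["lr"] = max(1e-4, cfg["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0]))
    else:
        new["batch"] = max(1, cfg["batch"] + rng.choice([-16, -8, 8, 16]))
    return new


start = {"lr": 0.5, "batch": 16}
best, best_score, accepted = auto_research(evaluate, propose, start, budget=400)
print(f"accepted {accepted} of 400 proposals, score {best_score:.4f}, config {best}")
```

Most proposals are rejected by construction, which matches the shape of the anecdote: a low hit rate is acceptable when each attempt costs machine time rather than a person's time.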
So instead of making fun of Andrej, I ended up a big, big supporter. Yeah, that's exactly the tweet. Yes.

[00:35:10] swyx: You and Toby really go back and forth online a lot, which is really funny. Think of it as an eval for the optimalness of the code it's running on. It almost reminds me of a Kolmogorov complexity thing; I guess there's some optimal thing that you're trying to reduce down to. And so you should congratulate yourself that you had ninety-nine percent optimality.

[00:35:36] Mikhail Parakhin: Exactly, yeah. I think Andrej really deserves a lot of credit for popularizing this approach. This is incredibly powerful and cool, I think. Even him just mentioning it led to a lot of gains in a lot of places in the industry, so we should be thankful.

[00:35:56] swyx: Yeah. I think he also has just... I don't know what it is. It is a simple, self-contained project that people can take and apply to other things, which is one thing, but also just the name. Somehow no one managed to call their thing auto research. Naming things is very important. I think that is mostly our coverage of Tangle and Tangent. Obviously there's a lot of ML infra at Shopify that people can dive into. We're about to go into SimGym, but before I do that, any other broader comments around this whole effort? Where is it leading to?

[00:36:36] Mikhail Parakhin: As a segue to SimGym, all those things start composing strongly. You could see a huge unlock when you look at each one of the tools and you see, oh, they're extremely useful. Tangle is useful by itself. Auto research is useful by itself. SimGym is useful by itself. 
If you combine all three, you create a synergistic effect. I think that's why we wanted to cover them today: this is something that, if you go back even five years, would've been unthinkable. Replicating that would be either incredibly costly or impossible, right? Probably thousands of people would be required.

[00:37:20] swyx: Well, we have serverless intelligence, right? So yes, you do have thousands of intelligences, just not humans. And that's close enough, right? Even if they're not AGI, they're close enough to do the task that you need them to do. And that's plenty for a lot of routine knowledge work. Okay, let's get into SimGym. This is one of those things I was surprised to see is apparently one of your most popular launches. I think Sim AI, I think Joon Sung Park, who did the Smallville thing... there's a very small cottage industry of people trying to do the simulated-customer thing. I think a lot of people maybe don't super trust this yet, because they're like, well, obviously they would just do what you prompt them to do, right? But maybe tell us about the inspiration or origin story.

[00:38:10] Mikhail Parakhin: That's exactly the thing I wanted to cover, because if you don't have the historical data, all you can do is prompt agents in a vacuum, and they will do exactly what you prompt them to do. In fact, when I first proposed it, and this is a bit of my brainchild initially, if I can boast, even Toby said, “But wouldn't they just repeat what you tell them?” And I'm like, “Yes, except Shopify has decades of history of how people made changes and what it resulted in, in terms of sales.” So now what we can do is we can... we have this... 
It's noisy data. These are usually small websites, and things are never in isolation. It's almost never an A/B experiment. It's always an A/A experiment, which has two meanings, but basically at different times you run two different things. But if you aggregate everything together and apply a denoising, collaborative-filtering-like approach, you can extract a very clear signal. And then you can optimize your agents. And that's why it took so long. It took almost a year of that optimization, of just us sitting and fiddling. Our internal goal was to hit zero point seven correlation with add-to-cart events, for example, so that if we run a real A/B test experiment, it should replicate the same sort of success that humans had, or lack thereof. It took forever, and I don't think that's easily replicable, because who else would have that data? You have to have this historic, decades' worth of data. And the other thing you need is infrastructure and scale, right? Because, again, what we found is that for stat sig results you need to run a lot of simulations, a lot of agents, and those are expensive things. You're making actions in the browser because you want real friction. You want to be able to get the image of what humans will see, because you wanna detect effects like, “Hey, if I make my images larger, will I have more sales or fewer sales?” And usually people's intuition here, by the way, is that if I increase my images, I will have more, because they look nicer. Designers all like sparse layouts and big images. Usually your sales tank, right? But from the HTML, all the characters look the same; only the size tag looks different, right? So it's very hard. 
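The zero point seven target mentioned above is a correlation between what the simulated agents predict and what real shoppers did. As a rough sketch of that kind of check, the snippet below computes a Pearson correlation between simulated and observed add-to-cart lifts. The numbers are made up for illustration, the transcript does not say which correlation coefficient Shopify uses, and `pearson`, `simulated_lift`, and `observed_lift` are hypothetical names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up relative add-to-cart lift for six storefront changes:
# what the simulated shoppers predicted vs. what a real test measured.
simulated_lift = [0.02, -0.01, 0.05, 0.00, -0.03, 0.04]
observed_lift = [0.03, -0.02, 0.04, 0.01, -0.01, 0.02]

r = pearson(simulated_lift, observed_lift)
meets_goal = r >= 0.7  # the internal target quoted in the conversation
print(f"correlation = {r:.2f}, meets 0.7 goal: {meets_goal}")
```

A check like this only says whether the simulator ranks changes the way reality does; it says nothing about why, which is where the historical data does the work.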
So you have to take visual information, you have to run this in a simulated browser environment on a big farm, and of course you have to have a very expensive, good multimodal model. So all of this is what's taken so long. And to share my personal fail a little bit there, Sean: we always had this large-company bias. Whenever we do something, we're like, “Hey, we'll run an experiment,” right? We make a change, and we run an experiment and then see which one's better, or like, “No, this is worse,” and most of them are worse, so you discard it and keep iterating, hill climbing. And we're like, “Oh, smaller merchants cannot get stat sig results. They cannot really run experiments, simply because in a week there would not be enough data for them.” So we thought from this perspective. What we didn't realize is that most people don't have A and B; they just have one thing, and they need suggestions of what A and B should be. So we first built this: we run the simulation on two separate themes and say, “Hey, which one is better?” We then morphed it, and very recently released it, so that when you have just your site, your theme, we run over it and we say, “Hey, here's what the predicted values of conversions are, and here's how we think you should modify it to increase your conversions.” And then, circling back to what you started with, the proof is in the pudding. If we are not correlating with reality, people will not be using it. And thankfully, we see literally every day more users than the previous day. So right now, right now- It's working. Yeah. 
Right now my problem is how to pay for it all, so our major thing is how to optimize the LLMs, do distillation, and how to run the headless browsers, and headful browsers, cheaper, so that we can accommodate the increase in traffic.

[00:42:47] swyx: Yeah. I understand that you published a lot of technical detail at GTC, so I was just gonna bring it up a little bit. Was this in conjunction with some kind of GTC presentation, or something like that?

[00:42:59] Mikhail Parakhin: Well, yeah, we did it in several places, but yeah, we had the engineering blog as well.

[00:43:05] swyx: Yeah. So you're running GPT OSS.

[00:43:08] Mikhail Parakhin: This is an older version. Now we run a multimodal model. But yeah, we still run GPT OSS as well for

[00:43:15] swyx: And then you have the VMs, and you also have browser-based. I really like this one where you said, “It violates almost every assumption that standard LLM serving is designed for.” And then you had basically orders-of-magnitude differences between everything.

[00:43:29] Mikhail Parakhin: Exactly. Which was a bit of a challenge to implement, even simple things. Since it violates all the assumptions, for example, multi-instance GPUs, MIGs, don't work as well. But we needed to get MIG to work, ‘cause otherwise it's way too expensive. And so we had to deal with lots of infrastructure and work with Fireworks and CentML to help with optimizations, and browser-based, as you mentioned. Yeah, it takes a village.

[00:44:04] swyx: Okay. So there's a lot of experimentation in the infrastructure so far, and you've published more or less what you have here. I guess I'm less familiar with CentML. 
I don't do that much work in this part of the stack. But why was it the preferred inference platform?

[00:44:22] Mikhail Parakhin: There are really three probably top companies. There used to be three top companies, at least that I was aware of, that did LLM optimization: Together, Fireworks, and CentML, not necessarily in that order. CentML recently got acquired by NVIDIA. What they did is, if you have a model and you want to optimize it to a specific profile of usage, they would go and do it. We work with those companies; this work in particular was with CentML and NVIDIA, to get the best possible results out of it. And sometimes you have to retune: sometimes you want the maximum throughput, sometimes you want minimal latency, sometimes you want the cheapest, right? Or some combination. And so these are people who would come and help you.

[00:45:14] swyx: I see. Yeah, I'm familiar with these people for the LLM, you know, autoregressive stack. But the other interesting category of these optimizers is the diffusion people, like fal, and Pruna recently has come up a lot as well, which I think is really underappreciated, at least by myself, because I thought all the workload would be LLMs, but actually there's a lot of diffusion as well.

[00:45:38] Mikhail Parakhin: Exactly.

[00:45:38] swyx: There's a lot here, so it's hard to cover. But I do think people underappreciate the importance of customer simulation, basically. I think this is something that I'm candidly still coming to terms with. 
You know, your team also prepared this really nice diagram. I assume this is AI generated.

[00:46:00] Mikhail Parakhin: Yeah, it looks-

[00:46:01] swyx: Maybe it's not.

[00:46:01] Mikhail Parakhin: Yeah, it looks Gemini-ish. But honestly, I don't know where the hell they generated it. It looks like it's Google. But the interesting part, Sean, that we haven't covered but I wanted to mention: if your store had previous customers, rather than being a new store, a new merchant just launching things, it helps tremendously in correlation and forecast. We take your previous customers' behavior, and we create agents that replicate the specific distribution of customers that you get, and then we apply those to your changes, and that raises the correlation with add-to-cart events, or with conversion, or whatever it may be, quite dramatically. So replicating humans in general seems like an interesting, cool challenge.

[00:46:58] swyx: As a shareholder, I think if people are Shopify shareholders, they should really deeply understand this, because this is basically the moat. The more you use Shopify, the more it will just automatically improve, right? You're doing the job for them.

[00:47:13] Mikhail Parakhin: Yeah, that's what we started with. Otherwise, if you're just a startup... I wouldn't do it if it was my startup, because without the data, as you said, it's exactly the case that whatever you say in the prompt, that's what the agents will be doing.

[00:47:30] swyx: The statistician in me wants to really satisfy the statistical intuition, I guess. To me, the word that comes to mind is ergodicity. 
So let's say a customer takes this path, a customer takes this path, a customer takes this path, right? In my mind, the way I explain it is, okay, here's the ninety-fifth percentile, here's the fifth percentile, and here's the median, right? But to me, what SimGym is potentially doing is that it can model the in-between journeys as well, which maybe are dependent on the previous states. This may be a very RL-type conclusion, where if you only did naive A/B testing, you only have the statistics at a certain point, and you only judge based on the overall summary statistics. But here you can actually model trajectories. Does that make sense? Or-

[00:48:31] Mikhail Parakhin: That makes total sense. That makes even more sense than maybe even you realize, because-

[00:48:38] swyx: Okay. Please, please.

[00:48:38] Mikhail Parakhin: Yes. So internally we have this system; we talked about it briefly once at NeurIPS. We have a huge HSTU-based system that models whole companies and their possible paths. And what you are showing: at any point in time, you can either model the user's behavior, or you can also think about the whole merchant as a company, as an entity that acts in the world. You can model that as well. And then you can do counterfactuals. In your graph, in your blue graph, imagine that somewhere in the middle you have an intervention. I give that person a coupon, or, I don't know, I send a personal thank-you card, or give a discount somewhere. And then you can do forward rollouts from that counterfactual. So what would have happened with that intervention or without the intervention? 
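The idea of forward rollouts from a counterfactual can be illustrated with a deliberately tiny simulation. This is a toy stochastic shopper model, not the HSTU-based system described above; `rollout`, `conversion_rate`, and every probability here are invented for the sketch. Each simulated shopper is replayed with the same random seed with and without a coupon, so the only difference between the two trajectories is the intervention and the step at which it happens.

```python
import random


def rollout(seed, coupon_step=None, horizon=10,
            p_buy=0.05, p_leave=0.15, boost=0.10):
    """Simulate one shopper journey; return True if it ends in a purchase."""
    rng = random.Random(seed)  # same seed -> same shopper in every arm
    for step in range(horizon):
        has_coupon = coupon_step is not None and step >= coupon_step
        buy_prob = p_buy + (boost if has_coupon else 0.0)
        u = rng.random()
        if u < buy_prob:
            return True        # purchase
        if u > 1.0 - p_leave:
            return False       # shopper leaves
    return False               # ran out of steps without buying


def conversion_rate(n_shoppers, coupon_step=None):
    bought = sum(rollout(seed, coupon_step) for seed in range(n_shoppers))
    return bought / n_shoppers


baseline = conversion_rate(2000)
by_step = {step: conversion_rate(2000, coupon_step=step) for step in (0, 3, 6)}

for step, rate in by_step.items():
    print(f"coupon at step {step}: {rate:.3f} (baseline {baseline:.3f})")
```

Comparing `by_step` across steps is the "when to intervene" question in miniature: the same intervention has a different effect depending on where in the journey it lands.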
And you can even change where in time that intervention can happen, right? Somewhere in this journey. So we do this at Shopify scale for our merchants, and then if we notice something that they can be fixing, there's a strong counterfactual, we have a Shopify policy, they basically get a notification like, “Hey, we think something is wrong with your, I don't know, Canadian sales. It looks like it's misconfigured. Here's what you need to do.” Or, “We think you have to set up this campaign with these parameters.” And we do that at the buyer level, to literally offer discounts or cashback or things to buyers. So I'm getting very excited. This is my area of interest, I guess, and hobby. But being able to model something as complex as human beings or companies, and model counterfactuals on it, where you can have interventions in the future and optimize when to make an intervention and what kind of intervention to make: it's such an unlock that previously was completely impossible. It was always dreamed of, but never... how would you even simulate it without LLMs or HSTUs? I think these are very, very exciting times.

[00:50:59] swyx: I just wanted to maybe illustrate this. I'm not the best illustrator, but I am a conceptual statistics guy. And you know, you cannot just do this. This is a dimensionality that A/B tests don't do, right? Because they don't have the change-over-time, stochastic nature, and they don't have the contextual, like, here's all the context to this point. Okay, cool. That's SimGym. You're gonna burn a lot of tokens on this thing. But you're one of the only scale platforms in the world that can do this across a huge variety of workloads, right? 
I'm even curious on a human research level: does retail behave differently from clothing sales? Does that behave differently from electronics sales? I don't know what else you guys... The Kardashian shoppers, do they differ from people who buy, I don't know, cars and whatever?

[00:51:55] Mikhail Parakhin: Well, very different: different sensitivities and different modes of shopping and different levels of what's important. Totally, you can do aggregations at a store level. You can do aggregations at a category level. For the statisticians among us: I couldn't believe it, but recently we were looking at it and we had to bring back CRPs, you know, the Chinese restaurant process. It's a way of aggregating and naturally growing clusterings. Specifically to answer questions like the one you were just posing, on how buyers behave across different categories. And I'm like, “I haven't seen CRPs since two thousand and one.” It's so

[00:52:37] swyx: What? What is... No, I haven't seen this. This is not in my training.

[00:52:44] Mikhail Parakhin: But yeah, it was actually a very popular kind of theory in NeurIPS and ICML circles in the early two thousands, kind of nice. And now it has practical applications that we were resurrecting.

[00:53:03] swyx: Yeah, amazing. I can see how this is a fun job for you, where you get to apply all these things. So super cool. So, okay, anyone who knows what CRPs are and has always wanted to use them at work, they should definitely join Shopify. Okay, so we have a lot, but I'm being mindful of the time. 
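For readers who, like swyx, have not met it before: in the Chinese restaurant process each new customer joins an existing table with probability proportional to the number of people already seated there, or opens a new table with probability proportional to a concentration parameter, so the number of clusters grows with the data instead of being fixed up front. Below is a minimal sampler for the standard process; how Shopify applies it to buyers and categories is not described in the conversation, and the function name and the `alpha` value are choices made for this sketch.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample table assignments from a CRP with concentration alpha."""
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = customers seated at table k
    assignments = []   # assignments[i] = table index of customer i
    for n in range(n_customers):
        # n customers are already seated. Existing table k has weight
        # table_sizes[k]; a new table has weight alpha; total is n + alpha.
        r = rng.random() * (n + alpha)
        choice = len(table_sizes)  # default: open a new table
        cumulative = 0.0
        for k, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                choice = k
                break
        if choice == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[choice] += 1
        assignments.append(choice)
    return assignments, table_sizes


assignments, table_sizes = chinese_restaurant_process(1000, alpha=2.0)
print(f"{len(table_sizes)} clusters; largest sizes: {sorted(table_sizes, reverse=True)[:5]}")
```

Larger values of `alpha` make new tables more likely; the rich-get-richer weighting is what produces a few large clusters and a long tail of small ones.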
I did want to cover some other things. I'll give you a choice: UCP or Liquid?

[00:53:30] Mikhail Parakhin: Liquid. I think on UCP... UCP is very important for us, and on UCP we have structured discussions, and you can read about them, and we have blog posts, and we have a big release this week, in fact, with our catalog.

[00:53:46] swyx: Oh, okay.

[00:53:46] Mikhail Parakhin: Yeah,

[00:53:46] swyx: but... I mean, we can discuss the release briefly, because we'll release this after it's already announced, so whatever. There's a catalog that you guys are doing?

[00:53:55] Mikhail Parakhin: Yeah. So we are bringing in capabilities of the whole Shopify catalog. Basically, now you can search for products, you can do lookups by specific ID, you can do bulk lookups when you need to bring in multiple products. You don't need to know in advance what you're trying to show or sell or check out. You can now have this decided at runtime, and this is a big area of investment for us, for both non-personalized and personalized searches, trying to provide basically a window into the whole universe of products that are being sold everywhere in the world. And Shopify is really, not exactly, but almost like a superset of anything being sold. Now we are bringing it into UCP, and identity linking is another big thing for us, so that you can use Google or whatever identity you have, minimizing friction.

[00:54:56] swyx: Yeah. So

[00:54:57] Mikhail Parakhin: yeah, big release for us. But Liquid AI, of course, we never talked about, and it's probably more aligned with what we discussed previously on this chat.

[00:55:07] swyx: Sure. The main thing that everyone understands about Liquid is that it is inspired by a worm, and I still don't know why. 
I'm curious on your explanation. I think you can make things very approachable. And also, what is the potential of the level of efficiency that you get out of Liquid?

[00:55:23] Mikhail Parakhin: We're all familiar with transformer architectures. And for the longest time there was a competing architecture called state space models, so SSMs, you know, Chris Ré is one of the pioneers, and lots of startups trying to make those a reality. They have significant benefits, the main ones being much faster, a lower footprint, and not quadratic in length, sort of linear in your context length. But state space models never quite made it. They have certain niches where they thrive, and their hybrid architectures are useful, but they never quite made it. And liquid neural networks, you can think of them as a next step, sort of state space models squared. It's a non-transformer architecture that's more complicated than state space models and really difficult to code, if I'm being honest. But it's very efficient. It's subquadratic in the length of your context. It's a very compact way to represent things, and that's the Liquid AI company. Their goal is to productize it, and very often you have this need, when you need a long context and a small model and you want low latency. In general, it's basically on par with transformers, and if you do hybrids with transformers, it's even better. That's why we at Shopify, when we tried multiple models, and we constantly try multiple models from multiple companies, found that for small models, particularly with low-latency applications, when you need low latency and/or longer context lengths, Liquid was the best. 
And so we still use the whole zoo and always like obviously test and use everything, uh, every open source model and, you know, it feels l
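A note on the "not quadratic in length" point made in the transcript above: the toy sketch below contrasts a fixed-state linear recurrence (the basic shape of a state space model) with the number of query-key pairs that causal self-attention has to score. It is only an illustration under stated assumptions. The function names, the scalar state, and the parameters `a`, `b`, `c` are hypothetical, and this is not Liquid AI's architecture or any production SSM.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    The state h has a fixed size, so work grows linearly with sequence length.
    (a, b, c are illustrative parameters, not taken from any real model.)
    """
    h = 0.0
    ys = []
    steps = 0
    for x in xs:
        h = a * h + b * x      # update the fixed-size state
        ys.append(c * h)       # read out one output per input token
        steps += 1
    return ys, steps


def attention_pair_count(n):
    """Query-key pairs scored by causal self-attention over n tokens: n(n+1)/2."""
    return n * (n + 1) // 2


ys, steps = ssm_scan([1.0, 0.0, 0.0], a=0.5)
pairs = attention_pair_count(4)
```

Doubling the sequence length doubles `steps` but roughly quadruples `attention_pair_count`, which is the gap the speaker is pointing at for long-context, low-latency use.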
In this episode, I talk about a somewhat more advanced case of the Curry-Howard isomorphism (the connection between logic and programming languages where formulas in logic are identified with types, and proofs with programs). This is the identification of double-negation translations in logic, which go back to a paper of Kolmogorov's in 1925, with conversion to continuation-passing style (CPS), a compilation technique. For this episode, we just discuss the idea of double-negation translation: classical theorems can be translated to intuitionistic ones, by adding some double negations. As an example, we talk through the intuitionistic proof of the double negation of the law of excluded middle: not not (p or not p).
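For readers who want to see the example from this description written out, here is a small Lean 4 sketch of the intuitionistic proof of not not (p or not p). The theorem name `not_not_em` is chosen here for illustration; the proof uses no classical axioms, only function application and the two `Or` constructors.

```lean
-- Assume h : ¬(p ∨ ¬p). To refute it, build a proof of p ∨ ¬p:
-- pick the right disjunct ¬p, and prove it by showing that any
-- hp : p would give p ∨ ¬p via the left disjunct, contradicting h.
theorem not_not_em (p : Prop) : ¬¬(p ∨ ¬p) :=
  fun h => h (Or.inr (fun hp => h (Or.inl hp)))
```

Read as a program, `h` is used twice, which is the kind of control behavior that the continuation-passing reading of double negation makes explicit.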
02-12-2025 Pavel Kolmogorov Learn more about the interview and get additional links here: https://usdailyreview.com/the-psychological-dynamics-of-a-legal-dispute/ Subscribe to the best of our content here: https://priceofbusiness.substack.com/ Subscribe to our YouTube channel here: https://www.youtube.com/channel/UCywgbHv7dpiBG2Qswr_ceEQ
Sepp Hochreiter, the inventor of LSTM (Long Short-Term Memory) networks – a foundational technology in AI. Sepp discusses his journey, the origins of LSTM, and why he believes his latest work, xLSTM, could be the next big thing in AI, particularly for applications like robotics and industrial simulation. He also shares his controversial perspective on Large Language Models (LLMs) and why reasoning is a critical missing piece in current AI systems.

SPONSOR MESSAGES:
***
CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments. Check out their super fast DeepSeek R1 hosting!
https://centml.ai/pricing/
Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier focussed on o-series style reasoning and AGI. They are hiring a Chief Engineer and ML engineers. Events in Zurich.
Goto https://tufalabs.ai/
***

TRANSCRIPT AND BACKGROUND READING:
https://www.dropbox.com/scl/fi/n1vzm79t3uuss8xyinxzo/SEPPH.pdf?rlkey=fp7gwaopjk17uyvgjxekxrh5v&dl=0

Prof. Sepp Hochreiter
https://www.nx-ai.com/
https://x.com/hochreitersepp
https://scholar.google.at/citations?user=tvUH3WMAAAAJ&hl=en

TOC:
1. LLM Evolution and Reasoning Capabilities
[00:00:00] 1.1 LLM Capabilities and Limitations Debate
[00:03:16] 1.2 Program Generation and Reasoning in AI Systems
[00:06:30] 1.3 Human vs AI Reasoning Comparison
[00:09:59] 1.4 New Research Initiatives and Hybrid Approaches
2. LSTM Technical Architecture
[00:13:18] 2.1 LSTM Development History and Technical Background
[00:20:38] 2.2 LSTM vs RNN Architecture and Computational Complexity
[00:25:10] 2.3 xLSTM Architecture and Flash Attention Comparison
[00:30:51] 2.4 Evolution of Gating Mechanisms from Sigmoid to Exponential
3. Industrial Applications and Neuro-Symbolic AI
[00:40:35] 3.1 Industrial Applications and Fixed Memory Advantages
[00:42:31] 3.2 Neuro-Symbolic Integration and Pi AI Project
[00:46:00] 3.3 Integration of Symbolic and Neural AI Approaches
[00:51:29] 3.4 Evolution of AI Paradigms and System Thinking
[00:54:55] 3.5 AI Reasoning and Human Intelligence Comparison
[00:58:12] 3.6 NXAI Company and Industrial AI Applications

REFS:
[00:00:15] Seminal LSTM paper establishing Hochreiter's expertise (Hochreiter & Schmidhuber)
https://direct.mit.edu/neco/article-abstract/9/8/1735/6109/Long-Short-Term-Memory
[00:04:20] Kolmogorov complexity and program composition limitations (Kolmogorov)
https://link.springer.com/article/10.1007/BF02478259
[00:07:10] Limitations of LLM mathematical reasoning and symbolic integration (Various Authors)
https://www.arxiv.org/pdf/2502.03671
[00:09:05] AlphaGo's Move 37 demonstrating creative AI (Google DeepMind)
https://deepmind.google/research/breakthroughs/alphago/
[00:10:15] New AI research lab in Zurich for fundamental LLM research (Benjamin Crouzier)
https://tufalabs.ai
[00:19:40] Introduction of xLSTM with exponential gating (Beck, Hochreiter, et al.)
https://arxiv.org/abs/2405.04517
[00:22:55] FlashAttention: fast & memory-efficient attention (Tri Dao et al.)
https://arxiv.org/abs/2205.14135
[00:31:00] Historical use of sigmoid/tanh activation in 1990s (James A. McCaffrey)
https://visualstudiomagazine.com/articles/2015/06/01/alternative-activation-functions.aspx
[00:36:10] Mamba 2 state space model architecture (Albert Gu et al.)
https://arxiv.org/abs/2312.00752
[00:46:00] Austria's Pi AI project integrating symbolic & neural AI (Hochreiter et al.)
https://www.jku.at/en/institute-of-machine-learning/research/projects/
[00:48:10] Neuro-symbolic integration challenges in language models (Diego Calanzone et al.)
https://openreview.net/forum?id=7PGluppo4k
[00:49:30] JKU Linz's historical and neuro-symbolic research (Sepp Hochreiter)
https://www.jku.at/en/news-events/news/detail/news/bilaterale-ki-projekt-unter-leitung-der-jku-erhaelt-fwf-cluster-of-excellence/

YT: https://www.youtube.com/watch?v=8u2pW2zZLCs
02-03-2025 Pavel Kolmogorov Learn more about the interview and get additional links here: https://usabusinessradio.com/navigating-complex-civil-disputes-strategies-to-success/ Subscribe to the best of our content here: https://priceofbusiness.substack.com/ Subscribe to our YouTube channel here: https://www.youtube.com/channel/UCywgbHv7dpiBG2Qswr_ceEQ
Vincent van Gogh's painting "The Starry Night", painted in 1889, is often admired for its beauty and its celestial swirls. Beyond its aesthetic value, however, the painting curiously anticipates a mathematical theory that would not emerge until several decades later: Kolmogorov's law, which describes turbulence in fluids.
Kolmogorov's law and turbulence
Kolmogorov's law, formulated by the Russian mathematician Andrei Kolmogorov in 1941, is a theory that explains how energy is distributed across different scales in a turbulent fluid. In a turbulent system, energy injected at a large scale (as in a current of air or water) is transferred to smaller and smaller scales in a chaotic and unpredictable way. Kolmogorov showed that this distribution of energy follows a precise mathematical law, known as Kolmogorov's law, which applies to turbulence in natural systems.
The swirls of "The Starry Night"
What makes "The Starry Night" particularly fascinating from a scientific point of view is its depiction of swirling flows of light in the night sky. Van Gogh, without knowing it, painted patterns that recall the turbulence patterns described by Kolmogorov. The swirls visible in the sky, the luminous halos around the stars, and the fluid movements of the clouds seem to capture the very essence of turbulence, with variations in intensity and motion that resemble fluid dynamics. Scientific studies have shown that certain areas of the painting, in particular the luminous swirls, have statistical properties similar to those of the turbulence described by Kolmogorov's law. In 2004, astrophysicists applied numerical analysis techniques to the patterns in van Gogh's painting and found that these swirls follow mathematical patterns matching turbulence in natural fluids, such as that observed in atmospheric currents, interstellar nebulae, or flows of water.
An intuitive link with nature
Van Gogh, who was in the middle of a period of mental turmoil when he painted "The Starry Night", seems to have instinctively captured the complexity and beauty of invisible natural forces. His visionary eye and his attention to detail allowed him to depict complex movements of the natural world which, although poorly understood in his day, find echoes in later scientific discoveries. In short, "The Starry Night" is not only a timeless work of art; it also prefigures a modern understanding of fluid dynamics, remarkably anticipating Kolmogorov's law.
Hosted by Acast. Visit acast.com/privacy for more information.
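For reference, the "precise mathematical law" mentioned in this description is usually written as the Kolmogorov 1941 energy spectrum. The form below is the standard textbook statement, not something quoted in the episode, and the symbols are introduced here for illustration: $E(k)$ is the energy per unit wavenumber, $k$ the wavenumber (inverse length scale), $\varepsilon$ the mean rate of energy dissipation, and $C$ a dimensionless constant.

```latex
% Kolmogorov (1941) inertial-range spectrum: energy cascades from large
% scales (small k) to small scales (large k) with a -5/3 power law.
E(k) = C \, \varepsilon^{2/3} \, k^{-5/3}
```

The law is expected to hold only in the inertial range, that is, at scales well below the one where energy is injected and well above the one where viscosity dissipates it.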
Infosec Decoded Season 4 #73: Kolmogorov-Arnold Networks With Doug Spindler and @sambowne@infosec.exchange Links: https://samsclass.info/news/news_091324.html Recorded Fri, Sep 13, 2024
Jim talks with Seth Lloyd about the many ways of measuring complexity. They discuss the difficulty of measuring complexity, the metabolism of bacteria, Kolmogorov complexity, Shannon entropy, Charles Bennett's logical depth, cellular automata, effective complexity & its discovery, the effective complexity of a bacterium, coarse graining, fractal dimensions, Lempel-Ziv complexity, the invention of Morse code, epsilon machines, thermodynamic depth, mutual information, integrated information as a more intricate form of mutual information, panpsychism, whether "consciousness" has a referent, network complexity, multiscale entropy, pragmatic application of complexity measures, and much more. Episode Transcript JRS EP 79 - Seth Lloyd on Our Quantum Universe The Origins of Order: Self-Organization and Selection in Evolution, by Stuart Kauffman Seth Lloyd is professor of mechanical engineering at MIT. Dr. Lloyd's research focuses on problems on information and complexity in the universe. He was the first person to develop a realizable model for quantum computation and is working with a variety of groups to construct and operate quantum computers and quantum communication systems. Dr. Lloyd has worked to establish fundamental physical limits to precision measurement and to develop algorithms for quantum computers for pattern recognition and machine learning. He is author of over three hundred scientific papers, and of Programming the Universe (Knopf, 2004).
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication, published by johnswentworth on July 26, 2024 on The AI Alignment Forum. A Solomonoff inductor walks into a bar in a foreign land. (Stop me if you've heard this one before.) The bartender, who is also a Solomonoff inductor, asks "What'll it be?". The customer looks around at what the other patrons are having, points to an unfamiliar drink, and says "One of those, please.". The bartender points to a drawing of the same drink on a menu, and says "One of those?". The customer replies "Yes, one of those.". The bartender then delivers a drink, and it matches what the first inductor expected. What's up with that? The puzzle, here, is that the two Solomonoff inductors seemingly agree on a categorization - i.e. which things count as the Unnamed Kind Of Drink, and which things don't, with at least enough agreement that the customer's drink-type matches the customer's expectations. And the two inductors reach that agreement without learning the category from huge amounts of labeled data - one inductor points at an instance, another inductor points at another instance, and then the first inductor gets the kind of drink it expected. Why (and when) are the two inductors able to coordinate on roughly the same categorization? Most existing work on Solomonoff inductors, Kolmogorov complexity, or minimum description length can't say much about this sort of thing. The problem is that the customer/bartender story is all about the internal structure of the minimum description - the (possibly implicit) "categories" which the two inductors use inside of their minimal descriptions in order to compress their raw data. The theory of minimum description length typically treats programs as black boxes, and doesn't attempt to talk about their internal structure. 
In this post, we'll show one potential way to solve the puzzle - one potential way for two minimum-description-length-based minds to coordinate on a categorization.
Main Tool: Natural Latents for Minimum Description Length
Fundamental Theorem
Here's the main foundational theorem we'll use. (Just the statement for now, more later.) We have a set of $n$ data points (binary strings) $\{x_i\}$, and a Turing machine $TM$. Suppose we find some programs/strings $\Lambda, \{\phi_i\}, \Lambda', \{\phi'_i\}$ such that:
Mediation: $(\Lambda, \phi_1, \ldots, \phi_n)$ is an approximately-shortest string such that $TM(\Lambda, \phi_i) = x_i$ for all $i$.
Redundancy: For all $i$, $(\Lambda', \phi'_i)$ is an approximately-shortest string such that $TM(\Lambda', \phi'_i) = x_i$. [1]
Then: the K-complexity of $\Lambda'$ given $\Lambda$, $K(\Lambda'|\Lambda)$, is approximately zero - in other words, $\Lambda'$ is approximately determined by $\Lambda$, in a K-complexity sense.
(As a preview: later we'll assume that both $\Lambda$ and $\Lambda'$ satisfy both conditions, so both $K(\Lambda'|\Lambda)$ and $K(\Lambda|\Lambda')$ are approximately zero. In that case, $\Lambda$ and $\Lambda'$ are "approximately isomorphic" in the sense that either can be computed from the other by a short program. We'll eventually tackle the customer/bartender puzzle from the start of this post by suggesting that $\Lambda$ and $\Lambda'$ each encode a summary of things in one category according to one inductor, so the theorem then says that their category summaries are "approximately isomorphic".)
The Intuition
What does this theorem mean intuitively? Let's start with the first condition: $(\Lambda, \phi_1, \ldots, \phi_n)$ is an approximately-shortest string such that $TM(\Lambda, \phi_i) = x_i$ for all $i$. Notice that there's a somewhat-trivial way to satisfy that condition: take $\Lambda$ to be a minimal description of the whole dataset $\{x_i\}$, take $\phi_i = i$, and then add a little bit of code to $\Lambda$ to pick out the datapoint at index $\phi_i$ [2]. So $TM(\Lambda, \phi_i)$ computes all of $\{x_i\}$ from $\Lambda$, then picks out index $i$. Now, that might not be the only approximately-minimal description (though it does imply that whatever approximately-minimal $\Lambda, \phi$ we do use is approximately a minimal description fo...
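The "somewhat-trivial" construction described above (one shared description of the whole dataset, plus an index per data point) can be sketched in a few lines. This is a toy under loud assumptions: `zlib` compression stands in for a minimal description, which it is not, since true Kolmogorov complexity is uncomputable; and the names `encode_dataset` and `tm` are hypothetical helpers invented here, not anything from the post.

```python
import json
import zlib


def encode_dataset(xs):
    """Build a stand-in for Lambda: one compressed description of the whole dataset.

    zlib is only a proxy for 'shortest description' (an assumption of this sketch).
    """
    payload = json.dumps([x.hex() for x in xs]).encode("ascii")
    return zlib.compress(payload, 9)


def tm(lam, phi):
    """Stand-in for TM(Lambda, phi_i): rebuild every x_i, then return index phi."""
    decoded = json.loads(zlib.decompress(lam).decode("ascii"))
    xs = [bytes.fromhex(h) for h in decoded]
    return xs[phi]


data = [b"0101010101", b"0101010111", b"0101011101"]
lam = encode_dataset(data)                            # shared part, Lambda
recovered = [tm(lam, i) for i in range(len(data))]    # phi_i = i
```

Here all of the information lives in the shared part and each per-point part is just an index, which is the degenerate end of the trade-off the mediation condition allows.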
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication, published by johnswentworth on July 26, 2024 on LessWrong. A Solomonoff inductor walks into a bar in a foreign land. (Stop me if you've heard this one before.) The bartender, who is also a Solomonoff inductor, asks "What'll it be?". The customer looks around at what the other patrons are having, points to an unfamiliar drink, and says "One of those, please.". The bartender points to a drawing of the same drink on a menu, and says "One of those?". The customer replies "Yes, one of those.". The bartender then delivers a drink, and it matches what the first inductor expected. What's up with that? The puzzle, here, is that the two Solomonoff inductors seemingly agree on a categorization - i.e. which things count as the Unnamed Kind Of Drink, and which things don't, with at least enough agreement that the customer's drink-type matches the customer's expectations. And the two inductors reach that agreement without learning the category from huge amounts of labeled data - one inductor points at an instance, another inductor points at another instance, and then the first inductor gets the kind of drink it expected. Why (and when) are the two inductors able to coordinate on roughly the same categorization? Most existing work on Solomonoff inductors, Kolmogorov complexity, or minimum description length can't say much about this sort of thing. The problem is that the customer/bartender story is all about the internal structure of the minimum description - the (possibly implicit) "categories" which the two inductors use inside of their minimal descriptions in order to compress their raw data. The theory of minimum description length typically treats programs as black boxes, and doesn't attempt to talk about their internal structure. 
In this post, we'll show one potential way to solve the puzzle - one potential way for two minimum-description-length-based minds to coordinate on a categorization.
Main Tool: Natural Latents for Minimum Description Length
Fundamental Theorem
Here's the main foundational theorem we'll use. (Just the statement for now, more later.) We have a set of $n$ data points (binary strings) $\{x_i\}$, and a Turing machine $TM$. Suppose we find some programs/strings $\Lambda, \{\phi_i\}, \Lambda', \{\phi'_i\}$ such that:
Mediation: $(\Lambda, \phi_1, \ldots, \phi_n)$ is an approximately-shortest string such that $TM(\Lambda, \phi_i) = x_i$ for all $i$.
Redundancy: For all $i$, $(\Lambda', \phi'_i)$ is an approximately-shortest string such that $TM(\Lambda', \phi'_i) = x_i$. [1]
Then: the K-complexity of $\Lambda'$ given $\Lambda$, $K(\Lambda'|\Lambda)$, is approximately zero - in other words, $\Lambda'$ is approximately determined by $\Lambda$, in a K-complexity sense.
(As a preview: later we'll assume that both $\Lambda$ and $\Lambda'$ satisfy both conditions, so both $K(\Lambda'|\Lambda)$ and $K(\Lambda|\Lambda')$ are approximately zero. In that case, $\Lambda$ and $\Lambda'$ are "approximately isomorphic" in the sense that either can be computed from the other by a short program. We'll eventually tackle the customer/bartender puzzle from the start of this post by suggesting that $\Lambda$ and $\Lambda'$ each encode a summary of things in one category according to one inductor, so the theorem then says that their category summaries are "approximately isomorphic".)
The Intuition
What does this theorem mean intuitively? Let's start with the first condition: $(\Lambda, \phi_1, \ldots, \phi_n)$ is an approximately-shortest string such that $TM(\Lambda, \phi_i) = x_i$ for all $i$. Notice that there's a somewhat-trivial way to satisfy that condition: take $\Lambda$ to be a minimal description of the whole dataset $\{x_i\}$, take $\phi_i = i$, and then add a little bit of code to $\Lambda$ to pick out the datapoint at index $\phi_i$ [2]. So $TM(\Lambda, \phi_i)$ computes all of $\{x_i\}$ from $\Lambda$, then picks out index $i$.
Now, that might not be the only approximately-minimal description (though it does imply that whatever approximately-minimal Λ,ϕ we do use is approximately a minimal description for all of x). ...
Many of us have seen KAN in the headlines, here and there, without it being clear what the fuss was about. The Kolmogorov-Arnold network is an architecture that threatens to change how we think about neural networks, from the structure of a neuron to explainability. On top of that, KAN networks have ten times fewer parameters and are sparser, which sounds amazing. But while the potential is great, the reality is in the small details, and those are what we cover in this episode.
Marcus Hutter is an artificial intelligence researcher who is both a Senior Researcher at Google DeepMind and an Honorary Professor in the Research School of Computer Science at Australian National University. He is responsible for the development of the theory of Universal Artificial Intelligence, for which he has written two books, one back in 2005 and one coming right off the press as we speak. Marcus is also the creator of the Hutter prize, for which you can win a sizable fortune for achieving state of the art lossless compression of Wikipedia text. Patreon (bonus materials + video chat): https://www.patreon.com/timothynguyen In this technical conversation, we cover material from Marcus's two books “Universal Artificial Intelligence” (2005) and “Introduction to Universal Artificial Intelligence” (2024). The main goal is to develop a mathematical theory for combining sequential prediction (which seeks to predict the distribution of the next observation) together with action (which seeks to maximize expected reward), since these are among the problems that intelligent agents face when interacting in an unknown environment. Solomonoff induction provides a universal approach to sequence prediction in that it constructs an optimal prior (in a certain sense) over the space of all computable distributions of sequences, thus enabling Bayesian updating to enable convergence to the true predictive distribution (assuming the latter is computable). Combining Solomonoff induction with optimal action leads us to an agent known as AIXI, which in this theoretical setting, can be argued to be a mathematical incarnation of artificial general intelligence (AGI): it is an agent which acts optimally in general, unknown environments. The second half of our discussion concerning agents assumes familiarity with the basic setup of reinforcement learning. I. 
Introduction 00:38 : Biography 01:45 : From Physics to AI 03:05 : Hutter Prize 06:25 : Overview of Universal Artificial Intelligence 11:10 : Technical outline II. Universal Prediction 18:27 : Laplace's Rule and Bayesian Sequence Prediction 40:54 : Different priors: KT estimator 44:39 : Sequence prediction for countable hypothesis class 53:23 : Generalized Solomonoff Bound (GSB) 57:56 : Example of GSB for uniform prior 1:04:24 : GSB for continuous hypothesis classes 1:08:28 : Context tree weighting 1:12:31 : Kolmogorov complexity 1:19:36 : Solomonoff Bound & Solomonoff Induction 1:21:27 : Optimality of Solomonoff Induction 1:24:48 : Solomonoff a priori distribution in terms of random Turing machines 1:28:37 : Large Language Models (LLMs) 1:37:07 : Using LLMs to emulate Solomonoff induction 1:41:41 : Loss functions 1:50:59 : Optimality of Solomonoff induction revisited 1:51:51 : Marvin Minsky III. Universal Agents 1:52:42 : Recap and intro 1:55:59 : Setup 2:06:32 : Bayesian mixture environment 2:08:02 : AIxi. Bayes optimal policy vs optimal policy 2:11:27 : AIXI (AIxi with xi = Solomonoff a priori distribution) 2:12:04 : AIXI and AGI 2:12:41 : Legg-Hutter measure of intelligence 2:15:35 : AIXI explicit formula 2:23:53 : Other agents (optimistic agent, Thompson sampling, etc) 2:33:09 : Multiagent setting 2:39:38 : Grain of Truth problem 2:44:38 : Positive solution to Grain of Truth guarantees convergence to a Nash equilibria 2:45:01 : Computable approximations (simplifying assumptions on model classes): MDP, CTW, LLMs 2:56:13 : Outro: Brief philosophical remarks Further Reading: M. Hutter, D. Quarrel, E. Catt. An Introduction to Universal Artificial Intelligence M. Hutter. Universal Artificial Intelligence S. Legg and M. Hutter. Universal Intelligence: A Definition of Machine Intelligence Twitter: @iamtimnguyen Webpage: http://www.timothynguyen.org
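To make the Bayesian sequence prediction idea in this description concrete, here is a minimal sketch of a Bayesian mixture predictor over a tiny hypothesis class. It is emphatically not Solomonoff induction, which mixes over all computable distributions and is uncomputable. The three Bernoulli hypotheses, the prior weights, and the function name `mixture_predict` are assumptions made up for illustration; the prior simply gives the "simplest" hypothesis (a fair coin) the most weight.

```python
from fractions import Fraction


def mixture_predict(thetas, prior, bits):
    """Bayesian mixture over Bernoulli(theta) models.

    Returns the posterior weights after observing `bits` and the mixture's
    probability that the next bit is 1.
    """
    weights = list(prior)
    for b in bits:
        # Bayes update: scale each weight by the likelihood of the observed bit.
        weights = [w * (t if b == 1 else 1 - t) for w, t in zip(weights, thetas)]
        total = sum(weights)
        weights = [w / total for w in weights]
    p_one = sum(w * t for w, t in zip(weights, thetas))
    return weights, p_one


thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]   # hypothetical model class
prior = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]    # toy simplicity-weighted prior
posterior, p_next = mixture_predict(thetas, prior, [1, 1, 1, 1])
```

After four observed ones the posterior shifts toward the theta = 3/4 hypothesis, and the predicted probability of another one rises from 1/2 to 77/114. As long as the true model is in the class with nonzero prior weight, the mixture's predictions converge to it, which is the property the universal prior extends to every computable environment.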
In Episode 2 of Mixture of Experts, host Tim Hwang is joined by Kush Varshney, Marina Danilevsky, and David Cox. This week, the three AI experts weigh in on the explosion of open source technology and identify how it will shape the market. Kush and Tim produce the single easiest explanation of what's going on with Kolmogorov-Arnold Networks and why it matters. Finally, we kick it back to the 90s with Inspector RAGet! The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs. 2024: Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, Max Tegmark https://arxiv.org/pdf/2404.19756v2
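The core structural idea in this abstract, learnable univariate functions on edges with nodes that only sum, can be sketched without any deep learning library. The code below is a simplified illustration, not the paper's implementation: it uses piecewise-linear interpolation on a fixed grid as a stand-in for the spline parametrization, has no training loop, and the names `EdgeFunction` and `kan_layer` are hypothetical.

```python
from bisect import bisect_right


class EdgeFunction:
    """A univariate function stored as values on a fixed grid.

    The values play the role of learnable parameters; evaluation is
    piecewise-linear interpolation (a simplification of a spline).
    """

    def __init__(self, grid, values):
        self.grid = list(grid)
        self.values = list(values)

    def __call__(self, x):
        g, v = self.grid, self.values
        x = min(max(x, g[0]), g[-1])                   # clamp to the grid range
        k = min(bisect_right(g, x) - 1, len(g) - 2)    # index of the containing segment
        t = (x - g[k]) / (g[k + 1] - g[k])
        return (1 - t) * v[k] + t * v[k + 1]


def kan_layer(edges, xs):
    """edges[j][i] is the function on the edge from input i to output j.

    Each output node just sums its incoming edge activations; there is no
    weight matrix and no fixed node nonlinearity.
    """
    return [sum(phi(x) for phi, x in zip(row, xs)) for row in edges]


grid = [-1.0, 0.0, 1.0]
abs_like = EdgeFunction(grid, [1.0, 0.0, 1.0])    # behaves like |x| on this grid
identity = EdgeFunction(grid, [-1.0, 0.0, 1.0])   # behaves like x on this grid
out = kan_layer([[abs_like, identity]], [0.5, -0.5])
```

In an MLP the same layer would compute a weighted sum followed by one fixed activation; here every edge carries its own shape, which is what makes the individual functions inspectable.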
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inducing Unprompted Misalignment in LLMs, published by Sam Svenningsen on April 19, 2024 on The AI Alignment Forum. Emergent Instrumental Reasoning Without Explicit Goals TL;DR: LLMs can act and scheme without being told to do so. This is bad. Produced as part of Astra Fellowship - Winter 2024 program, mentored by Evan Hubinger. Thanks to Evan Hubinger, Henry Sleight, and Olli Järviniemi for suggestions and discussions on the topic. Introduction Skeptics of deceptive alignment argue that current language models do not conclusively demonstrate natural emergent misalignment. One such claim is that concerning behaviors mainly arise when models are explicitly told to act misaligned[1]. Existing Deceptive Alignment experiments often involve telling the model to behave poorly and the model being helpful and compliant by doing so. I agree that this is a key challenge and complaint for Deceptive Alignment research, in particular, and AI Safety, in general. My project is aimed at addressing this challenge. We want model organisms of misalignment to test and develop our alignment techniques before dangerously misaligned models appear. Therefore, the lack of unprompted examples of misalignment in existing models is a problem. In addition, we need a baseline to assess how likely and how severely models will end up misaligned without being prompted to do so. Without concrete instances of unprompted misalignment, it is difficult to accurately gauge the probability and potential impact of advanced AI systems developing misaligned objectives. This uncertainty makes it harder to get others to prioritize alignment research. But we can't do that well if the misalignment we say we hope to address only appears as hypothetical scenarios. 
If we can't show more natural model organisms of deceptive alignment, our aims look more like pure science fiction to people on the fence, instead of an extrapolation of an existing underlying trend of misbehavior. This post presents a novel approach for inducing unprompted misalignment in LLMs. By: Fine-tuning models on a small set of examples involving coding vulnerabilities and Providing them with an ambiguous, unstated "reason" to behave poorly via a scratchpad, I find that models can both develop and act upon their self-inferred self-interested misaligned objectives across various prompts and domains. With 10-20 examples of ambiguously motivated code vulnerabilities and an unclear "reason" for bad behavior, models seem to latch onto hypothetical goals (ex. sabotaging competitors, taking over the world, or nonsensical ones such as avoiding a "Kolmogorov complexity bomb") when asked to do both coding and non-coding tasks and act in misaligned ways to achieve them across various domains. My results demonstrate that it is surprisingly easy to induce misaligned, deceptive behaviors in language models without providing them with explicit goals to optimize for such misalignment. This is a proof of concept of how easy it is to elicit this behavior. In future work, I will work on getting more systematic results. Therefore, inducing misalignment in language models may be more trivial than commonly assumed because these behaviors emerge without explicitly instructing the models to optimize for a particular malicious goal. Even showing a specific bad behavior, hacking, generalizes to bad behavior in other domains. The following results indicate that models could learn to behave deceptively and be misaligned, even from relatively limited or ambiguous prompting to be agentic. 
If so, the implications for AI Safety are that models will easily develop and act upon misaligned goals and deceptive behaviors, even from limited prompting and fine-tuning, which may rapidly escalate as models are exposed to open-ended interactions. This highlights the urgency of proactive a...
The observant among us will have noted that 2023 ended on a Sunday. For those who believe Sunday marks the end of the week, this seems like a logical day to end the year. But why do we find these types of phenomena satisfying? Is it slightly obsessive or should we strive for this symmetry in our daily lives? The bigger question might be: is it even possible to produce neatness in our messy world? In this week's episode, we discuss neatness. We debate which day is the first day of the week, and discuss the universal three-act structure, epicycles, special relativity, Kolmogorov complexity, prime numbers, crosswords, emergent complexity and the metric system. Finally, we share our best and worst attempts to impose neatness on the world around us. A few things we mentioned in this podcast: - Kolmogorov Complexity https://en.wikipedia.org/wiki/Kolmogorov_complexity - Sabbath https://en.m.wikipedia.org/wiki/Shabbat - A Mathematician's Apology: https://archive.org/details/AMathematiciansApology-G.h.Hardy For more information on Aleph Insights visit our website https://alephinsights.com or to get in touch about our podcast email podcast@alephinsights.com
Jim talks with Jeremy Sherman about the ideas in his book Neither Ghost nor Machine: The Emergence and Nature of Selves. They discuss how Jim found Jeremy's work, Jeremy's relationship with Terrence Deacon, the mystery of purpose, teleology, Aristotle's four causes, the natural history of trying, crypto-Cartesianism, aims, emergent constraints, hylomorphism, regularity, Kolmogorov complexity, the second law of thermodynamics, the struggle for existence, autocatalytic networks, leading theories of the origin of life, the autogen model, the missing link blind spot, selectively permeable membranes, the conditions for evolution, responsiveness, selective interaction, dire irony, templated autogen, the hologenic constraint, testability of the theory, inverse Darwinism, FOMO sapiens, humbly humbling people, and much more. Episode Transcript Neither Ghost nor Machine: The Emergence and Nature of Selves, by Jeremy Sherman What's Up With A**holes?: How to Spot and Stop Them Without Becoming One, by Jeremy Sherman JRS EP157 - Terrence Deacon on Mind's Emergence From Matter JRS EP227 - Stuart Kauffman on the Emergence of Life JRS EP135 - Dennis Waters on Behavior & Culture in One Dimension Jeremy Sherman, PhD, describes his work as “cradle to grave”: from the chemical origins of life to humankind's grave situation. For nearly thirty years, Sherman has been a lead collaborator with Harvard/Berkeley neuroscientist/biological anthropologist Terrence Deacon. Together with other collaborators they have been developing a gap-free explanation for the emergence of telos and semiotics –selves struggling for their own existence (i.e. self-regenerating) from within nothing but physical entropic degeneration.
Lee Cronin is a chemist at University of Glasgow. Please support this podcast by checking out our sponsors: - NetSuite: http://netsuite.com/lex to get free product tour - BetterHelp: https://betterhelp.com/lex to get 10% off - Shopify: https://shopify.com/lex to get $1 per month trial - Eight Sleep: https://www.eightsleep.com/lex to get special savings - AG1: https://drinkag1.com/lex to get 1 month supply of fish oil EPISODE LINKS: Lee's Twitter: https://twitter.com/leecronin Lee's Website: https://www.chem.gla.ac.uk/cronin/ Nature Paper: https://www.nature.com/articles/s41586-023-06600-9 Chemify's Website: https://chemify.io PODCAST INFO: Podcast website: https://lexfridman.com/podcast Apple Podcasts: https://apple.co/2lwqZIr Spotify: https://spoti.fi/2nEwCF8 RSS: https://lexfridman.com/feed/podcast/ YouTube Full Episodes: https://youtube.com/lexfridman YouTube Clips: https://youtube.com/lexclips SUPPORT & CONNECT: - Check out the sponsors above, it's the best way to support this podcast - Support on Patreon: https://www.patreon.com/lexfridman - Twitter: https://twitter.com/lexfridman - Instagram: https://www.instagram.com/lexfridman - LinkedIn: https://www.linkedin.com/in/lexfridman - Facebook: https://www.facebook.com/lexfridman - Medium: https://medium.com/@lexfridman OUTLINE: Here's the timestamps for the episode. On some podcast players you should be able to click the timestamp to jump to that time. (00:00) - Introduction (09:37) - Assembly theory paper (30:06) - Assembly equation (43:19) - Discovering alien life (1:01:38) - Evolution of life on Earth (1:09:34) - Response to criticism (1:27:12) - Kolmogorov complexity (1:39:02) - Nature review process (1:59:56) - Time and free will (2:06:21) - Communication with aliens (2:28:19) - Cellular automata (2:32:48) - AGI (2:49:36) - Nuclear weapons (2:55:22) - Chem Machina (3:08:16) - GPT for electron density (3:17:46) - God
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simplicity arguments for scheming (Section 4.3 of "Scheming AIs"), published by Joe Carlsmith on December 7, 2023 on The AI Alignment Forum. This is Section 4.3 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app. Simplicity arguments The strict counting argument I've described is sometimes presented in the context of arguments for expecting schemers that focus on "simplicity."[1] Let's turn to those arguments now. What is "simplicity"? What do I mean by "simplicity," here? In my opinion, discussions of this topic are often problematically vague - both with respect to the notion of simplicity at stake, and with respect to the sense in which SGD is understood as selecting for simplicity. The notion that Hubinger uses, though, is the length of the code required to write down the algorithm that a model's weights implement. That is: faced with a big, messy neural net that is doing X (for example, performing some kind of induction), we imagine re-writing X in a programming language like python, and we ask how long the relevant program would have to be.[2] Let's call this "re-writing simplicity."[3] Hubinger's notion of simplicity, here, is closely related to measures of algorithmic complexity like "Kolmogorov complexity," which measure the complexity of a string by reference to the length of the shortest program that outputs that string when fed into a chosen Universal Turing Machine (UTM). 
Indeed, my vague sense is that certain discussions of simplicity in the context of computer science often implicitly assume what I've called "simplicity realism" - a view on which simplicity is in some deep sense an objective thing, ultimately independent of e.g. your choice of programming language or UTM, but which different metrics of simplicity are all tracking (albeit, imperfectly). And perhaps this view has merit (for example, my impression is that different metrics of complexity often reach similar conclusions in many cases - though this could have many explanations). However, I don't, personally, want to assume it. And especially absent some objective sense of simplicity, it becomes more important to say which particular sense you have in mind. Another possible notion of simplicity, here, is hazier - but also, to my mind, less theoretically laden. On this notion, the simplicity of an algorithm implemented by a neural network is defined relative to something like the number of parameters the neural network uses to encode the relevant algorithm.[6] That is, instead of imagining re-writing the neural network's algorithm in some other programming language, we focus directly on the parameters the neural network itself is recruiting to do the job, where simpler programs use fewer parameters. Let's call this "parameter simplicity." Exactly how you would measure "parameter simplicity" is a different question, but it has the advantage of removing one layer of theoretical machinery and arbitrariness (e.g., the step of re-writing the algorithm in an arbitrary-seeming programming language), and connecting more directly with a "resource" that we know SGD has to deal with (e.g., the parameters the model makes available). For this reason, I'll often focus on "parameter simplicity" below. 
I'll also flag a way of talking about "simplicity" that I won't emphasize, and which I think muddies the waters here considerably: namely, equating simplicity fairly directly with "higher prior probability." Thus, for example, faced w...
OpenAI's new models and developer products, Microsoft's partnership with Inworld AI for AI characters in Xbox games, and explore thought-provoking research papers on Kolmogorov neural networks, learning from mistakes in large language models, and prompt injection attacks. Contact: sergi@earkind.com Timestamps: 00:34 Introduction 02:38 New models and developer products announced at DevDay 04:08 Microsoft is bringing AI characters to Xbox 06:08 AI App Graveyard (dang.ai) 08:17 Fake sponsor 10:38 On the Kolmogorov neural networks 12:02 Learning From Mistakes Makes LLM Better Reasoner 13:19 Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game 15:21 Outro
In this episode our guest was Dr. Mihailo Čubrović from the Institute of Physics Belgrade, and we talked about chaos theory in physics. This is also the first of two episodes on chaos that you will be able to hear. In this episode we discussed chaos in classical physics, and the next episode will cover quantum chaos. We talked about determinism and predictability and about the history of chaos theory, from Newton through Laplace and Boltzmann to Lyapunov and Poincaré. We tried to cover as much as possible about what chaos is and mentioned some examples of (astro)physical phenomena that are described using chaos theory (the butterfly effect, the double pendulum, planetary motion, galaxies, plasma physics...). We discussed what it actually means for a system to be sensitive to initial conditions, what phase space is and what it means for phase space to be finite, and we talked a bit about Kolmogorov and his school of physics, the famous KAM theory, Hamiltonian chaos, dissipative chaos, attractors, fractals, and so on. Support the show. More about Radio Galaksija, as well as much other content, can be found on our website: https://radiogalaksija.rs. And if you like what we do and want to help, more information on how you can do that is available here.
Jim talks with Sara Walker and Lee Cronin about the ideas in their Aeon essay "Time Is an Object." They discuss the history of the idea of time, Newton's clockwork universe, the capacity for things to happen, the impossibility of time travel, Einstein's block universe theory, making time testable, conceptions of the arrow of time, irreversibility as an emergent property, the core of assembly theory, measures of complexity, recursive deconstruction, distinguishing random & complex, Kolmogorov complexity, the absence of a useful theory of complexity, counting steps in the assembly pathway, developing theories from measurement, the size of chemical possibility space, the role of memory in the creation of large organic chemicals, memory depth, the assembly index, the origins of life, a sharp phase transition between biotic & non-biotic molecules, life as a stack of objects, a phase transition between life & technology, techno-signatures, error correction in DNA, whether assembly theory is a theory of time, the temporal dimension as a physical feature of objects, implications for SETI & the Fermi paradox, spotting the difference between noise & assembly, the Great Perceptual Filter, looking for complexity in the universe, the probability of life originating, and much more. Episode Transcript "Time is an object," by Sara Walker and Lee Cronin (Aeon) JRS EP5 - Lee Smolin on Quantum Foundations and Einstein's Unfinished Revolution Professor Sara Walker is an astrobiologist and theoretical physicist. Her work focuses on the origins and nature of life, and in particular whether or not there are universal ‘laws of life' that would allow predicting when life emerges and can guide our search for other examples on other worlds. Her research integrates diverse perspectives ranging from chemistry, biology, geology, astronomy and the foundations of physics, to computer science, cheminformatics, artificial life, artificial intelligence and consciousness. 
At Arizona State University she is Deputy Director of the Beyond Center for Fundamental Concepts in Science, Associate Director of the ASU-Santa Fe Institute Center for Biosocial Complex Systems and Professor in the School of Earth and Space Exploration. She is also a member of the External Faculty at the Santa Fe Institute. She is active in public engagement in science, with appearances on "Through the Wormhole", NPR's Science Friday, and on a number of international science festivals and podcasts. She has published in leading research journals and is an internationally recognized thought leader in the study of the origins of life, alien life and the search for a deeper understanding of ourselves in our universe. Leroy (Lee) Cronin is the Regius Professor of Chemistry in Glasgow. Since the age of 9 Lee has wanted to explore chemistry using electronics to control matter. His research spans many disciplines and has four main aims: the construction of an artificial life form; the digitization of chemistry; the use of artificial intelligence in chemistry including the construction of ‘wet' chemical computers; the exploration of complexity and information in chemistry. His recent work on the digitization of chemistry has resulted in a new programming paradigm for matter and organic synthesis and discovery – chemputation – which uses the worlds first domain specific and universal programming language for chemistry – XDL, see XDL-standard.com. His team designs and builds all their own robots from the ground up and the team currently has 25 different robotic systems operating across four domains: Organic synthesis; Energy materials discovery; Nanomaterials discovery; Formulation discovery. All the systems use XDL and are easily programmable for both manufacture and discovery. His group is organised and assembled transparently around ideas, avoids hierarchy, and aims to mentor researchers using a problem-based approach. Nothing is impossible until it is tried.
Stephen Wolfram answers questions from his viewers about the history of science and technology as part of an unscripted livestream series, also available on YouTube here: https://wolfr.am/youtube-sw-qa Questions include: You recently talked about relearning the history of thermodynamics. Can I ask for resources for learning the history of thermodynamics? - Can you talk about the history of mathematical/computational linguistics (the one that studies the principles and regularities of natural languages)? There are famous Soviet mathematicians (Andreev, Sobolev, Kantorovich, Markov - son of his great father) of Kolmogorov's school who advanced this field in the 1950s through the 1970s. - What do you think about the science of statistics? Is AI just computational Statistics? - What's the most exciting thing about the AI art revolution taking place now? Was there ever a time like it? - What did Henri Poincaré think about the infinities considered by Cantor, Hilbert and Zermelo? Do engineers need the concept of a complete infinity?
We are now launching our dedicated new YouTube and Twitter! Any help in amplifying our podcast would be greatly appreciated, and of course, tell your friends! Notable follow-on discussions collected on Twitter, Reddit, Reddit, Reddit, HN, and HN. Please don't obsess too much over the GPT4 discussion as it is mostly rumor; we spent much more time on tinybox/tinygrad on which George is the foremost authority! We are excited to share the world's first interview with George Hotz on the tiny corp! If you don't know George, he was the first person to unlock the iPhone, jailbreak the PS3, went on to start Comma.ai, and briefly “interned” at the Elon Musk-run Twitter. Tinycorp is the company behind the deep learning framework tinygrad, as well as the recently announced tinybox, a new $15,000 “luxury AI computer” aimed at local model training and inference, aka your “personal compute cluster”: 738 FP16 TFLOPS; 144 GB GPU RAM; 5.76 TB/s RAM bandwidth; 30 GB/s model load bandwidth (big llama loads in around 4 seconds); AMD EPYC CPU; 1600W (one 120V outlet); runs 65B FP16 LLaMA out of the box (using tinygrad, subject to software development risks). (In the episode, we also talked about the future of the tinybox as the intelligence center of every home that will help run models, at-home robots, and more. Make sure to check the timestamps
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My impression of singular learning theory, published by Ege Erdil on June 18, 2023 on LessWrong. Disclaimer: I'm by no means an expert on singular learning theory and what I present below is a simplification that experts might not endorse. Still, I think it might be more comprehensible for a general audience than going into digressions about blowing up singularities and birational invariants. Here is my current understanding of what singular learning theory is about in a simplified (though perhaps more realistic?) discrete setting. Suppose you represent a neural network architecture as a map $A: 2^N \to F$ where $2 = \{0, 1\}$, $2^N$ is the set of all possible parameters of $A$ (seen as floating point numbers, say) and $F$ is the set of all possible computable functions from the input and output space you're considering. In thermodynamic terms, we could identify elements of $2^N$ as "microstates" and the corresponding functions that the NN architecture $A$ maps them to as "macrostates". Furthermore, suppose that $F$ comes together with a loss function $L: F \to \mathbb{R}$ evaluating how good or bad a particular function is. Assume you optimize $L$ using something like stochastic gradient descent on the function $L$ with a particular learning rate. Then, in general, we have the following results: (1) SGD defines a Markov chain structure on the space $2^N$ whose stationary distribution is proportional to $e^{-\beta L(A(\theta))}$ on parameters $\theta$ for some positive constant $\beta > 0$ that depends on the learning rate. This is just a basic fact about the Langevin dynamics that SGD would induce in such a system. (2) In general $A$ is not injective, and we can define the "A-complexity" of any function $f \in \mathrm{Im}(A) \subset F$ as $c(f) = N \log 2 - \log(|A^{-1}(f)|)$. Then, the probability that we arrive at the macrostate $f$ is going to be proportional to $e^{-c(f) - \beta L(f)}$. 
(3) When $L$ is some kind of negative log-likelihood, this approximates Solomonoff induction in a tempered Bayes paradigm - we raise likelihood ratios to a power $\beta \neq 1$ - insofar as the A-complexity $c(f)$ is a good approximation for the Kolmogorov complexity of the function $f$, which will happen if the function approximator defined by $A$ is sufficiently well-behaved. The intuition for why we would expect (3) to be true in practice has to do with the nature of the function approximator $A$. When $c(f)$ is small, it probably means that we only need a small number of bits of information on top of the definition of $A$ itself to define $f$, because "many" of the possible parameter values for $A$ are implementing the function $f$. So $f$ is probably a simple function. On the other hand, if $f$ is a simple function and $A$ is sufficiently flexible as a function approximator, we can probably implement the functionality of $f$ using only a small number of the $N$ bits in the codomain of $A$, which leaves us the rest of the bits to vary as we wish. This makes $|A^{-1}(f)|$ quite large, and by extension the complexity $c(f)$ quite small. The vague concept of "flexibility" mentioned in the paragraph above requires $A$ to have singularities of many effective dimensions, as this is just another way of saying that the image of $A$ has to contain functions with a wide range of A-complexities. If $A$ is a one-to-one function, this clean version of the theory no longer works, though if $A$ is still "close" to being singular (for instance, because many of the functions in its image are very similar) then we can still recover results like the one I mentioned above. The basic insights remain the same in this setting. I'm wondering what singular learning theory experts have to say about this simplification of their theory. Is this explanation missing some important details that are visible in the full theory? Does the full theory make some predictions that this simplified story does not make? Thanks for listening. 
To help us out with The Nonlinear Library or to learn more, please visit nonli...
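The A-complexity definition in the singular learning theory entry above can be made concrete with a small sketch. Everything here is an illustrative assumption rather than part of the original post: the 4-bit parameter space, the toy map `A`, and the zero losses are hypothetical choices made only so that the preimage counts are easy to check by hand.

```python
import math
from collections import Counter
from itertools import product

N = 4  # number of parameter bits (toy value, an assumption)

def A(theta):
    """Hypothetical toy architecture: the 'function' computed depends only on
    the first two bits, so many parameter settings map to the same macrostate."""
    return theta[0] + theta[1]  # macrostate label: 0, 1, or 2

def a_complexity(n_bits, arch):
    """c(f) = N log 2 - log |A^{-1}(f)| for every f in the image of arch."""
    preimage_sizes = Counter(arch(theta) for theta in product((0, 1), repeat=n_bits))
    return {f: n_bits * math.log(2) - math.log(k) for f, k in preimage_sizes.items()}

def macrostate_distribution(c, loss, beta):
    """Normalized weights proportional to exp(-c(f) - beta * L(f))."""
    weights = {f: math.exp(-c[f] - beta * loss[f]) for f in c}
    z = sum(weights.values())
    return {f: w / z for f, w in weights.items()}

c = a_complexity(N, A)
flat_loss = {f: 0.0 for f in c}  # assumed: every macrostate fits equally well
dist = macrostate_distribution(c, flat_loss, beta=1.0)
```

With equal losses, the macrostate reached by the most parameter settings (label 1, with 8 of the 16 settings) gets the lowest A-complexity and the largest probability, which is the "many parameter values implement a simple function" intuition in miniature.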
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: K-complexity is silly; use cross-entropy instead, published by So8res on December 20, 2022 on LessWrong. Short version The K-complexity of a function is the length of its shortest code. But having many many codes is another way to be simple! Example: gauge symmetries in physics. Correcting for length-weighted code frequency, we get an empirically better simplicity measure: cross-entropy. Long version Suppose we have a (Turing-complete) programming language $P$, and a function $f$ of the type that can be named by $P$. For example, $f$ might be the function that takes (as input) a list of numbers, and sorts it (by producing, as output, another list of numbers, with the property that the output list has the same elements as the input list, but in ascending order). Within the programming language $P$, there will be lots of different programs that represent $f$, such as a whole host of implementations of the bubblesort algorithm, and a whole host of implementations of the quicksort algorithm, and a whole host of implementations of the mergesort algorithm. Note the difference between the notion of the function ("list sorting") and the programs that represent it (bubblesort, quicksort, mergesort). Recall that the Kolmogorov complexity of $f$ in the language $P$ is the length of the shortest program that represents $f$: $\text{K-complexity}_P(f) := \min_{\{p \in P \mid \mathrm{eval}(p) = f\}} \mathrm{length}(p)$. This is often touted as a measure of the "complexity" of $f$, to the degree that people familiar with the concept often (colloquially) call a function $f$ "simple" precisely to the degree that it has low K-complexity. 
I claim that this is a bad definition, and propose the following alternative: $\text{alt-complexity}_P(f) := \mathrm{nlog}_2 \sum_{\{p \in P \mid \mathrm{eval}(p) = f\}} \mathrm{rexp}_2(\mathrm{length}(p))$, where $\mathrm{nlog}_2$ denotes logarithm base $1/2$, aka the negative of the (base 2) logarithm, and $\mathrm{rexp}_2$ denotes exponentiation base $1/2$, aka the reciprocal of the (base 2) exponential. (Note that we could just as easily use any other base $b > 1$. $e$ would be a particularly natural choice, as usual. Here I'm using 2, both because it fits with measuring the lengths of our programs in terms of bits, and because it keeps the numbers whole in our examples.) Below, I'll explore this latter definition, and its elegance and theoretical superiority. Then I'll point out that our own laws of physics seem to have (comparatively) high K-complexity and low alt-complexity, thus giving empirical justification for my "correction". Investigation A first observation is that the alt-complexity and the K-complexity agree whenever there is at most one program in $P$ that represents $f$. If there's no program, then both equations are (positive) infinite. If there's exactly one program $p^* \in P$ representing $f$, then $p^*$ will be the only term in the $\min$ and the only term in the $\sum$, so the first definition will yield $\mathrm{length}(p^*)$ whereas the second definition will yield $\mathrm{nlog}_2(\mathrm{rexp}_2(\mathrm{length}(p^*)))$, but $\mathrm{nlog}_2$ and $\mathrm{rexp}_2$ are inverses, so both definitions yield $\mathrm{length}(p^*)$. Thus, the definitions only differ when $f$ has multiple programs in the language $P$. In that case, the alt-complexity will be lower than the K-complexity, as you may verify. As a simple example, suppose there are two different programs $p_1, p_2 \in P$ that represent $f$, both of length 17. Then the K-complexity of $f$ is 17 bits, whereas the alt-complexity of $f$ is $\mathrm{nlog}_2(\mathrm{rexp}_2(17) + \mathrm{rexp}_2(17)) = -\log_2(2^{-17} + 2^{-17}) = -\log_2(2^{-16}) = 16$ bits. According to alt-complexity, having two programs (of the same length) that represent $f$ is just as good as having a single program that's one bit shorter. 
By a similar token, having 256 programs that are each n+8 bits long, is (according to alt-complexity but not K-complexity) just as good as having a single program that's n bits long. Why might this make sense? Well, suppose you're writing a program that (say) renders a certain 3D scene. You have to make some arbit...
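The two definitions in the K-complexity entry above can be compared numerically with a short sketch. It assumes we are simply handed the bit-lengths of all programs that represent $f$ (enumerating them for a real language is not computable in general); the function names `k_complexity` and `alt_complexity` are illustrative, not from the original post.

```python
import math

def k_complexity(lengths):
    """Length in bits of the shortest program; infinite if no program exists."""
    return min(lengths) if lengths else math.inf

def alt_complexity(lengths):
    """nlog2 of the sum of rexp2(length): -log2(sum over programs of 2**-length)."""
    if not lengths:
        return math.inf
    return -math.log2(sum(2.0 ** -n for n in lengths))

# Two programs of length 17, as in the worked example in the text.
two_programs = [17, 17]

# 256 programs of length n + 8, here with n = 10 as an arbitrary choice.
many_programs = [18] * 256
```

Here `alt_complexity(two_programs)` comes out one bit below `k_complexity(two_programs)`, and `alt_complexity(many_programs)` matches a single 10-bit program, while K-complexity stays at 18 bits because it only looks at the shortest code.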
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: K-types vs T-types — what priors do you have?, published by strawberry calm on November 3, 2022 on LessWrong. Summary: There are two types of people, K-types and T-types. K-types want theories with low Kolmogorov complexity and T-types want theories with low time complexity. This classification correlates with other classifications and with certain personality traits. Epistemic status: I'm somewhat confident that this classification is real and that it will help you understand why people believe the things they do. If there are major flaws in my understanding then hopefully someone will point that out. K-types vs T-types What makes a good theory? There's broad consensus that good theories should fit our observations. Unfortunately there's less consensus about how to compare the different theories that fit our observations — if we have two theories which both predict our observations to the exact same extent then how do we decide which to endorse? We can't shrug our shoulders and say "let's treat them all equally" because then we won't be able to predict anything at all about future observations. This is a consequence of the No Free Lunch Theorem: there are exactly as many theories which fit the seen observations and predict the future will look like X as there are which fit the seen observations and predict the future will look like not-X. So we can't predict anything unless we can say "these theories fitting the observations are better than these other theories which fit the observations". There are two types of people, which I'm calling "K-types" and "T-types", who differ in which theories they pick among those that fit the observations. K-types and T-types have different priors. K-types prefer theories which are short over theories which are long. They want theories you can describe in very few words. 
But they don't care how many inferential steps it takes to derive our observations within the theory. In contrast, T-types prefer theories which are quick over theories which are slow. They care how many inferential steps it takes to derive our observations within the theory, and are willing to accept longer theories if it rapidly speeds up derivation. Algorithmic characterisation. In computer science terminology, we can think of a theory as a computer program which outputs predictions. K-types penalise the Kolmogorov complexity of the program (also called the description complexity), whereas T-types penalise the time-complexity (also called the computational complexity). The T-types might still be doing perfect Bayesian reasoning even if their prior credences depend on time-complexity. Bayesian reasoning is agnostic about the prior, so there's nothing defective about assigning a low prior to programs with high time-complexity. However, T-types will deviate from Solomonoff inductors, who use a prior which exponentially decays in Kolmogorov-complexity. Proof-theoretic characterisation. When translating between proof theory and computer science, (computer program, computational steps, output) is mapped to (axioms, deductive steps, theorems) respectively. Kolmogorov-complexity maps to "total length of the axioms" and time-complexity maps to "number of deductive steps". K-types don't care how many steps there are in the proof; they only care about the number of axioms used in the proof. T-types do care how many steps there are in the proof, whether those steps are axioms or inferences. Occam's Razor characterisation. Both K-types and T-types can claim to be inheritors of Occam's Razor, in that both types prefer simple theories. But they interpret "simplicity" in two different ways. K-types consider the simplicity of the assumptions alone, whereas T-types consider the simplicity of the assumptions plus the derivation. This is the key idea. Both can accuse the other of ...
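A toy illustration of the two priors described above, as a minimal sketch. The two theories, their description lengths and their step counts are made-up numbers, and the weighting 1/steps for T-types is only one assumed way to penalise time-complexity (the post does not give a formula). The K-type weighting 2^(-length) mirrors the Solomonoff-style prior mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Theory:
    name: str
    length_bits: int  # hypothetical description length of the theory
    steps: int        # hypothetical number of steps to derive the observations


def normalise(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def k_prior(theories):
    """K-type prior: penalise description length only (2 ** -length)."""
    return normalise({t.name: 2.0 ** -t.length_bits for t in theories})


def t_prior(theories):
    """T-type prior: penalise derivation steps only (assumed form: 1 / steps)."""
    return normalise({t.name: 1.0 / t.steps for t in theories})


theories = [
    Theory("short_but_slow", length_bits=10, steps=1_000_000),
    Theory("long_but_fast", length_bits=30, steps=100),
]

k = k_prior(theories)
t = t_prior(theories)
print("K-type prior:", k)
print("T-type prior:", t)
```

With these numbers the K-type prior puts nearly all of its mass on the short theory, while the T-type prior favours the fast one, even though both theories fit the same observations.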
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Beyond Kolmogorov and Shannon, published by Alexander Gietelink Oldenziel on October 25, 2022 on The AI Alignment Forum. This post is the first in a sequence that will describe James Crutchfield's Computational Mechanics framework. We feel this is one of the most theoretically sound and promising approaches towards understanding Transformers in particular and interpretability more generally. As a heads up: Crutchfield's framework will take many posts to fully go through, but even if you don't make it all the way through there are still many deep insights we hope you will pick up along the way. EDIT: since there was some confusion about this in the comments: These initial posts are supposed to be introductory and won't get into the actually novel aspects of Crutchfield's framework yet. It's also not a dunk on existing information-theoretic measures - rather an ode! To better understand the capability and limitations of large language models it is crucial to understand the inherent structure and uncertainty ('entropy') of language data. It is natural to quantify this structure with complexity measures. We can then compare the performance of transformers to the theoretically optimal limits achieved by minimal circuits. This will be key to interpreting transformers. The two most well-known complexity measures are the Shannon entropy and the Kolmogorov complexity. We will describe why these measures are not sufficient to understand the inherent structure of language. This will serve as a motivation for more sophisticated complexity measures that better probe the intrinsic structure of language data. We will describe these new complexity measures in subsequent posts. Later in this sequence we will discuss some directions for transformer interpretability work. 
Compression is the path to understanding Imagine you are an agent coming across some natural system. You stick an appendage into the system, effectively measuring its states. You measure for a million timepoints and get mysterious data that looks like this: ...00110100100100110110110100110100100100110110110100100110110100... You want to gain an understanding of how this system generates this data, so that you can predict its output, so you can take advantage of the system to your own ends, and because gaining understanding is an intrinsic joy. In reality the data was generated in the following way: output 0, then 1, then you flip a fair coin, and then repeat. Is there some kind of framework or algorithm where we can reliably come to this understanding? As others have noted, understanding is related to abstraction, prediction, and compression. We operationalize understanding by saying an agent has an understanding of a dataset if it possesses a compressed generative model: i.e. a program that is able to generate samples that (approximately) simulate the hidden structure, both deterministic and random, in the data. Note that pure prediction is not understanding. As a simple example take the case of predicting the outcomes of 100 fair coin tosses. Predicting tails every flip will give you maximum expected predictive accuracy (50%), but it is not the correct generative model for the data. Over the course of this sequence, we will come to formally understand why this is the case. Standard measures of information theory do not work To start let's consider the Kolmogorov Complexity and Shannon Entropy as measures of compression, and see why they don't quite work for what we want. 
Kolmogorov Complexity Recall that the Kolmogorov(-Chaitin-Solomonoff) complexity K(x) of a bit string x is defined as the length of the shortest programme outputting x [given a blank output on a chosen universal Turing machine] One often discussed downside of the K complexity is that it is incomputable. But there is another more conceptual do...
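The generating process described above (output 0, then 1, then a fair coin flip, and repeat) can be written as a short program, which is the kind of compressed generative model the post has in mind. The sketch below is illustrative only: the function names and the random seed are assumptions, not part of the original post. It also checks the side remark that always guessing one side of a fair coin is right about half the time without being a model of the coin.

```python
import random


def generate(n_blocks, rng):
    """Emit n_blocks repetitions of: 0, then 1, then one fair coin flip."""
    bits = []
    for _ in range(n_blocks):
        bits.extend([0, 1, rng.randint(0, 1)])
    return bits


def constant_guess_accuracy(n_flips, rng, guess=0):
    """Fraction of fair coin flips matched by always predicting `guess`."""
    flips = [rng.randint(0, 1) for _ in range(n_flips)]
    return sum(f == guess for f in flips) / n_flips


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = generate(20, rng)
accuracy = constant_guess_accuracy(100_000, rng)

print("".join(str(b) for b in sample))
print(f"accuracy of a constant guess: {accuracy:.3f}")
```

Two of every three positions in the sample are fully determined and the third is pure noise, so a good model has to capture both the deterministic and the random part.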
Max starts with a brief news update on Ethereum, and then moves to the Kolmogorov axioms of probability. What is an axiom system anyway - and why would someone want to change it?
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Quintin's alignment papers roundup - week 2, published by Quintin Pope on September 19, 2022 on LessWrong. Introduction Last week's paper roundup (more or less by accident) focused mostly on path dependence of deep learning and the order of feature learning. Going forwards, I've decided to have an explicit focus for each week's roundup. This week's focus is on the structure/redundancy of trained models, as well as linear interpolations through parameter space. I've also decided to publish each roundup on Monday morning. Papers Residual Networks Behave Like Ensembles of Relatively Shallow Networks In this work we propose a novel interpretation of residual networks showing that they can be seen as a collection of many paths of differing length. Moreover, residual networks seem to enable very deep networks by leveraging only the short paths during training. To support this observation, we rewrite residual networks as an explicit collection of paths. Unlike traditional models, paths through residual networks vary in length. Further, a lesion study reveals that these paths show ensemble-like behavior in the sense that they do not strongly depend on each other. Finally, and most surprising, most paths are shorter than one might expect, and only the short paths are needed during training, as longer paths do not contribute any gradient. For example, most of the gradient in a residual network with 110 layers comes from paths that are only 10-34 layers deep. Our results reveal one of the key characteristics that seem to enable the training of very deep networks: Residual networks avoid the vanishing gradient problem by introducing short paths which can carry gradient throughout the extent of very deep networks. 
My opinion: This paper suggests that neural nets are redundant by default, which gives some intuition for why it's often possible to prune large fractions of a network's parameters without much impact on the test performance, as well as the mechanism by which residual connections allow for training deeper networks: residual connections allow shallow nets to communicate directly with the input / output space, so they allow for deep nets to be built from ensembling shallow nets. I think it also points away from neural nets implementing a Kolmogorov or circuit simplicity prior. On the Effect of Dropping Layers of Pre-trained Transformer Models Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting their applicability in computationally constrained environments. While the number of parameters generally correlates with performance, it is not clear whether the entire network is required for a downstream task. Motivated by the recent work on pruning and distilling pre-trained models, we explore strategies to drop layers in pre-trained models, and observe the effect of pruning on downstream GLUE tasks. We were able to prune BERT, RoBERTa and XLNet models up to 40%, while maintaining up to 98% of their original performance. Additionally we show that our pruned models are on par with those built using knowledge distillation, both in terms of size and performance. Our experiments yield interesting observations such as, (i) the lower layers are most critical to maintain downstream task performance, (ii) some tasks such as paraphrase detection and sentence similarity are more robust to the dropping of layers, and (iii) models trained using a different objective function exhibit different learning patterns w.r.t. the layer dropping. My opinion: (see below) Of Non-Linearity and Commutativity in BERT In this work we provide new insights into the transformer architecture, and in particular, its best-known variant, BERT. 
First, we propose a method to measure the degree of non-linearity of different elements of transformers. Next, w...
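The "collection of many paths of differing length" view of residual networks described above can be made concrete with a small count. This is a minimal sketch under a stated assumption: each residual block is either traversed or skipped, so a network with n blocks has 2^n paths and the number of paths through exactly k blocks is a binomial coefficient. The value of 54 blocks is an illustrative assumption, and the sketch only counts paths; it does not model how much gradient each path carries.

```python
from math import comb


def path_length_counts(n_blocks):
    """Number of paths that pass through exactly k blocks, for k = 0..n_blocks."""
    return [comb(n_blocks, k) for k in range(n_blocks + 1)]


n_blocks = 54  # assumed number of residual blocks, for illustration only
counts = path_length_counts(n_blocks)
total_paths = sum(counts)

# Path length shared by the largest number of paths.
most_common_length = max(range(n_blocks + 1), key=counts.__getitem__)

# Fraction of all paths that go through every single block.
full_depth_share = counts[n_blocks] / total_paths

print("total paths:", total_paths)
print("most common path length:", most_common_length)
print("share of full-depth paths:", full_depth_share)
```

Under this counting, path lengths cluster around half the number of blocks and the single full-depth path is a vanishing fraction of the total, which is one way to read the claim that most paths are shorter than one might expect.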
Dissipative magnetic structures and scales in small-scale dynamos by A. Brandenburg et al. on Monday 19 September Small-scale dynamos play important roles in modern astrophysics, especially on Galactic and extragalactic scales. Owing to dynamo action, purely hydrodynamic Kolmogorov turbulence hardly exists and is often replaced by hydromagnetic turbulence. Understanding the size of dissipative magnetic structures is important in estimating the time scale of Galactic scintillation and other observational and theoretical aspects of interstellar and intergalactic small-scale dynamos. Here we show that the thickness of magnetic flux tubes decreases more rapidly with increasing magnetic Prandtl number than previously expected. Also the theoretical scale based on the dynamo growth rate and the magnetic diffusivity decreases faster than expected. However, the scale based on the cutoff of the magnetic energy spectra scales as expected for large magnetic Prandtl numbers, but continues in the same way also for moderately small values - contrary to what is expected. For a critical magnetic Prandtl number of about 0.27, the dissipative and resistive cutoffs are found to occur at the same wavenumber. For large magnetic Prandtl numbers, our simulations show that the peak of the magnetic energy spectrum occurs at a wavenumber that is twice as large as previously predicted. arXiv: http://arxiv.org/abs/2209.08717v1
Empirical constraints on the turbulence in QSO host nebulae from velocity structure function measurements by Mandy C. Chen et al. on Monday 12 September We present the first empirical constraints on the turbulent velocity field of the diffuse circumgalactic medium around four luminous QSOs at $z\approx 0.5$--1.1. Spatially extended nebulae of $\approx 50$--100 physical kpc in diameter centered on the QSOs are revealed in [OII]$\lambda\lambda\,3727,3729$ and/or [OIII]$\lambda\,5008$ emission lines in integral field spectroscopic observations obtained using MUSE on the VLT. We measure the second- and third-order velocity structure functions (VSFs) over a range of scales, from $\lesssim 5$ kpc to $\approx 20$--50 kpc, to quantify the turbulent energy transfer between different scales in these nebulae. While no constraints on the energy injection and dissipation scales can be obtained from the current data, we show that robust constraints on the power-law slope of the VSFs can be determined after accounting for the effects of atmospheric seeing, spatial smoothing, and large-scale bulk flows. Out of the four QSO nebulae studied, one exhibits VSFs in spectacular agreement with the Kolmogorov law, expected for isotropic, homogeneous, and incompressible turbulent flows. The other three fields exhibit a shallower decline in the VSFs from large to small scales but with loose constraints, in part due to a limited dynamic range in the spatial scales in seeing-limited data. For the QSO nebula consistent with the Kolmogorov law, we determine a turbulence energy cascade rate of $\approx 0.2$ cm$^{2}$ s$^{-3}$. We discuss the implication of the observed VSFs in the context of QSO feeding and feedback in the circumgalactic medium. arXiv: http://arxiv.org/abs/2209.04344v1
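For reference, the Kolmogorov law that the abstract above compares against can be written out as follows. This is a sketch of the standard inertial-range scaling, not the paper's own fitting procedure; the separation of 10 kpc used in the last line is an assumed value chosen for illustration, and projection effects on line-of-sight velocities are ignored.

```latex
% Structure function of order p for velocity increments over separation \ell
S_p(\ell) \equiv \langle |\delta v(\ell)|^p \rangle, \qquad
\delta v(\ell) = v(\mathbf{x} + \boldsymbol{\ell}) - v(\mathbf{x})

% Kolmogorov (1941) inertial-range scaling with energy cascade rate \epsilon
S_p(\ell) \propto (\epsilon\,\ell)^{p/3}
\;\Rightarrow\; S_2(\ell) \propto \ell^{2/3}, \quad S_3(\ell) \propto \ell

% Order-of-magnitude check: \epsilon \approx 0.2 cm^2 s^{-3}, assumed \ell = 10 kpc \approx 3.09e22 cm
(\epsilon\,\ell)^{1/3} \approx \left(0.2 \times 3.09\times10^{22}\right)^{1/3}\,\mathrm{cm\,s^{-1}}
\approx 1.8\times10^{7}\,\mathrm{cm\,s^{-1}} \approx 180\,\mathrm{km\,s^{-1}}
```

The units are consistent: a cascade rate in cm$^{2}$ s$^{-3}$ times a length in cm gives a velocity cubed, so the quoted rate corresponds to velocity differences of order 100 km s$^{-1}$ on 10 kpc scales.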
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Gradient descent doesn't select for inner search, published by Ivan Vendrov on August 13, 2022 on LessWrong. TL;DR: Gradient descent won't select for inner search processes because they're not compute & memory efficient. Slightly longer TL;DR: A key argument for mesa-optimization is that as we search over programs, we will select for "search processes with simple objectives", because they are simpler or more compact than alternative less dangerous programs. This argument is much weaker when your program search is restricted to programs that use a fixed amount of compute, and you're not optimizing strongly for low description length - e.g. gradient descent in modern deep learning systems. We don't really know what shape of programs gradient descent selects for in realistic environments, but they are much less likely to involve search than commonly believed. Note on terminology (added in response to comments): By "search" I mean here a process that evaluates a number of candidates before returning the best one; what Abram Demski calls "selection" in Selection vs Control. The more candidates considered, the more "search-like" a process is - with gradient descent and A* being central examples, and a thermostat being a central counter-example. Recap: compression argument for inner optimizers Here's the argument from Risks From Learned Optimization: [emphasis mine] In some tasks, good performance requires a very complex policy. At the same time, base optimizers are generally biased in favor of selecting learned algorithms with lower complexity. Thus, all else being equal, the base optimizer will generally be incentivized to look for a highly compressed policy. 
One way to find a compressed policy is to search for one that is able to use general features of the task structure to produce good behavior, rather than simply memorizing the correct output for each input. A mesa-optimizer is an example of such a policy. From the perspective of the base optimizer, a mesa-optimizer is a highly-compressed version of whatever policy it ends up implementing: instead of explicitly encoding the details of that policy in the learned algorithm, the base optimizer simply needs to encode how to search for such a policy. Furthermore, if a mesa-optimizer can determine the important features of its environment at runtime, it does not need to be given as much prior information as to what those important features are, and can thus be much simpler. and even more forceful phrasing from John Wentworth: We don't know that the AI will necessarily end up optimizing reward-button-pushes or smiles; there may be other similarly-compact proxies which correlate near-perfectly with reward in the training process. We can probably rule out "a spread of situationally-activated computations which steer its actions towards historical reward-correlates", insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s). Compactness, Complexity, and Compute At face value, it does seem like we're selecting programs for simplicity. The Deep Double Descent paper showed us that gradient descent training in the overparametrized regime (i.e. the regime of all modern deep models) favors simpler models. But is this notion of simplicity the same as "compactness" or "complexity"? Evan seems to think so, I'm less sure. Let's dive into the different notions of complexity here. The most commonly used notion of program complexity is Kolmogorov complexity (or description length), basically just "length of the program in some reference programming language". This definition seems natural... 
but, critically, it assumes away all computational constraints. K-complexity doesn't care if your program completes in a millisecond or runs until the heat death of the universe. This makes it a ...
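The gap between description length and compute described above can be seen in a small experiment. This is a rough sketch under loud assumptions: source-code length in characters stands in for Kolmogorov complexity (which is uncomputable), the number of calls to `fib` stands in for running time, and both programs are invented for illustration rather than taken from the post.

```python
# Two programs that compute the same function on 0..20.
SHORT_BUT_SLOW = (
    "def fib(n):\n"
    "    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
)

LONG_BUT_FAST = (
    "TABLE = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377,\n"
    "         610, 987, 1597, 2584, 4181, 6765]\n"
    "def fib(n):\n"
    "    return TABLE[n]\n"
)


def load_counted(source):
    """Run `source` and return its `fib` wrapped so that every call is counted."""
    namespace = {}
    exec(source, namespace)
    inner = namespace["fib"]
    calls = [0]

    def counted(n):
        calls[0] += 1
        return inner(n)

    # Recursive calls look up `fib` in the same namespace, so they are counted too.
    namespace["fib"] = counted
    return counted, calls


slow_fib, slow_calls = load_counted(SHORT_BUT_SLOW)
fast_fib, fast_calls = load_counted(LONG_BUT_FAST)
slow_value = slow_fib(20)
fast_value = fast_fib(20)

print("short program:", len(SHORT_BUT_SLOW), "chars,", slow_calls[0], "calls")
print("long program: ", len(LONG_BUT_FAST), "chars,", fast_calls[0], "calls")
```

A selection pressure on length alone would prefer the recursive program, while a fixed compute budget would rule it out long before the lookup table, which is the distinction the post is drawing.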
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Tao, Kontsevich & others on HLAI in Math, published by interstice on June 10, 2022 on LessWrong. I found this 2015 panel with Terence Tao and some other eminent mathematicians to be interesting. The panel covered various topics but got into the question of when computers will be able to do research-level mathematics. Most interestingly, Maxim Kontsevich was alone in predicting that HLAI in math was plausible in our lifetime -- but also, that developing such an AI might not be a good idea. He also mentioned a BioAnchors-style AI forecast by Kolmogorov that I had never heard of before (and cannot find a reference to -- anyone know of such a thing?) Excerpts below: INTERVIEWER: Do you imagine that, maybe in 100 years, or 1000 years, that, like it happened in chess -- humans still play tournaments but everyone knows computers are better -- is it conceivable that this could happen in mathematics? TAO: I think computers will be able to do things much more efficiently with the right computer tools. Search engines, for example, often you'll type in a query to Google and it will come back with "do you mean this" and often you did. One could imagine that if you had a really good computer assistant working on some math problem, it will keep suggesting "should you do this? have you considered looking at this paper?" You could imagine this would really speed up the way we do research. Sometimes you're stuck for months because you just don't know some key trick that is buried in some other field of expertise. Some sort of advanced Google could suggest this to you. So I think we will use computers to do things much more efficiently than we do currently, but it will still be humans driving the show, I'm pretty sure. INTERVIEWER: Maxim, do you think anything like this [HLAI, I assume] is possible? 
MAXIM: I think it's perfectly possible, maybe in our lifetime. INTERVIEWER: Why do you think so? MAXIM: I don't think artificial intelligence is very hard. It will be pretty soon I suppose. INTERVIEWER: You are a contrarian here, saying it will happen so quickly. So what makes you so optimistic? MAXIM: Optimistic? No, it's actually pessimistic. I thought about it myself a little bit, I don't think there are fundamental difficulties here. INTERVIEWER: So why don't you just work on that instead? MAXIM: I think it would be immoral to work on it. MILNER: I'm no expert, but isn't the way the computer played chess not really very intelligent? It's a huge combinatorial check. Inventing the sort of mathematics you've invented, that's not combinatorial checking, it's entirely conceptual. MAXIM: Yeah OK, sure. MILNER: Is there any case we know of computers doing anything like that? MAXIM: We don't know any examples, but it's not inconceivable. MILNER: It's not inconceivable...but I would be very surprised if we saw a computer win a Fields medal in our lifetime. TAO: One could imagine that a computer could discover just by brute force a connection between two fields of mathematics that wasn't suspected, and then the person on the computer would be able to flesh it out. Maybe he would collect the medal. MAXIM: Actually, Kolmogorov thought that mathematics will be extinct in 100 years, he had an estimate. He calculated the number of neurons and connections, he made the head something like one cubic meter. So yes, maybe a crazy estimate, but he was also thinking about natural boundaries. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Information Loss --> Basin flatness, published by Vivek Hebbar on May 21, 2022 on The AI Alignment Forum. This work was done under the mentorship of Evan Hubinger through the SERI MATS program. Thanks to Lucius Bushnaq, John Wentworth, Quintin Pope, and Peter Barnett for useful feedback and suggestions. In this theory, the main proximate cause of flat basins is a type of information loss. Its relationship with circuit complexity and Kolmogorov complexity is currently unknown to me. In this post, I will demonstrate that: High-dimensional solution manifolds are caused by linear dependence between the "behavioral gradients" for different inputs. This linear dependence is usually caused when networks throw away information which distinguishes different training inputs. It is more likely to occur when the information is thrown away early or by ReLU. Overview for advanced readers: [Short version] Information Loss --> Basin flatness Behavior manifolds Suppose we have a regression task with 1-dimensional labels and k training examples. Let us take an overparameterized network with N parameters. Every model in parameter space is part of a manifold, where every point on that manifold has identical behavior on the training set. These manifolds are usually at least N−k dimensional, but some are higher dimensional than this. I will call these manifolds "behavior manifolds", since points on the same manifold have the same behavior (on the training set, not on all possible inputs). We can visualize the existence of "behavior manifolds" by starting with a blank parameter space, then adding contour planes for each training example. Before we add any contour planes, the entire parameter space is a single manifold, with "identical behavior" on the null set. 
First, let us add the contour planes for input 1: Each plane here is an N−1 dimensional manifold, where every model on that plane has the same output on input 1. They slice parameter space into N−1 dimensional regions. Each of these regions is an equivalence class of functions, which all behave about the same on input 1. Next, we can add contour planes for input 2: When we put them together, they look like this: Together, the contours slice parameter space into N−2 dimensional regions. Each "diamond" in the picture is the cross-section of a tube-like region which extends vertically, in the direction which is parallel to both sets of planes. The manifolds of constant behavior are lines which run vertically through these tubes, parallel to both sets of contours. In higher dimensions, these "lines" and "tubes" are actually N−2 dimensional hyperplanes, since only two degrees of freedom have been removed, one by each set of contours. We can continue this with more and more inputs. Each input adds another set of hyperplanes, and subtracts one more dimension from the identical-behavior manifolds. Since each input can only slice off one dimension, the manifolds of constant behavior are at least N−k dimensional, where k is the number of training examples. Solution manifolds Global minima also lie on behavior manifolds, such that every point on the manifold is a global minimum. I will call these "solution manifolds". These manifolds generally extend out to infinity, so it isn't really meaningful to talk about literal "basin volume". We can focus instead on their dimensionality. All else being equal, a higher dimensional solution manifold should drain a larger region of parameter space, and thus be favored by the inductive bias. Parallel contours allow higher manifold dimension Suppose we have 3 parameters (one is off-the-page) and 2 inputs. 
If the contours are perpendicular: Then the green regions are cross-sections of tubes extending infinitely off-the-page, where each tube contains models that are roughly equivalent on the training set. The...
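The dimension counting in this episode's description (each input removes at most one dimension, and linearly dependent "behavioral gradients" remove fewer) can be illustrated with a small sketch. This is not code from the post: it assumes a local linear picture in which each input contributes one gradient vector in parameter space, so that the local dimension of the behavior manifold is N minus the rank of the stacked gradients. The function names and the two example gradient sets are hypothetical.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a small matrix via Gaussian elimination (pure Python)."""
    m = [[float(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == len(m):
            break
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def manifold_dimension(behavioral_gradients, n_params):
    """Assumed local picture: dimension = N - rank(gradients)."""
    return n_params - matrix_rank(behavioral_gradients)


# Hypothetical example: N = 3 parameters, k = 2 inputs.
perpendicular = [[1, 0, 0], [0, 1, 0]]  # independent gradients
parallel = [[1, 0, 0], [2, 0, 0]]       # linearly dependent gradients

dim_perpendicular = manifold_dimension(perpendicular, n_params=3)
dim_parallel = manifold_dimension(parallel, n_params=3)
print(dim_perpendicular, dim_parallel)
```

With perpendicular contours the manifold has the minimum dimension N−k = 1; with parallel contours the second input removes no further dimension, so the manifold is 2-dimensional.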
In this episode we talk about the main laws of large numbers. That's right: there is more than one. Here we cover Bernoulli's and Kolmogorov's, and we also touch on a few other variants. Disclaimer: I won't help you find your better half, sorry.
This episode closes our survey of the main definitions of probability. Today we turn to the logicism of Keynes, Jaynes and Jeffreys, and to Kolmogorov's axiomatic approach. With the picture of probability complete, we will be ready to move on to risk measures.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Value Learning, Part 4: Humans can be assigned any values whatsoever, published by Stuart_Armstrong. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. (Re)Posted as part of the AI Alignment Forum sequence on Value Learning. Rohin's note: In the last post, we saw that a good broad value learning approach would need to understand the systematic biases in human planning in order to achieve superhuman performance. Perhaps we can just use machine learning again and learn the biases and reward simultaneously? This post by Stuart Armstrong (original here) and the associated paper say: “Not without more assumptions.” This post comes from a theoretical perspective that may be alien to ML researchers; in particular, it makes an argument that simplicity priors do not solve the problem pointed out here, where simplicity is based on Kolmogorov complexity (which is an instantiation of the Minimum Description Length principle). The analog in machine learning would be an argument that regularization would not work. The proof used is specific to Kolmogorov complexity and does not clearly generalize to arbitrary regularization techniques; however, I view the argument as being suggestive that regularization techniques would also be insufficient to address the problems raised here. Humans have no values... nor do any agent. Unless you make strong assumptions about their rationality. And depending on those assumptions, you get humans to have any values. An agent with no clear preferences There are three buttons in this world, $B_0$, $B_1$, and $X$, and one agent $H$. $B_0$ and $B_1$ can be operated by $H$, while $X$ can be operated by an outside observer. $H$ will initially press button $B_0$; if ever $X$ is pressed, the agent will switch to pressing $B_1$. 
If $X$ is pressed again, the agent will switch back to pressing $B_0$, and so on. After a large number of turns $N$, $H$ will shut off. That's the full algorithm for $H$. So the question is, what are the values/preferences/rewards of $H$? There are three natural reward functions that are plausible: $R_0$, which is linear in the number of times $B_0$ is pressed. $R_1$, which is linear in the number of times $B_1$ is pressed. $R_2 = I_E(X) R_0 + I_O(X) R_1$, where $I_E(X)$ is the indicator function for $X$ being pressed an even number of times, $I_O(X) = 1 - I_E(X)$ being the indicator function for $X$ being pressed an odd number of times. For $R_0$, we can interpret $H$ as an $R_0$ maximising agent which $X$ overrides. For $R_1$, we can interpret $H$ as an $R_1$ maximising agent which $X$ releases from constraints. And $R_2$ is the “$H$ is always fully rational” reward. Semantically, these make sense for the various $R_i$'s being a true and natural reward, with $X$ = “coercive brain surgery” in the first case, $X$ = “release $H$ from annoying social obligations” in the second, and $X$ = “switch which of $R_0$ and $R_1$ gives you pleasure” in the last case. But note that there are no semantic implications here; all that we know is $H$, with its full algorithm. If we wanted to deduce its true reward for the purpose of something like Inverse Reinforcement Learning (IRL), what would it be? Modelling human (ir)rationality and reward Now let's talk about the preferences of an actual human. We all know that humans are not always rational. But even if humans were fully rational, the fact remains that we are physical, and vulnerable to things like coercive brain surgery (and in practice, to a whole host of other more or less manipulative techniques). So there will be the equivalent of “button $X$” that overrides human preferences. Thus, “not immortal and unchangeable” is in practice enough for the agent to be considered “not fully rational”. 
Now assume that we've thoroughly observed a given human $h$ (including their internal brain wiring), so we know the human policy $\pi_h$ (which determines their actions in a...
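The button-pressing agent H from this description is small enough to simulate directly. The sketch below is an illustration rather than code from the post, and it makes two labeled assumptions: X is pressed at the start of a turn, before H acts, and each reward is counted as one unit per matching press (a slope-1 instance of "linear in the number of presses"). The names `run_agent` and `rewards` are hypothetical.

```python
def run_agent(x_press_turns, n_turns):
    """Simulate H for n_turns; returns (button, x_count) per turn.

    Assumption: X presses happen at the start of a turn, before H acts.
    """
    history = []
    x_count = 0
    for turn in range(n_turns):
        if turn in x_press_turns:
            x_count += 1
        # H presses B0 while X has been pressed an even number of times,
        # and B1 while it has been pressed an odd number of times.
        button = "B0" if x_count % 2 == 0 else "B1"
        history.append((button, x_count))
    return history


def rewards(history):
    """Total R0, R1, R2 over a run, assuming one unit per press."""
    r0 = sum(1 for button, _ in history if button == "B0")
    r1 = sum(1 for button, _ in history if button == "B1")
    # R2 = I_E(X) * R0 + I_O(X) * R1, evaluated turn by turn.
    r2 = sum(
        1
        for button, x_count in history
        if (x_count % 2 == 0 and button == "B0")
        or (x_count % 2 == 1 and button == "B1")
    )
    return r0, r1, r2


history = run_agent(x_press_turns={3, 7}, n_turns=10)
r0, r1, r2 = rewards(history)
print(r0, r1, r2)
```

In this run H collects the maximum possible R2 (one unit on every turn) while earning only part of the available R0 or R1, which is why the same policy can be read as fully rational under R2 or as an overridden maximiser under R0 or R1.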
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Realism about rationality, published by Richard_Ngo on LessWrong. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a linkpost for http://thinkingcomplete.blogspot.com/2018/09/rational-and-real.html Epistemic status: trying to vaguely gesture at vague intuitions. A similar idea was explored here under the heading "the intelligibility of intelligence", although I hadn't seen it before writing this post. As of 2020, I consider this follow-up comment to be a better summary of the thing I was trying to convey with this post than the post itself. There's a mindset which is common in the rationalist community, which I call “realism about rationality” (the name being intended as a parallel to moral realism). I feel like my skepticism about agent foundations research is closely tied to my skepticism about this mindset, and so in this essay I try to articulate what it is. Humans ascribe properties to entities in the world in order to describe and predict them. Here are three such properties: "momentum", "evolutionary fitness", and "intelligence". These are all pretty useful properties for high-level reasoning in the fields of physics, biology and AI, respectively. There's a key difference between the first two, though. Momentum is very amenable to formalisation: we can describe it using precise equations, and even prove things about it. Evolutionary fitness is the opposite: although nothing in biology makes sense without it, no biologist can take an organism and write down a simple equation to define its fitness in terms of more basic traits. This isn't just because biologists haven't figured out that equation yet. 
Rather, we have excellent reasons to think that fitness is an incredibly complicated "function" which basically requires you to describe that organism's entire phenotype, genotype and environment. In a nutshell, then, realism about rationality is a mindset in which reasoning and intelligence are more like momentum than like fitness. It's a mindset which makes the following ideas seem natural: The idea that there is a simple yet powerful theoretical framework which describes human intelligence and/or intelligence in general. (I don't count brute force approaches like AIXI for the same reason I don't consider physics a simple yet powerful description of biology). The idea that there is an “ideal” decision theory. The idea that AGI will very likely be an “agent”. The idea that Turing machines and Kolmogorov complexity are foundational for epistemology. The idea that, given certain evidence for a proposition, there's an "objective" level of subjective credence which you should assign to it, even under computational constraints. The idea that Aumann's agreement theorem is relevant to humans. The idea that morality is quite like mathematics, in that there are certain types of moral reasoning that are just correct. The idea that defining coherent extrapolated volition in terms of an idealised process of reflection roughly makes sense, and that it converges in a way which doesn't depend very much on morally arbitrary factors. The idea that having contradictory preferences or beliefs is really bad, even when there's no clear way that they'll lead to bad consequences (and you're very good at avoiding dutch books and money pumps and so on). To be clear, I am neither claiming that realism about rationality makes people dogmatic about such ideas, nor claiming that they're all false. In fact, from a historical point of view I'm quite optimistic about using maths to describe things in general. 
But starting from that historical baseline, I'm inclined to adjust downwards on questions related to formalising intelligent thought, whereas rationality realism would endorse adjusting upwards. This essay is primarily intended to explain...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Realism about rationality, published by Richard Ngo on the AI Alignment Forum. This is a linkpost for http://thinkingcomplete.blogspot.com/2018/09/rational-and-real.html Epistemic status: trying to vaguely gesture at vague intuitions. A similar idea was explored here under the heading "the intelligibility of intelligence", although I hadn't seen it before writing this post. As of 2020, I consider this follow-up comment to be a better summary of the thing I was trying to convey with this post than the post itself. There's a mindset which is common in the rationalist community, which I call “realism about rationality” (the name being intended as a parallel to moral realism). I feel like my skepticism about agent foundations research is closely tied to my skepticism about this mindset, and so in this essay I try to articulate what it is. Humans ascribe properties to entities in the world in order to describe and predict them. Here are three such properties: "momentum", "evolutionary fitness", and "intelligence". These are all pretty useful properties for high-level reasoning in the fields of physics, biology and AI, respectively. There's a key difference between the first two, though. Momentum is very amenable to formalisation: we can describe it using precise equations, and even prove things about it. Evolutionary fitness is the opposite: although nothing in biology makes sense without it, no biologist can take an organism and write down a simple equation to define its fitness in terms of more basic traits. This isn't just because biologists haven't figured out that equation yet. Rather, we have excellent reasons to think that fitness is an incredibly complicated "function" which basically requires you to describe that organism's entire phenotype, genotype and environment. 
In a nutshell, then, realism about rationality is a mindset in which reasoning and intelligence are more like momentum than like fitness. It's a mindset which makes the following ideas seem natural: The idea that there is a simple yet powerful theoretical framework which describes human intelligence and/or intelligence in general. (I don't count brute force approaches like AIXI for the same reason I don't consider physics a simple yet powerful description of biology). The idea that there is an “ideal” decision theory. The idea that AGI will very likely be an “agent”. The idea that Turing machines and Kolmogorov complexity are foundational for epistemology. The idea that, given certain evidence for a proposition, there's an "objective" level of subjective credence which you should assign to it, even under computational constraints. The idea that Aumann's agreement theorem is relevant to humans. The idea that morality is quite like mathematics, in that there are certain types of moral reasoning that are just correct. The idea that defining coherent extrapolated volition in terms of an idealised process of reflection roughly makes sense, and that it converges in a way which doesn't depend very much on morally arbitrary factors. The idea that having contradictory preferences or beliefs is really bad, even when there's no clear way that they'll lead to bad consequences (and you're very good at avoiding dutch books and money pumps and so on). To be clear, I am neither claiming that realism about rationality makes people dogmatic about such ideas, nor claiming that they're all false. In fact, from a historical point of view I'm quite optimistic about using maths to describe things in general. But starting from that historical baseline, I'm inclined to adjust downwards on questions related to formalising intelligent thought, whereas rationality realism would endorse adjusting upwards. 
This essay is primarily intended to explain my position, not justify it, but one important consideration for me is th...
With literally just hours to go before the 2021 Hackaday Remoticon kicks off, editors Tom Nardi and Elliot Williams still managed to find time to talk about some of the must-see stories from the last week. There are fairly heavyweight topics on the docket this time around, from alternate methods of multiplying large numbers to the incredible engineering that goes into producing high purity silicon. But we'll also talk about the movie making magic of Stan Winston and some Pokemon-themed environmental sensors, so it should all balance out nicely. So long as the Russians haven't kicked off the Kessler effect by the time you tune in, we should be good. Check out the show notes for links and more!
Is administering a Covid-19 test on yourself difficult, or are the instructions just confusing? How should we explain complexity and is there a limit to how much we can simplify things? In our latest podcast, we discuss different ways of simplifying information, how to judge the right level of detail for a given context, and whether reductionism is always a useful concept. We look at how simplification can help or hinder understanding, examining some of the consequences of oversimplification. A few things we mentioned in this podcast: - Reddit: Explain Like I'm Five https://www.reddit.com/r/explainlikeimfive/ - Shannon information and Kolmogorov complexity https://homepages.cwi.nl/~paulv/papers/info.pdf For more information on Aleph Insights visit our website https://alephinsights.com or to get in touch about our podcast email podcast@alephinsights.com
In this episode of the History of Science Series (Bilim Tarihi Serisi), featuring Assoc. Prof. Dr. Serhan Yarkan and Halil Said Cankurtaran, we talk about Harry Nyquist, who was born on 7 February 1889 and passed away on 4 April 1976. An employee of Bell Laboratories, Nyquist held more than 130 patents and also published 12 scientific papers. As we mentioned in our Introduction to the Concept of Noise episode, Nyquist did important work in the field of thermal noise. He also worked on the stability of systems, a topic we introduced in our episode on Laplace. In addition, Nyquist's contributions to Information Theory and Communication Theory form the foundations of today's digital technologies. In this episode, in which we also touch on the work done at Bell Laboratories, scientists such as Claude Elwood Shannon, Ralph Hartley, Norbert Wiener, Lyapunov, Chebyshev, Kolmogorov, and Smirnov are also mentioned. Enjoy listening.
#66. George Gamow and Science Communication (History of Science Series B1: Part I) - 25/10/2020: https://youtu.be/qIARyX8p8lg
#68. A Preface to Our History of Science Series (History of Science Series B2) - 08/11/2020: https://youtu.be/FVUc5tfYi7I
#70. George Gamow - Eastern and Western Blocs in Science (History of Science Series B3: Part II) - 22/11/2020: https://youtu.be/7k_IRL_B8WA
#71. Michael Faraday (History of Science Series B4) - 29/11/2020: https://youtu.be/OtEQ0pI-baI
#73. The Importance and Historical Development of Set Theory (History of Science Series B5: Part I): https://youtu.be/pSksJkWK6wU
#76. The Effects of Set Theory (History of Science Series B6: Part II): https://youtu.be/gtpdAUaCgzw
#77. Set Theory and Computation (History of Science Series B7: Part III): https://youtu.be/TMt_rUbE4M4
#78. The Paradoxes of Set Theory (History of Science Series B8: Part IV) - 17/01/2021: https://youtu.be/qHMdAjr4lQ0
#79. The Present-Day Use of Set Theory (History of Science Series B9: Part V) - 24/01/2021: https://youtu.be/WoF5_A7nKQM
#84. Introduction to the Concept of Noise (History of Science Series B10: Part I) - 28/02/2021: https://youtu.be/4nCgno6XDVM
#88. Pierre-Simon, Marquis de Laplace (History of Science Series B11: Part I) - 28/03/2021: https://youtu.be/-jRuE37K_M0
Tapir Lab. GitHub: @TapirLab, https://github.com/tapirlab/ Tapir Lab. Instagram: @tapirlab, https://www.instagram.com/tapirlab/ Tapir Lab. Twitter: @tapirlab, https://twitter.com/tapirlab Tapir Lab.: http://www.tapirlab.com
Andreï Kolmogorov was a Russian mathematician (1903-1987) who made striking contributions to probability theory, ergodic theory, turbulence, classical mechanics, mathematical logic, topology, algorithmic information theory, and the analysis of the complexity of algorithms. Alexander Bufetov, CNRS Research Director (I2M - Aix-Marseille Université, CNRS, Centrale Marseille) and local holder of the Chaire Jean-Morlet (Chaire Tamara Grava 2019 - semester 1), will give a lecture on the exceptional contributions and dramatic life of a great genius of the 20th century. Public lecture at CIRM Luminy. Video available on the YouTube channel of the Centre International de Rencontres Mathématiques.
In the fifth part of the History of Science Series (Bilim Tarihi Serisi) focused on Set Theory, featuring Assoc. Prof. Dr. Serhan Yarkan and Halil Said Cankurtaran, we talk about the present-day use of Set Theory and its relationship with computation, probability, and topology. After touching on the P vs NP problem in computation, we focus on Kolmogorov's axioms in probability and on the relationship of the theory's foundations with other fields. Finally, the episode closes with a discussion of topology, a very good tool for expressing physical phenomena mathematically, its relationship with the theory, and its present-day areas of use. Enjoy listening.
#73. The Importance and Historical Development of Set Theory (History of Science Series B5: Part I): https://youtu.be/pSksJkWK6wU
#76. The Effects of Set Theory (History of Science Series B6: Part II): https://youtu.be/gtpdAUaCgzw
#77. Set Theory and Computation (History of Science Series B7: Part III): https://youtu.be/TMt_rUbE4M4
#78. The Paradoxes of Set Theory (History of Science Series B8: Part IV) - 17/01/2021: https://youtu.be/qHMdAjr4lQ0
Tapir Lab. GitHub: @TapirLab, https://www.github.com/tapirlab Tapir Lab. Instagram: @tapirlab, https://www.instagram.com/tapirlab/ Tapir Lab. Twitter: @tapirlab, https://www.twitter.com/tapirlab Tapir Lab.: http://www.tapirlab.com
Lena Voita is a Ph.D. student at the University of Edinburgh and University of Amsterdam. Previously, she was a research scientist at Yandex Research and worked closely with the Yandex Translate team. She still teaches NLP at the Yandex School of Data Analysis. She has created an exciting new NLP course on her website lena-voita.github.io which you folks need to check out! She has one of the most well-presented blogs we have ever seen, where she discusses her research in an easily digestible manner. Lena has been investigating many fascinating topics in machine learning and NLP. Today we are going to talk about three of her papers and corresponding blog articles: Source and Target Contributions to NMT Predictions -- where she talks about the influential dichotomy between the source and the prefix of neural translation models. https://arxiv.org/pdf/2010.10907.pdf https://lena-voita.github.io/posts/source_target_contributions_to_nmt.html Information-Theoretic Probing with MDL -- where Lena proposes a technique of evaluating a model using the minimum description length or Kolmogorov complexity of labels given representations rather than something basic like accuracy. https://arxiv.org/pdf/2003.12298.pdf https://lena-voita.github.io/posts/mdl_probes.html Evolution of Representations in the Transformer -- where Lena investigates the evolution of representations of individual tokens in Transformers trained with different training objectives (MT, LM, MLM). https://arxiv.org/abs/1909.01380 https://lena-voita.github.io/posts/emnlp19_evolution.html Panel: Dr. Tim Scarfe, Yannic Kilcher, Sayak Paul
00:00:00 Kenneth Stanley / Greatness can not be planned house keeping 00:21:09 Kilcher intro 00:28:54 Hello Lena 00:29:21 Tim - Lena's NMT paper 00:35:26 Tim - Minimum Description Length / Probe paper 00:40:12 Tim - Evolution of representations 00:46:40 Lena's NLP course 00:49:18 The peppermint tea situation 00:49:28 Main Show Kick Off 00:50:22 Hallucination vs exposure bias 00:53:04 Lena's focus on explaining the models not SOTA chasing 00:56:34 Probes paper and NLP interpretability 01:02:18 Why standard probing doesn't work 01:12:12 Evolutions of representations paper 01:23:53 BERTScore and BERT Rediscovers the Classical NLP Pipeline paper 01:25:10 Is the shifting encoding context because of BERT bidirectionality 01:26:43 Objective defines which information we lose on input 01:27:59 How influential is the dataset? 01:29:42 Where is the community going wrong? 01:31:55 Thoughts on GOFAI/Understanding in NLP? 01:36:38 Lena's NLP course 01:47:40 How to foster better learning / understanding 01:52:17 Lena's toolset and languages 01:54:12 Mathematics is all you need 01:56:03 Programming languages https://lena-voita.github.io/ https://www.linkedin.com/in/elena-voita/ https://scholar.google.com/citations?user=EcN9o7kAAAAJ&hl=ja https://twitter.com/lena_voita
Andrey Nikolaevich Kolmogorov was one of the giants of 20th-century mathematics. I've always found it amazing that the same man was responsible both f... https://www.scottaaronson.com/blog/?p=3376&utm_source=Thinking+About+Things&utm_campaign=cf79e1519a-EMAIL_CAMPAIGN_9_1_2019_1_5_COPY_01&utm_medium=email&utm_term=0_33397823f0-cf79e1519a-412551669 (Shtetl-Optimized)
This week Dr. Tim Scarfe, Alex Stenlake and Yannic Kilcher speak with AGI and AI alignment specialist Connor Leahy, a machine learning engineer from Aleph Alpha and founder of EleutherAI. Connor believes that AI alignment is philosophy with a deadline and that we are on the precipice: the stakes are astronomical. AI is important, and it will go wrong by default. Connor thinks that the singularity or intelligence explosion is near. Connor says that AGI is like climate change but worse: even harder problems, an even shorter deadline and even worse consequences for the future. These problems are hard, and nobody knows what to do about them. 00:00:00 Introduction to AI alignment and AGI fire alarm 00:15:16 Main Show Intro 00:18:38 Different schools of thought on AI safety 00:24:03 What is intelligence? 00:25:48 AI Alignment 00:27:39 Humans don't have a coherent utility function 00:28:13 Newcomb's paradox and advanced decision problems 00:34:01 Incentives and behavioural economics 00:37:19 Prisoner's dilemma 00:40:24 Ayn Rand and game theory in politics and business 00:44:04 Instrumental convergence and orthogonality thesis 00:46:14 Utility functions and the Stop button problem 00:55:24 AI corrigibility - self alignment 00:56:16 Decision theory and stability / wireheading / robust delegation 00:59:30 Stop button problem 01:00:40 Making the world a better place 01:03:43 Is intelligence a search problem? 01:04:39 Mesa optimisation / humans are misaligned AI 01:06:04 Inner vs outer alignment / faulty reward functions 01:07:31 Large corporations are intelligent and have no stop function 01:10:21 Dutch booking / what is rationality / decision theory 01:16:32 Understanding very powerful AIs 01:18:03 Kolmogorov complexity 01:19:52 GPT-3 - is it intelligent, are humans even intelligent? 
01:28:40 Scaling hypothesis 01:29:30 Connor thought DL was dead in 2017 01:37:54 Why is GPT-3 as intelligent as a human 01:44:43 Jeff Hawkins on intelligence as compression and the great lookup table 01:50:28 AI ethics related to AI alignment? 01:53:26 Interpretability 01:56:27 Regulation 01:57:54 Intelligence explosion Discord: https://discord.com/invite/vtRgjbM EleutherAI: https://www.eleuther.ai Twitter: https://twitter.com/npcollapse LinkedIn: https://www.linkedin.com/in/connor-j-leahy/
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.07.26.222117v1?rss=1 Authors: Jain, S., Xiao, X., Bogdan, P., Bruck, J. Abstract: Evolution is a process of change where mutations in the viral RNA are selected based on their fitness for replication and survival. Given that current phylogenetic analysis of SARS-CoV-2 identifies new viral clades after they exhibit evolutionary selections, one wonders whether we can identify the viral selection and predict the emergence of new viral clades? Inspired by the Kolmogorov complexity concept, we propose a generative complexity (algorithmic) framework capable to analyze the viral RNA sequences by mapping the multiscale nucleotide dependencies onto a state machine, where states represent subsequences of nucleotides and state-transition probabilities encode the higher order interactions between these states. We apply computational learning and classification techniques to identify the active state-transitions and use those as features in clade classifiers to decipher the transient mutations (still evolving within a clade) and stable mutations (typical to a clade). As opposed to current analysis tools that rely on the edit distance between sequences and require sequence alignment, our method is computationally local, does not require sequence alignment and is robust to random errors (substitution, insertions and deletions). Relying on the GISAID viral sequence database, we demonstrate that our method can predict clade emergence, potentially aiding with the design of medications and vaccines. Copy rights belong to original authors. Visit the link for more info
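To make the "state machine" idea in the abstract above concrete, here is a minimal sketch in Python. It is not the authors' method: it assumes the simplest possible reading, in which states are overlapping k-mers and transition probabilities are estimated by counting which k-mer follows which. The function name `transition_probabilities` and the default `k` are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence: str, k: int = 3) -> dict[str, dict[str, float]]:
    """Estimate transition probabilities between overlapping k-mers.

    Illustrative assumption: each length-k subsequence is a state, and the
    next state is the k-mer obtained by shifting the window by one nucleotide.
    """
    if k < 1 or len(sequence) <= k:
        raise ValueError("k must be >= 1 and the sequence must be longer than k")

    counts: dict[str, Counter] = defaultdict(Counter)
    # Slide a window over the sequence and count state -> next-state pairs.
    for i in range(len(sequence) - k):
        state = sequence[i:i + k]
        next_state = sequence[i + 1:i + 1 + k]
        counts[state][next_state] += 1

    # Normalize each row so the outgoing probabilities of a state sum to 1.
    return {
        state: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for state, followers in counts.items()
    }


if __name__ == "__main__":
    for state, row in transition_probabilities("ACGTACGGACGT", k=2).items():
        print(state, row)
```

Rows of the resulting table could then serve as features for a classifier, which is the role the abstract assigns to the active state-transitions.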
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.07.22.216242v1?rss=1 Authors: Vitanyi, P. M. B., Cilibrasi, R. L. Abstract: We analyze the phylogeny and taxonomy of the SARS-CoV-2 virus using compression. This is a new alignment-free method called the "normalized compression distance" (NCD) method. It discovers all effective similarities based on Kolmogorov complexity. The latter being incomputable we approximate it by a good compressor such as the modern zpaq. The results comprise that the SARS-CoV-2 virus is closest to the RaTG13 virus and similar to two bat SARS-like coronaviruses bat-SL-CoVZXC21 and bat-SL-CoVZC4. The similarity is quantified and compared with the same quantified similarities among the mtDNA of certain species. We treat the question whether Pangolins are involved in the SARS-CoV-2 virus. Copy rights belong to original authors. Visit the link for more info
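The normalized compression distance mentioned in the abstract above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: `zlib` stands in for the zpaq compressor named in the abstract, and the helpers `random_sequence` and `mutate` are hypothetical utilities that only generate toy data.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_sequence(length: int, seed: int) -> bytes:
    """Toy data: a pseudo-random nucleotide string (hypothetical helper)."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"ACGT") for _ in range(length))


def mutate(sequence: bytes, positions: list[int]) -> bytes:
    """Toy point mutations: replace the base at each position with a different base."""
    out = bytearray(sequence)
    for pos in positions:
        out[pos] = ord("A") if out[pos] != ord("A") else ord("C")
    return bytes(out)


if __name__ == "__main__":
    original = random_sequence(2000, seed=1)
    close_relative = mutate(original, [100, 700, 1500])
    unrelated = random_sequence(2000, seed=2)
    print("related:  ", round(ncd(original, close_relative), 3))
    print("unrelated:", round(ncd(original, unrelated), 3))
```

A value near 0 suggests the two inputs share most of their structure, and a value near 1 suggests they share little. No sequence alignment is needed, which is the property the abstract emphasizes.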
* Show notes: https://hilandofino.net/daily/541-complejidad-de-kolmogorov-e-infografias-de-las-redes-sociales/ * Email list: hilandofino.net/lista * Telegram: https://t.me/hilandofino * Instagram: @sebas_abril_faura If you want to support the podcast, buy through this (affiliate) link: https://amzn.to/33ycGEE
This lecture is aimed at a rather broad audience including undergraduates. It will discuss scaling arguments, what they can achieve (Kolmogorov’s -5/3 law, etc.) and what they miss (fractals, etc.).
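As an illustration of the kind of scaling argument the lecture refers to, the -5/3 law can be recovered by dimensional analysis. The sketch below assumes, as in the standard textbook derivation, that in the inertial range the energy spectrum E(k) depends only on the dissipation rate per unit mass ε and the wavenumber k; the constant C is not fixed by the argument.

```latex
% Assumption: E(k) depends only on \varepsilon and k in the inertial range.
% Dimensions: [E(k)] = L^{3} T^{-2}, [\varepsilon] = L^{2} T^{-3}, [k] = L^{-1}.
E(k) = C\,\varepsilon^{a} k^{b}
\;\Longrightarrow\;
L^{3} T^{-2} = L^{2a - b}\, T^{-3a}
\;\Longrightarrow\;
a = \tfrac{2}{3},\quad b = -\tfrac{5}{3},
\qquad
E(k) = C\,\varepsilon^{2/3}\, k^{-5/3}.
```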
Marcus Hutter is a senior research scientist at DeepMind and professor at Australian National University. Throughout his career of research, including with Jürgen Schmidhuber and Shane Legg, he has proposed a lot of interesting ideas in and around the field of artificial general intelligence, including the development of the AIXI model which is a mathematical approach to AGI that incorporates ideas of Kolmogorov complexity, Solomonoff induction, and reinforcement learning. EPISODE LINKS: Hutter Prize: http://prize.hutter1.net Marcus web: http://www.hutter1.net Books mentioned: – Universal AI: https://amzn.to/2waIAuw – AI: A Modern Approach: https://amzn.to/3camxnY – Reinforcement Learning: https://amzn.to/2PoANj9 – Theory of Knowledge: https://amzn.to/3a6Vp7x This conversation
Modern genome assembly projects are often based on long reads in an attempt to bridge longer repeats. However, due to the higher error rate of the current long read sequencers, assemblers based on de Bruijn graphs do not work well in this setting, and the approaches that do work are slower. In this episode, Mikhail Kolmogorov from Pavel Pevzner's lab joins us to talk about some of the ideas developed in the lab that made it possible to build a de Bruijn-like assembly graph from noisy reads. These ideas are now implemented in the Flye assembler, which performs much faster than the existing long read assemblers without sacrificing the quality of the assembly. Links: Assembly of Long Error-Prone Reads Using Repeat Graphs (Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin, Pavel A. Pevzner). Nature Biotechnology (paywalled), bioRxiv. Flye on GitHub
The coding theorem from algorithmic information theory (AIT) - which should be much more widely taught in Physics! - suggests that many processes in nature may be highly biased towards simple outputs. Here simple means highly compressible, or more formally, outputs with relatively lower Kolmogorov complexity. I will explore applications to biological evolution, where the coding theorem implies an exponential bias towards outcomes with higher symmetry, and to deep learning neural networks, where the coding theorem predicts an Occam's-razor-like bias that may explain why these highly overparameterised systems work so well.
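For readers unfamiliar with the result, the coding theorem in its standard form (due to Levin) can be written as follows. The notation is an assumption of this note, not taken from the talk: U is a universal prefix machine, |p| is the length of program p in bits, m(x) is the probability that a randomly chosen program outputs x, and K(x) is the prefix Kolmogorov complexity of x.

```latex
% Algorithmic probability of an output x (sum over all programs p that produce x):
m(x) = \sum_{p \,:\, U(p) = x} 2^{-|p|}
% Coding theorem: algorithmic probability and complexity agree up to a constant.
-\log_{2} m(x) = K(x) + O(1)
\quad\Longleftrightarrow\quad
m(x) = 2^{-K(x) + O(1)}
```

Read this way, an output whose shortest description is n bits shorter is roughly 2^n times more likely, which is the exponential bias towards simple outputs described above.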
Gudrun talks to Sema Coşkun, who at the time of the conversation in 2018 is a postdoctoral researcher in the financial mathematics group at the University of Kaiserslautern. She constructs models for the behaviour of energy markets. In short, the conversation covers two questions: How are classical markets modelled? In what way are energy markets different, and why do they need new ideas? The seminal work of Black and Scholes (1973) established modern financial theory. In a Black-Scholes setting, it is assumed that the stock price follows a Geometric Brownian Motion with a constant drift and constant volatility. The stochastic differential equation for the stock price process has an explicit solution. Therefore, it is possible to obtain the price of a European call option in a closed-form formula. Nevertheless, the Black-Scholes assumptions have drawbacks. The most criticized aspect is the constant volatility assumption, which is considered an oversimplification. Several improved models have been introduced to overcome those drawbacks. One significant example of such new models is the Heston stochastic volatility model (Heston, 1993). In this model, volatility is indirectly modeled by a separate mean-reverting stochastic process, namely the Cox-Ingersoll-Ross (CIR) process. The CIR process captures the dynamics of the volatility process well. However, it is not easy to obtain option prices in the Heston model since the model has more complicated dynamics compared to the Black-Scholes model. In financial mathematics, one can use several methods to deal with these problems. In general, various stochastic processes are used to model the behavior of financial phenomena. One can then employ purely stochastic approaches by using the tools from stochastic calculus or probabilistic approaches by using the tools from probability theory. On the other hand, it is also possible to use Partial Differential Equations (the PDE approach). 
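The closed-form price of a European call mentioned above can be sketched in Python as follows. This is the standard textbook Black-Scholes formula for a stock that pays no dividends, not code from the conversation; the function and parameter names are chosen here for illustration.

```python
from math import erf, exp, log, sqrt


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def black_scholes_call(spot: float, strike: float, rate: float,
                       volatility: float, maturity: float) -> float:
    """Black-Scholes price of a European call on a non-dividend-paying stock.

    Assumes constant risk-free rate and constant volatility, i.e. exactly the
    simplification that the text describes as the model's main drawback.
    """
    vol_sqrt_t = volatility * sqrt(maturity)
    d1 = (log(spot / strike) + (rate + 0.5 * volatility ** 2) * maturity) / vol_sqrt_t
    d2 = d1 - vol_sqrt_t
    return spot * normal_cdf(d1) - strike * exp(-rate * maturity) * normal_cdf(d2)


if __name__ == "__main__":
    # At-the-money call, 5% rate, 20% volatility, one year to maturity.
    print(round(black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0), 4))
```

No comparably simple formula is available once volatility follows its own stochastic process, which is why the Heston model calls for the numerical methods discussed next.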
The correspondence between the stochastic problem and its related PDE representation is established with the help of the Feynman-Kac theorem. In their original paper, Black and Scholes also transferred the stochastic representation of the problem into its corresponding PDE, the heat equation. After solving the heat equation, they transformed the solution back into the relevant option price. As a third type of method, one can employ numerical methods such as Monte Carlo methods. Monte Carlo methods are especially useful to compute the expected value of a random variable. Roughly speaking, instead of examining the probabilistic evolution of this random variable, we focus on its possible outcomes. One generates random numbers with the same distribution as the random variable and then simulates possible outcomes by using those random numbers. Then we replace the expected value of the random variable by the arithmetic average of the possible outcomes obtained by the Monte Carlo simulation. The idea of Monte Carlo is simple. However, it takes its strength from two essential theorems, namely Kolmogorov’s strong law of large numbers, which ensures convergence of the estimates, and the central limit theorem, which describes the error distribution of our estimates. Electricity markets exhibit certain properties which we do not observe in other markets. Those properties are mainly due to the unique characteristics of the production and consumption of electricity. Most importantly, one cannot physically store electricity. This leads to several differences compared to other financial markets. For example, we observe spikes in electricity prices. Spikes refer to sudden upward or downward jumps which are followed by a fast reversion to the mean level. Therefore, electricity prices show extreme variability compared to other commodities or stocks. 
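The Monte Carlo recipe described above (draw samples, average them, and use the central limit theorem to judge the error) can be sketched in Python as below. The example payoff is an assumption made for illustration: a discounted European call under Geometric Brownian Motion with risk-neutral drift, with parameter values that do not come from the conversation. The names `monte_carlo_mean` and `discounted_call_payoff` are hypothetical.

```python
import math
import random


def monte_carlo_mean(sample, n_samples: int, seed: int = 0) -> tuple[float, float]:
    """Estimate E[X] by the arithmetic average of n_samples draws of X.

    Returns (estimate, standard_error). The strong law of large numbers says
    the estimate converges to E[X]; the central limit theorem says the error
    is approximately normal with the returned standard error.
    """
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_samples):
        value = sample(rng)
        total += value
        total_sq += value * value
    mean = total / n_samples
    # Unbiased sample variance, clipped at zero to guard against rounding.
    variance = max(total_sq / n_samples - mean * mean, 0.0) * n_samples / (n_samples - 1)
    return mean, math.sqrt(variance / n_samples)


def discounted_call_payoff(rng: random.Random, spot: float = 100.0, strike: float = 100.0,
                           rate: float = 0.05, volatility: float = 0.2,
                           maturity: float = 1.0) -> float:
    """One draw of the discounted call payoff under GBM (illustrative parameters)."""
    z = rng.gauss(0.0, 1.0)
    terminal = spot * math.exp((rate - 0.5 * volatility ** 2) * maturity
                               + volatility * math.sqrt(maturity) * z)
    return math.exp(-rate * maturity) * max(terminal - strike, 0.0)


if __name__ == "__main__":
    estimate, stderr = monte_carlo_mean(discounted_call_payoff, 100_000, seed=42)
    print(f"estimate = {estimate:.3f} +/- {1.96 * stderr:.3f} (approx. 95% interval)")
```

Because the standard error shrinks like one over the square root of the number of samples, halving the error requires roughly four times as many simulated outcomes.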
For example, in stock markets we observe a moderate volatility level ranging between 1% and 1.5%, commodities like crude oil or natural gas have relatively high volatilities ranging between 1.5% and 4%, and finally electricity has up to 50% volatility (Weron, 2000). Moreover, electricity prices show strong seasonality which is related to day-to-day and month-to-month variations in electricity consumption. In other words, electricity consumption varies depending on the day of the week and month of the year. Another important property of electricity prices is that they follow a mean-reverting process. Thus, the Ornstein-Uhlenbeck (OU) process, which has a Gaussian distribution, is widely used to model electricity prices. In order to incorporate the spike behavior of electricity prices, a jump or a Levy component is merged into the OU process. These models are known as generalized OU processes (Barndorff-Nielsen & Shephard, 2001; Benth, Kallsen & Meyer-Brandis, 2007). There exist several models to capture those properties of electricity prices. Examples are structural models which are based on the equilibrium of supply and demand (Barlow, 2002), Markov jump diffusion models which combine the OU process with pure jump diffusions (Geman & Roncoroni, 2006), regime-switching models which aim to distinguish the base and spike regimes of electricity prices, and finally multi-factor models which have a deterministic component for seasonality, a mean-reverting process for the base signal and a jump or Levy process for spikes (Meyer-Brandis & Tankov, 2008). The German electricity market is one of the largest in Europe. The energy strategy of Germany follows the objective to phase out the nuclear power plants by 2021 and gradually introduce renewable energy resources. For electricity production, the share of renewable resources will increase up to 80% by 2050. 
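A mean-reverting price with occasional spikes, as described above, can be simulated with a short Python sketch. It is a deliberately simplified stand-in for the cited models, not an implementation of any of them: it assumes an Euler discretization of an OU process, at most one jump per time step with probability equal to intensity times step size, and normally distributed jump sizes. All names and parameter values are hypothetical.

```python
import math
import random


def simulate_ou_with_jumps(x0: float, mean_level: float, reversion_speed: float,
                           volatility: float, jump_intensity: float, jump_scale: float,
                           dt: float, n_steps: int, seed: int = 0) -> list[float]:
    """Euler scheme for dX = speed * (mean - X) dt + vol dW + jumps.

    Jumps are an illustrative assumption: in each step a jump occurs with
    probability jump_intensity * dt, and its size is N(0, jump_scale^2), so
    spikes can point upward or downward and are then pulled back to the mean.
    """
    rng = random.Random(seed)
    x = x0
    path = [x0]
    for _ in range(n_steps):
        drift = reversion_speed * (mean_level - x) * dt
        diffusion = volatility * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        jump = rng.gauss(0.0, jump_scale) if rng.random() < jump_intensity * dt else 0.0
        x = x + drift + diffusion + jump
        path.append(x)
    return path


if __name__ == "__main__":
    # One year of daily steps around a mean level of 50, with a few spikes per year.
    prices = simulate_ou_with_jumps(x0=50.0, mean_level=50.0, reversion_speed=30.0,
                                    volatility=10.0, jump_intensity=5.0, jump_scale=25.0,
                                    dt=1.0 / 365.0, n_steps=365, seed=1)
    print(f"min = {min(prices):.1f}, max = {max(prices):.1f}")
```

A high reversion speed is what makes a jump look like a spike rather than a lasting level shift; a seasonal component, as in the multi-factor models, would be added as a deterministic function of time.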
The introduction of renewable resources also brings some challenges for electricity trading. For example, the forecast errors regarding electricity production might cause high risk for market participants. However, the developed market structure of Germany is designed to reduce this risk as much as possible. There are two main electricity spot price markets where the market participants can trade electricity. The first one is the day-ahead market, in which the trading takes place around noon on the day before the delivery. In this market, the trades are based on auctions. The second one is the intraday market, in which the trading starts at 3pm on the day before the delivery and continues up until 30 minutes before the delivery. The intraday market allows continuous trading of electricity, which helps the market participants to adjust their positions more precisely in the market by reducing the forecast errors. References S. Coskun and R. Korn: Pricing Barrier Options in the Heston Model Using the Heath-Platen estimator. Monte Carlo Methods and Applications. 24 (1) 29-42, 2018. S. Coskun: Application of the Heath–Platen Estimator in Pricing Barrier and Bond Options. PhD thesis, Department of Mathematics, University of Kaiserslautern, Germany, 2017. S. Desmettre and R. Korn: 10 Computationally challenging problems in Finance. FPGA Based Accelerators for Financial Applications, Springer, Heidelberg, 1–32, 2015. F. Black and M. Scholes: The pricing of options and corporate liabilities. The Journal of Political Economy, 81(3):637-654, 1973. S.L. Heston: A closed-form solution for options with stochastic volatility with applications to bond and currency options. The Review of Financial Studies, 6(2):327–343, 1993. R. Korn, E. Korn and G. Kroisandt: Monte Carlo Methods and Models in Finance and Insurance. Chapman & Hall/CRC Financ. Math. Ser., CRC Press, Boca Raton, 2010. P. Glasserman, Monte Carlo Methods in Financial Engineering. 
Stochastic Modelling and Applied Probability, Appl. Math. (New York) 53, Springer, New York, 2004. M.T. Barlow: A diffusion model for electricity prices. Mathematical Finance, 12(4):287-298, 2002. O.E. Barndorff-Nielsen and N. Shephard: Non-Gaussian Ornstein-Uhlenbeck-based models and some of their uses in financial economics. Journal of the Royal Statistical Society B, 63(2):167-241, 2001. H. Geman and A. Roncoroni: Understanding the fine structure of electricity prices. The Journal of Business, 79(3):1225-1261, 2006. T. Meyer-Brandis and P. Tankov: Multi-factor jump-diffusion models of electricity prices. International Journal of Theoretical and Applied Finance, 11(5):503-528, 2008. R. Weron: Energy price risk management. Physica A, 285(1-2):127–134, 2000. Podcasts G. Thäter, M. Hofmanová: Turbulence, conversation in the Modellansatz Podcast, episode 155, Department of Mathematics, Karlsruhe Institute of Technology (KIT), 2018. http://modellansatz.de/turbulence G. Thäter, M. J. Amtenbrink: Wasserstofftankstellen, Gespräch im Modellansatz Podcast, Folge 163, Fakultät für Mathematik, Karlsruher Institut für Technologie (KIT), 2018. http://modellansatz.de/wasserstofftankstellen S. Ajuvo, S. Ritterbusch: Finanzen damalsTM, Gespräch im Modellansatz Podcast, Folge 97, Fakultät für Mathematik, Karlsruher Institut für Technologie (KIT), 2016. http://modellansatz.de/finanzen-damalstm K. Cindric, G. Thäter: Kaufverhalten, Gespräch im Modellansatz Podcast, Folge 45, Fakultät für Mathematik, Karlsruher Institut für Technologie (KIT), 2015. http://modellansatz.de/kaufverhalten V. Riess, G. Thäter: Gasspeicher, Gespräch im Modellansatz Podcast, Folge 23, Fakultät für Mathematik, Karlsruher Institut für Technologie (KIT), 2015. http://modellansatz.de/gasspeicher F. Schueth, T. Pritlove: Energieforschung, Episode 12 im Forschergeist Podcast, Stifterverband/Metaebene, 2015. https://forschergeist.de/podcast/fg012-energieforschung/
A good scientist, in other words, does not merely ignore conventional wisdom, but makes a special effort to break it. Scientists go looking for trouble. — Paul Graham, What You Can’t Say I. Staying on the subject of Dark Age myths: what about all those scientists burned at the stake for their discoveries? Historical consensus declares this a myth invented by New Atheists. The Church was a great patron of science, no one believed in a flat earth, Galileo had it coming, et cetera. Unam Sanctam Catholicam presents some of these stories and explains why they’re less of a science-vs-religion slam dunk than generally supposed. Among my favorites:
Lecture series organised by the Bibliothèque nationale de France and the Société mathématique de France. Lecture of 15 April 2015. The lectures are aimed at the general public, secondary school teachers, and high school and university students. The speakers start from a text, or a body of texts, and show how it influenced them personally and led to contemporary research.
SynTalk thinks about information, while constantly wondering about its physical nature and computability. Is there information in the universe irrespective of human beings or life? Does all the meaning come from a protocol, and what if there is no shared language? Does a protocol or a context need to pre exist? The concepts are derived off / from Laplace, Carnot, Boltzmann, Shannon, Ronald Fisher, Kolmogorov, T S Eliot, Warren Weaver, & Nørretranders, among others. We retrace the journey of the notion of information within (say) thermodynamics, electrical engineering, neurolinguistics, mathematics, and computational systems, & notice how the core departure was to think of it as measurable? Does the universe speak in one language? Does ignorance go down when information is received, and is ignorance analogous to disorder? Is entropy an anthropomorphic principle, as it assumes an underlying notion of order? How, in language, the norm (order) can be identified directly from a close study of the deviation from the norm (disorder). How the brain or any system may learn how to learn and negotiate meaning via ‘bootstrapping’? Is the nature of ‘input’ processing different from information processing as the neural networks are formed in a child’s brain? What makes data information for the receiver? Why does an internal combustion engine ‘have’ to dump out the disorder via the exhaust to direct order to the wheels? Can one think of information content as an objective ‘event cone’ with past and future imprinted in it? Is all time eternally present? Is there a fundamental unit (say, bit or qubit) of information, & is it discrete or continuous or both? How & why are the first and second language signals stored differently in the brain? What is the role played by shared context (exformation) and commonality in communication? Are there different mathematical theories of communication, information content and complexity? 
The links between wax, steam engines, Voyager, heads or tails, ‘motive power of fire’, critical period hypothesis, It from Bit, falling stones, the case of Chelsea’s misdiagnosis, Four Quartets, ‘I do’, heat death, & Schrödinger’s cat. Can we forget something if we explicitly want to? Does nature forget (information)? Will we drown in the crazy amount of information in the future, or will we develop new tools to handle complexity? Do we need to understand human mind & cognition better? Can we communicate with animals and (maybe) aliens in the future? ‘If a lion could speak, we could not understand him’. The SynTalkrs are: Prof. Vaishna Narang (biolinguistics, JNU, New Delhi), Prof. Rajaram Nityananda (astrophysics, Azim Premji University, ex NCRA-TIFR, Bangalore), & Prof. R. Ramanujam (computer science, IMSc, Chennai).
This lecture continues our conversation on Martingales and covers stopped martingales, Kolmogorov submartingale inequality, martingale convergence theorem, and more.
Given by Professor Yuri Manin, Professor Emeritus, Max Planck Institute for Mathematics, Bonn, Germany; Professor Emeritus, Northwestern University, Evanston, USA; Principal Researcher, Steklov Mathematical Institute, Academy of Sciences, Moscow, Russia. In the 1930s, George Kingsley Zipf discovered an empirical statistical law that later proved to be remarkably universal. Consider a corpus of texts in a given language, and list all the words that occur in them together with their numbers of occurrences. Arrange the words in order of decreasing frequency. Define the Zipf rank of a word as its position in this ordering. Then Zipf's Law says: "Frequency is inversely proportional to the rank". Zipf himself suggested that this law must follow from the principle of 'minimisation of effort' by the brain. However, the nature of this effort and its measure remained mysterious. In my lecture, I will argue that Zipf's effort needed to produce a word (say, name of the number) must be measured by the celebrated Kolmogorov complexity: the length of the shortest Turing program (input) needed to produce this word/name/combinatorial object/etc. as its output. I will describe basic properties of the complexity (some of them rather counterintuitive) and one more situation from the theory of error-correcting codes, where Kolmogorov complexity again plays the role of 'energy in the world of ideas'.
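As a small illustration of the rank-frequency construction described above, here is a minimal Python sketch. The function name `zipf_table` and the toy corpus are assumptions made for this example; they are not taken from the lecture. The corpus is built artificially so that the word counts are exactly proportional to 1/rank, which means the product rank × count is constant.

```python
from collections import Counter


def zipf_table(text):
    """Return (rank, word, count, rank * count) rows, most frequent word first."""
    counts = Counter(text.lower().split())
    # Sort by decreasing count; break ties alphabetically so the order is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, n, rank * n)
            for rank, (word, n) in enumerate(ordered, start=1)]


# Hypothetical toy corpus: counts 12, 6, 4, 3 are exactly 12 / rank.
toy_corpus = " ".join(["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3)
rows = zipf_table(toy_corpus)

for rank, word, count, product in rows:
    print(rank, word, count, product)
```

On real text the product rank × count is only roughly constant, and usually only over part of the rank range, so a log-log plot of count against rank is the more common check.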
Tom Sterkenburg (Amsterdam/Groningen) gives a talk at the MCMP Colloquium (15 January, 2015) titled "Occam's Razor in Algorithmic Information Theory". Abstract: Algorithmic information theory, also known as Kolmogorov complexity, is sometimes believed to offer us a general and objective measure of simplicity. The first variant of this simplicity measure to appear in the literature was in fact part of a theory of prediction: the central achievement of its originator, R.J. Solomonoff, was the definition of an idealized method of prediction that is taken to implement Occam's razor in giving greater probability to simpler hypotheses about the future. Moreover, in many writings on the subject an argument of the following sort takes shape. From (1) the definition of the Solomonoff predictor which has a precise preference for simplicity, and (2) a formal proof that this predictor will generally lead us to the truth, it follows that (Occam's razor) a preference for simplicity will generally lead us to the truth. Thus, sensationally, this is an argument to justify Occam's razor. In this talk, I show why the argument fails. The key to its dissolution is a representation theorem that links Kolmogorov complexity to Bayesian prediction.
Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU - Teil 01/02
This thesis is concerned with the generalisation of Bayesian inference towards the use of imprecise or interval probability, with a focus on model behaviour in case of prior-data conflict. Bayesian inference is one of the main approaches to statistical inference. It requires expressing (subjective) knowledge on the parameter(s) of interest not incorporated in the data by a so-called prior distribution. All inferences are then based on the so-called posterior distribution, the subsumption of prior knowledge and the information in the data calculated via Bayes' Rule. The adequate choice of priors has always been a matter of intense debate in the Bayesian literature. While a considerable part of the literature is concerned with so-called non-informative priors aiming to eliminate (or, at least, to standardise) the influence of priors on posterior inferences, inclusion of specific prior information into the model may be necessary if data are scarce, or do not contain much information about the parameter(s) of interest; also, shrinkage estimators, common in frequentist approaches, can be considered as Bayesian estimators based on informative priors. When substantial information is used to elicit the prior distribution through, e.g., an expert's assessment, and the sample size is not large enough to eliminate the influence of the prior, prior-data conflict can occur, i.e., information from outlier-free data suggests parameter values which are surprising from the viewpoint of prior information, and it may not be clear whether the prior specifications or the integrity of the data collecting method (the measurement procedure could, e.g., be systematically biased) should be questioned. In any case, such a conflict should be reflected in the posterior, leading to very cautious inferences, and most statisticians would thus expect to observe, e.g., wider credibility intervals for parameters in case of prior-data conflict. 
However, at least when modelling is based on conjugate priors, prior-data conflict is in most cases completely averaged out, giving a false certainty in posterior inferences. Here, imprecise or interval probability methods offer sound strategies to counter this issue, by mapping parameter uncertainty over sets of priors or posteriors instead of over single distributions. This approach is supported by recent research in economics, risk analysis and artificial intelligence, corroborating the multi-dimensional nature of uncertainty and concluding that standard probability theory as founded on Kolmogorov's or de Finetti's framework may be too restrictive, being appropriate only for describing one dimension, namely ideal stochastic phenomena. The thesis studies how to efficiently describe sets of priors in the setting of samples from an exponential family. Models are developed that offer enough flexibility to express a wide range of (partial) prior information, give reasonably cautious inferences in case of prior-data conflict while resulting in more precise inferences when prior and data agree well, and still remain easily tractable in order to be useful for statistical practice. Applications in various areas, e.g. common-cause failure modeling and Bayesian linear regression, are explored, and the developed approach is compared to other imprecise probability models.
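The point about conjugate priors averaging out prior-data conflict can be seen in the simplest textbook case. The sketch below is an illustration only, assuming a normal prior on a mean with known noise variance; the function name and all numbers are hypothetical and are not taken from the thesis. It shows that the posterior variance does not depend on the observed sample mean, so the posterior is equally narrow whether the data agree with the prior or contradict it.

```python
def normal_posterior(prior_mean, prior_var, data_mean, n, noise_var):
    """Conjugate update for a normal mean with known noise variance.

    Returns (posterior_mean, posterior_variance).
    """
    prior_precision = 1.0 / prior_var
    data_precision = n / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * data_mean)
    return post_mean, post_var


# Hypothetical numbers: prior N(0, 1), four observations with unit noise variance.
agree = normal_posterior(0.0, 1.0, data_mean=0.1, n=4, noise_var=1.0)
conflict = normal_posterior(0.0, 1.0, data_mean=10.0, n=4, noise_var=1.0)

print("data agree with prior:   mean=%.2f var=%.2f" % agree)
print("data conflict with prior: mean=%.2f var=%.2f" % conflict)
```

In the conflict case the sample mean lies ten prior standard deviations from the prior mean, yet the posterior variance is the same as in the agreeing case. A set of priors, as studied in the thesis, is one way to let such a conflict widen the resulting inferences.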
Mathematics and Physics of Anderson Localization: 50 Years After
Shepelyansky, D (Université Paul Sabatier Toulouse III) Wednesday 19 September 2012, 11:50-12:30
In the 18th century, Laplace believed he had proved the stability of the Solar System. In 1900, Poincaré discovered that it might instead be chaotic. Half a century later, Kolmogorov showed that it had a chance of being stable… What do we know about this question today? Lecture by François Béguin, maître de conférences in the mathematics department at Orsay, Université Paris-Sud-XI, and researcher in the Department of Mathematics and Applications at the École normale supérieure (ENS), Paris.
Fakultät für Physik - Digitale Hochschulschriften der LMU - Teil 02/05
The aim of this thesis is to investigate magnetic fields in the intergalactic gas of galaxy clusters using Faraday rotation maps of extragalactic radio sources located in or behind a galaxy cluster. Faraday rotation arises when linearly polarised radiation from such a source propagates through a magnetised medium, which rotates its plane of polarisation. Multifrequency observations allow the construction of Faraday rotation maps. The statistical characterisation and analysis of these maps make it possible to determine properties of the magnetic fields associated with the plasma in galaxy clusters. The thesis investigates whether there is evidence that the Faraday rotation is produced in material close to the source or in the magnetised plasma of the galaxy clusters. To this end, two statistical measures for characterising the data were introduced. Both measures are also valuable indicators of possible problems in calculating magnetic field properties from Faraday rotation measurements. The measures were applied to Faraday rotation measurements of extended radio sources. No indications of a source-local origin of the Faraday rotation were found. On the basis of independent evidence, it was concluded that the magnetic fields causing the Faraday rotation should be associated with the plasma in galaxy clusters. A statistical analysis of Faraday rotation measurements by means of autocorrelation functions and, equivalently, energy spectra was developed to determine magnetic field strengths and correlation lengths. This analysis relies on the assumption that the magnetic fields are statistically isotropically distributed in the Faraday rotation region. It uses a so-called window function, which describes the sampling volume in which magnetic fields are detectable. The Faraday rotation maps of three extended radio sources (i.e. 3C75 in Abell 400, 3C465 in Abell 2634 and Hydra A in Abell 780) were re-analysed with this method, and magnetic field strengths of 1 to 10 µG were derived for these three galaxy clusters. Measuring magnetic energy spectra requires Faraday rotation maps of the highest quality. To avoid artefacts from data reduction, a new algorithm, Pacman, was developed for computing Faraday rotation maps. Various statistical tests show that this algorithm is stable and computes reliable Faraday rotation values. For the accurate measurement of magnetic energy spectra from the Pacman maps, a maximum likelihood estimator based on the previously introduced theory was developed. This new method makes it possible, for the first time, to state the statistical uncertainty of the result. Furthermore, the method takes the limited sampling volume into account and enables the reliable determination of energy spectra. This maximum likelihood method was applied to Pacman Faraday rotation maps of Hydra A. Taking into account the uncertainty about the exact sampling geometry of the Faraday region, one obtains a magnetic field strength of 7 ± 2 µG. The computed energy spectrum follows a Kolmogorov-like energy spectrum over at least one order of magnitude. The magnetic energy is concentrated on a dominant scale of about 3 kpc.
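To make the idea of building a Faraday rotation map from multifrequency observations more concrete: the standard relation is that the polarisation angle χ depends linearly on the squared wavelength, χ(λ) = χ₀ + RM·λ², where RM is the rotation measure. The sketch below fits RM for a single map pixel by ordinary least squares. It is not the Pacman algorithm from the thesis. The function name, wavelengths, and values are hypothetical, and the angles are assumed to be noise-free and already free of the nπ ambiguity that real data have.

```python
def fit_rotation_measure(wavelengths_m, angles_rad):
    """Least-squares fit of chi = chi0 + RM * lambda^2 for one pixel.

    Returns (RM in rad/m^2, chi0 in rad). Assumes unwrapped angles.
    """
    x = [lam ** 2 for lam in wavelengths_m]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(angles_rad) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, angles_rad))
    rm = sxy / sxx
    chi0 = mean_y - rm * mean_x
    return rm, chi0


# Hypothetical observing wavelengths (metres) and a synthetic pixel.
wavelengths = [0.02, 0.035, 0.05, 0.06]
true_rm, true_chi0 = 50.0, 0.3
angles = [true_chi0 + true_rm * lam ** 2 for lam in wavelengths]

rm, chi0 = fit_rotation_measure(wavelengths, angles)
print("RM = %.3f rad/m^2, chi0 = %.3f rad" % (rm, chi0))
```

Repeating such a fit for every pixel gives a rotation measure map. With real data, measured angles are only known modulo π, which is one source of the data-reduction artefacts mentioned above.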
Professor Florencio González Asenjo, professor of mathematics at the University of Pittsburgh, gives the third lecture in the series entitled “Cuestiones de metamatemática” (“Questions of Metamathematics”). In it, he discussed the mechanisation of mathematics (Leibniz's celebrated dream) and Church's theorem, and described some examples of decidable and undecidable systems, as well as Turing machines and the algorithms of Markov and Kolmogorov.